Bi ostat for pharmacy.ppt2
BIOSTATISTICS
School of Pharmacy (COMH 607)
1
1. RESEARCH METHODS
2
1.1. Introduction to Research
What is Research?
• A scientific study to seek hidden knowledge
• A scientific study to answer a question
• A scientific study of causes and effects
• A scientific attempt towards new discoveries
• A systematic method of inquiry
• A logical attempt to find answers to problems
• A systematic approach to a (medical) problem
3
Statistical Concept of Research
• Research is the systematic collection, analysis and interpretation of data in order to answer a research question.
• It is classified as:
– Basic research: necessary to generate new knowledge and technologies.
– Applied research: necessary to identify priority problems and to design and evaluate policies and programs for optimal health care and delivery.
4
1.2. Types of Epidemiological Design
A. Descriptive studies
• Mainly concerned with the distribution of diseases with respect to time, place and person.
• Useful for health managers to allocate resources and to plan effective prevention programmes.
• Useful to generate epidemiological hypotheses, an important first step in the search for disease determinants or risk factors.
• Can use routinely collected information, which is readily available in many places. So descriptive studies are generally less expensive and less time-consuming than analytic studies.
5
• It is the most common type of epidemiological design strategy in medical literature.
• There are three main types:
– Correlational
– Case report or case series
– Cross-sectional
6
A.1. Correlational or Ecological
• Uses data from entire populations to compare disease frequencies:
– between different groups during the same period of time, or
– in the same population at different points in time.
• Does not provide individual data; rather, it presents the average exposure level in the community.
• Cause cannot be ascertained.
• The correlation coefficient is the measure of association in correlational studies. It is important to note that a positive correlation does not necessarily imply a valid statistical association.
7
E.g.
• Hypertension rates and average per capita salt consumption compared between two communities.
• Average per capita fat consumption and breast cancer rates compared between two communities.
• Comparing incidence of dental caries in relation to fluoride content of the water among towns in the Rift Valley.
• Mortality from CHD in relation to per capita cigarette sales among the regions of Ethiopia.
8
• Strength: Can be done quickly and inexpensively, often using available data.
• Limitations:
– Inability to link exposure with disease in individuals.
– Lack of ability to control for the effects of potential confounding factors. There may be other things that are the true cause.
– It may mask a non-linear relationship between exposure and disease. For example, alcohol consumption and mortality from CHD have a non-linear relationship (the curve is “J” shaped).
9
A.2. Case Report and Case Series
• Describes the experience of a single patient or a group of patients with a similar diagnosis. Has limited value, but is occasionally revolutionary.
• E.g. 5 young homosexual men with PCP (Pneumocystis carinii pneumonia) seen between Oct. 1980 and May 1981 in Los Angeles aroused concern among physicians. Later, with further follow-up and thorough investigation of the strange occurrence of the disease, the diagnosis of AIDS was established for the first time.
10
• Strength: – very useful for hypothesis generation.
• Limitations:
– The report is based on a single patient or a few patients, so the findings could have happened just by coincidence.
– Lack of an appropriate comparison group
11
A.3. Cross-Sectional Studies (Survey)
• Information about the status of an individual with respect to the presence or absence of exposure and disease is assessed at the same point in time. Easy to do; many surveys are like this.
• For factors that remain unaltered over time, such as sex, race or blood group, the cross-sectional survey can provide evidence of a valid statistical association.
• Useful for raising the question of the presence of an association rather than for testing a hypothesis.
12
B. ANALYTIC STUDIES
• Focus on the determinants of a disease by testing the hypotheses formulated from descriptive studies, with the ultimate goal of judging whether a particular exposure causes or prevents disease.
• Broadly classified into two:
– observational and interventional studies.
– Both types use “controls”. The use of controls is the main distinguishing feature of analytic studies.
13
B.1. Observational studies
• Information is obtained by observation of events. No intervention is done. Cohort and case-control studies are in this category.
i. Cohort
• Subjects are selected by exposure, or determinants of interest, and followed to see if they develop the disease or outcome of interest.
• E.g. Follow 100 children who received BCG vaccination and another 100 who did not, and see how many of them get tuberculosis.
14
ii. Case Control
• Subjects are selected with respect to presence or absence of disease, or outcome of interest, and then inquiries are made about past exposure to the factor(s) of interest.
• E.g. Take people with and without TB, and ask them if they ever had BCG vaccination.
15
B.2. Interventional / Experimental
• The researcher does something about the disease or exposure and observes the changes.
• The investigator has control over who gets the exposure and who does not. The key is that the investigator assigns subjects to either group, whether it is done randomly or not.
• Always prospective.
• E.g. Assign children randomly to get chloroquine or not, and see how many develop symptomatic malaria.
16
Description of common terms
Statistics: The process of scientifically collecting, organizing, summarizing and interpreting data, and the drawing of inferences about a body of data when only part of the data are observed.
Biostatistics: A branch of statistics in which the data being analyzed are derived from the biological and medical sciences.
Descriptive statistics: A statistical method that is concerned with the collection, organization, summarization, and analysis of data from a sample of a population.
Inferential statistics: A statistical method that is concerned with the drawing of inferences/conclusions about a particular population by selecting and measuring a random sample from the population.
17
Population: The largest collection of entities, or values of a random variable, in which we have an interest at a particular time. A population can be finite or infinite. For example, we can take all the students in a given class (e.g. 100 students) as a population.
• Target population: A collection of items that have something in common and about which we wish to draw conclusions at a particular time.
• Study population: The specific population from which data are collected.
18
Sample: Some part/subset of the population of interest. In the above example, if we randomly select 25 students from the 100, we call the 25 selected students a sample of the class.
Hence, generalizability is a two-stage procedure: we want to generalize from the sample to the study population, and then from the study population to the target population.
19
20
E.g.: In a study of the prevalence of HIV among orphan children in Ethiopia, a random sample of orphan children in Lideta Kifle Ketema was included.
Target population: All orphan children in Ethiopia
Study population: All orphan children in Addis Ababa
Sample: Orphan children sampled in Lideta Kifle Ketema
Statistical inference: It is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample that has been drawn from that population.
Parameter: A descriptive measure computed from the data of a population; a numerical expression of population measurements, e.g. population mean (µ), population variance, population standard deviation, etc.
Statistic: A descriptive measure computed from the data of a sample.
Statistical data: Information that is systematically collected, tabulated and analyzed, and whose results are interpreted to draw conclusions.
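The distinction between a parameter and a statistic can be illustrated with a small sketch. The values below are made up for illustration (hypothetical ages of the 100 students in the class example); only the idea, a mean computed on the whole population versus a mean computed on a random sample of 25, comes from the slides.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed only so the sketch is repeatable

# Hypothetical population: ages of the 100 students in the class (made-up values).
population = [rng.randint(18, 30) for _ in range(100)]

# Parameter: a descriptive measure computed from the whole population.
mu = statistics.mean(population)

# Statistic: the same measure computed from a random sample of 25 students.
sample = rng.sample(population, k=25)
x_bar = statistics.mean(sample)

print("population mean (parameter):", mu)
print("sample mean (statistic):", x_bar)
```

The sample mean will usually be close to, but not exactly equal to, the population mean; that gap is what statistical inference has to account for.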
21
• Data: an aggregate of the values of variables obtained by measurement or counting.
• Variable: A characteristic that takes on different values in different persons, places, or things.
– Dependent variable (response): the variable(s) we measure as an outcome of interest
– Independent variable (predictor): the variable(s) that determine the outcome
22
Categorical variable: The notion of magnitude is absent or implicit.
– Nominal: has distinct levels that have no inherent ordering.
– With only two categories, it is called binary or dichotomous. E.g. sex: male or female
– With more than two categories, it is called polytomous. E.g. color
– Ordinal: has levels that follow a distinct ordering. E.g. severity of pain (mild, moderate, severe)
23
Quantitative (numeric) variable: A variable that has magnitude.
• Discrete data: numbers represent actual measurable quantities rather than mere labels. Discrete data are restricted to taking only specified values, often integers or counts, that differ by fixed amounts.
– e.g. number of new AIDS cases reported during a one-year period, number of beds available in a particular hospital
• Continuous data: represent measurable quantities but are not restricted to taking on certain specific values, i.e. fractional values are possible. They can be measured on an interval scale (no true zero value) or a ratio scale (begins at a true zero).
– e.g. weight, cholesterol level, time, temperature
24
1.3. Sampling Methods
Sampling
• The process of selecting a portion of the population to represent the entire population.
• A main concern in sampling:
– Ensure that the sample represents the population, and
– The findings can be generalized.
25
Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of collecting information.
• Reduced cost: Sampling reduces demands on resources such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to more accurate data collection, since effort can be concentrated on fewer units.
• Sampling error: Precise allowance can be made for sampling error
• Greater speed: Data can be collected and summarized more quickly
26
Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of discrimination within the population.
• Sampling may be inadvisable where every unit in the population is legally required to have a record.
Errors in sampling
1) Sampling error: Errors introduced due to selection of a sample.
– They cannot be avoided or totally eliminated.
2) Non-sampling error:
– Observational error
– Respondent error
– Lack of preciseness of definition
– Errors in editing and tabulation of data
27
Divisions of Sampling Methods
Two broad divisions:
A. Probability sampling methods
B. Non-probability sampling methods
28
1.3.1. Probability sampling
• Involves random selection of a sample
• A sample is obtained in a way that ensures every member of the population has a known, non-zero probability of being included in the sample.
• Involves the selection of a sample from a population, based on chance.
29
• Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability sampling.
• However, because study samples are randomly selected and their probability of inclusion can be calculated,
– reliable estimates can be produced and
– inferences can be made about the population.
30
• There are several different ways in which a probability sample can be selected.
• The method chosen depends on a number of factors, such as
– the available sampling frame,
– how spread out the population is,
– how costly it is to survey members of the population
31
Most common probability sampling methods
1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
32
1. Simple random sampling (SRS)
• Involves random selection
• Each member of a population has an equal chance of being included in the sample.
• To use an SRS method:
– Make a numbered list of all the units in the population
– Each unit should be numbered from 1 to N (where N is the size of the population)
– Select the required number of units at random.
33
• The randomness of the sample is ensured by:
– use of “lottery” methods
– a table of random numbers
– using computer programs
• Example
• Suppose your school has 500 students and you need to conduct a short survey on the quality of the food served in the cafeteria.
• You decide that a sample of 10 students should be sufficient for your purposes.
• In order to get your sample, you assign a number from 1 to 500 to each student in your school.
34
• To select the sample, you use a table of randomly generated numbers.
• Pick a starting point in the table (a row and column number) and look at the random numbers that appear there. In this case, since the data run into three digits, the random numbers would need to contain three digits as well.
• Ignore all random numbers greater than 500 (and 000) because they do not correspond to any of the students in the school.
• Remember that the sample is drawn without replacement, so if a number recurs, skip over it and use the next random number.
• The first 10 different numbers between 001 and 500 make up your sample.
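The cafeteria example above can also be carried out with a computer program instead of a random number table. The following is a minimal sketch using Python's standard `random` module; the variable names and the fixed seed are illustrative assumptions, not part of the slides.

```python
import random

# Sampling frame: each of the 500 students is assigned a number from 1 to 500.
population = list(range(1, 501))

rng = random.Random(2024)  # fixed seed only so the sketch is repeatable

# random.sample draws without replacement, so no student can be chosen twice,
# which mirrors "skip a number if it recurs" in the table-based procedure.
sample = rng.sample(population, k=10)

print(sorted(sample))
```

Because every subset of 10 students is equally likely to be drawn, each student has the same chance (10/500) of being included.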
35
• SRS has certain limitations:
– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be selected.
36
2. Systematic random sampling
• Sometimes called interval sampling, systematic sampling means that there is a gap, or interval, between each selected unit in the sample.
• The selection is systematic rather than random.
– Individuals are chosen at regular intervals from the sampling frame. Ideally we randomly select a number to tell us where to start selecting individuals from the list.
• Important if the reference population is arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students’ registration books
– Taking individuals at fixed intervals (every kth) based on the sampling fraction, e.g. if the sample includes 20%, then every fifth.
37
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the total population size).
2. Determine the sampling interval (K) by dividing the number of units in the population by the desired sample size.
38
Steps…
In a household survey, it is important to work out how many houses must be visited in order to find one study unit, usually by doing a pilot study.
• Example: Assume you are doing a study involving children under 5. There are 1500 households in all, and you have a required sample size of 100 children. From a preliminary study you have done, there is one child under 5 per 2.5 households. If there were a child in every household, you would visit 100 households. But because not every household includes a child, you will need to visit 100 x 2.5 = 250 households to find the required 100 children.
• The sampling interval will therefore be 1500/250 = 6, i.e. every 6th household.
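The arithmetic in this household example can be written out as a short sketch. The figures are the ones given above; the variable names are illustrative assumptions.

```python
import math

total_households = 1500      # households in the study area
required_children = 100      # required sample size (children under 5)
households_per_child = 2.5   # from the pilot study: one child per 2.5 households

# Households that must be visited to find the required number of children.
households_to_visit = math.ceil(required_children * households_per_child)

# Sampling interval K: visit every K-th household.
sampling_interval = total_households // households_to_visit

print(households_to_visit)  # 250
print(sampling_interval)    # 6
```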
39
3. Select a number between one and K at random. This number is called the random start and is the first unit included in your sample.
4. Select every Kth unit after that first number.
Note: Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame.
40
Example
To select a sample of 100 from a population of 400, you would need a sampling interval of 400 ÷ 100 = 4. Therefore, K = 4.
You will need to select one unit out of every four units to end up with a total of 100 units in your sample.
Select a number between 1 and 4 from a table of random numbers.
• If you choose 3, the third unit on your frame would be the first unit included in your sample;
• The sample might consist of the following units to make up a sample of 100: 3 (the random start), 7, 11, 15, 19, ..., 395, 399 (up to N, which is 400 in this case).
41
The main difference from SRS: with SRS, any combination of 100 units would have a chance of making up the sample, while with systematic sampling (K = 4) there are only four possible samples.
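The four steps of systematic sampling can be sketched as follows. This is a minimal illustration assuming the frame is simply numbered 1 to N; the function name `systematic_sample` and its parameters are hypothetical.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return n unit numbers chosen from a frame numbered 1..N at a fixed interval."""
    k = N // n                      # step 2: sampling interval K
    if start is None:
        start = rng.randint(1, k)   # step 3: random start between 1 and K (inclusive)
    # step 4: take every K-th unit after the random start
    return [start + i * k for i in range(n)]

# The worked example: N = 400, n = 100, random start 3.
sample = systematic_sample(400, 100, start=3)
print(sample[:5], "...", sample[-2:])  # [3, 7, 11, 15, 19] ... [395, 399]
```

With K = 4 the only possible random starts are 1, 2, 3 and 4, which is why only four distinct samples can arise.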
42
Advantages
• Systematic sampling is usually less time-consuming and easier to perform than SRS.
• It provides a good approximation to SRS (i.e. it has high precision).
• Unlike SRS, systematic sampling can be conducted without a sampling frame. So, systematic random sampling is useful when a sampling frame is not readily available.
– E.g. in patients attending a health center, where it is not possible to predict in advance who will be attending
43
Disadvantage
• If there is any sort of cyclic pattern in the ordering of the subjects which coincides with the sampling interval, the sample will not be representative of the population.
– May result in systematic error
44
3. Stratified random sampling
• It is done when the population is known to have heterogeneity with regard to some factors and those factors are used for stratification
• Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata.
– A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.).
• A separate sample is taken independently from each stratum.
• Any of the sampling methods mentioned in this section (and others that exist) can be used to sample within each stratum.
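The idea of dividing the population into strata and sampling each one independently can be sketched as below. The data are hypothetical (100 people labelled by sex), simple random sampling is used within each stratum, and the function name `stratified_sample` is an assumption for illustration.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Group units into strata, then draw a simple random sample within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # A separate sample is taken independently from each stratum.
    return {name: rng.sample(members, k=n_per_stratum[name])
            for name, members in strata.items()}

# Hypothetical frame: (id, sex) pairs, stratified by sex.
units = [(i, "F" if i % 2 == 0 else "M") for i in range(1, 101)]
rng = random.Random(5)  # fixed seed only so the sketch is repeatable
sample = stratified_sample(units, lambda u: u[1], {"F": 5, "M": 5}, rng)

for name, members in sample.items():
    print(name, members)
```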
45
Why do we need to create strata?
• It can make the sampling strategy more efficient.
• A larger sample is required to get a more accurate estimate if a characteristic varies greatly from one unit to the other.
• For example, if every person in a population had the same salary, then a sample of one individual would be enough to get a precise estimate of the average salary.
• This is the idea behind the efficiency gain obtained with stratification.
– If you create strata within which units share similar characteristics (e.g., income) and are considerably different from units in other strata (e.g., occupation, type of dwelling), then you would only need a small sample from each stratum to get a precise estimate of total income for that stratum.
46
– Then you could combine these estimates to get a precise estimate of total income for the whole population.
• If you use a SRS approach in the whole population without stratification, the sample would need to be larger than the total of all stratum samples to get an estimate with the same level of precision.
47
• Stratified sampling ensures an adequate sample size for sub-groups in the population of interest.
• When a population is stratified, each stratum becomes an independent population and you will need to decide the sample size for each stratum.
48
• Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation: $n_j = \frac{n}{N} N_j$, j = 1, 2, ..., k, where k is the number of strata and
– $n_j$ is the sample size of the jth stratum
– $N_j$ is the population size of the jth stratum
– $n = n_1 + n_2 + ... + n_k$ is the total sample size
– $N = N_1 + N_2 + ... + N_k$ is the total population size
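As a worked illustration of proportionate allocation, the sketch below applies the rule "stratum sample size = (n / N) x stratum population size" to made-up stratum sizes. The stratum names and counts are hypothetical, and the function name `proportional_allocation` is an assumption.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n across strata in proportion to their sizes."""
    N = sum(stratum_sizes.values())           # total population size
    return {name: n * N_j / N for name, N_j in stratum_sizes.items()}

# Hypothetical strata: students by year of study (N = 500 in total).
strata = {"year_1": 200, "year_2": 150, "year_3": 100, "year_4": 50}
allocation = proportional_allocation(strata, n=100)

print(allocation)
# {'year_1': 40.0, 'year_2': 30.0, 'year_3': 20.0, 'year_4': 10.0}
```

In practice the results are rarely whole numbers, so they have to be rounded while keeping the total equal to n.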
49
4. Cluster sampling
• Sometimes it is too expensive to spread a sample across the population as a whole.
• Travel costs can become expensive if interviewers have to survey people from one end of the country to the other.
• To reduce costs, researchers may choose a cluster sampling technique
• Ideally the clusters are similar to one another (homogeneous between clusters), unlike stratified sampling, where the strata differ from one another (heterogeneous between strata).
50
Steps in cluster sampling
• Cluster sampling divides the population into groups or clusters.
• A number of clusters are selected randomly to represent the total population, and then all units within selected clusters are included in the sample.
• No units from non-selected clusters are included in the sample—they are represented by those from selected clusters.
• This differs from stratified sampling, where some units are selected from each group.
51
Example
• In a school based study, we assume students of the same school are homogeneous.
• We can randomly select sections and include all students of the selected sections only.
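The school example can be sketched as a one-stage cluster sample: sections are the clusters, a few sections are chosen at random, and every student in the chosen sections is included. The section names, their sizes and the number of sections drawn are hypothetical.

```python
import random

# Hypothetical school: 8 sections (clusters) of 30 students each.
sections = {
    f"section_{i}": [f"student_{i}_{j}" for j in range(1, 31)]
    for i in range(1, 9)
}

rng = random.Random(1)  # fixed seed only so the sketch is repeatable

# Randomly select whole clusters ...
chosen_sections = rng.sample(sorted(sections), k=2)

# ... and include ALL units of the selected clusters, none from the others.
sample = [student for name in chosen_sections for student in sections[name]]

print(chosen_sections, len(sample))
```

Only the list of sections is needed up front; a list of students is required only for the sections that were selected.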
52
• As mentioned, cost reduction is a reason for using cluster sampling.
• It creates 'pockets' of sampled units instead of spreading the sample over the whole territory.
• Another reason is that sometimes a list of all units in the population is not available, while a list of all clusters is either available or easy to create.
53
• In most cases, the main drawback is a loss of efficiency when compared with SRS.
• It is usually better to survey a large number of small clusters instead of a small number of large clusters.
– This is because neighboring units tend to be more alike, resulting in a sample that does not represent the whole spectrum of opinions or situations present in the overall population.
54
• Another drawback to cluster sampling is that you do not have total control over the final sample size.
• For example, not all schools have the same number of (say, Grade 11) students, and city blocks do not all have the same number of households. Since you must interview every student or household in the selected clusters, the final sample size may be larger or smaller than you expected.
55
5. Multi-stage sampling
• Similar to the cluster sampling, except that it involves picking a sample from within each chosen cluster, rather than including all units in the cluster.
• This type of sampling requires at least two stages.
56
• In the first stage, large groups or clusters are identified and selected. These clusters contain more population units than are needed for the final sample.
• In the second stage, population units are picked from within the selected clusters (using any of the possible probability sampling methods) for a final sample.
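The two stages described above can be sketched as follows, assuming simple random sampling at both stages. The cluster names (villages), unit names (households), the counts and the function name `two_stage_sample` are all hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: select clusters at random. Stage 2: sample units within each selected cluster."""
    chosen = rng.sample(sorted(clusters), k=n_clusters)
    return {name: rng.sample(clusters[name], k=n_per_cluster) for name in chosen}

# Hypothetical frame: 10 villages (clusters) of 40 households each.
clusters = {
    f"village_{i}": [f"household_{i}_{j}" for j in range(1, 41)]
    for i in range(1, 11)
}

rng = random.Random(3)  # fixed seed only so the sketch is repeatable
sample = two_stage_sample(clusters, n_clusters=3, n_per_cluster=5, rng=rng)

for village, households in sample.items():
    print(village, households)
```

Unlike one-stage cluster sampling, only some of the units in each selected cluster are included, so the final sample size is fixed in advance (here 3 x 5 = 15).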
57
• If more than two stages are used, the process of choosing population units within clusters continues until there is a final sample.
• With multi-stage sampling, you still have the benefit of a more concentrated sample for cost reduction.
• However, the sample is not as concentrated as in single-stage cluster sampling, and the required sample size is still larger than for a simple random sample.
58
• Also, you do not need to have a list of all of the units in the population. All you need is a list of clusters and list of the units in the selected clusters.
• Admittedly, more information is needed in this type of sample than what is required in cluster sampling. However, multi-stage sampling still saves a great amount of time and effort by not having to create a list of all the units in a population.
59
1.3.2. Non-probability sampling
• The difference between probability and non-probability sampling has to do with a basic assumption about the nature of the population under study.
• In probability sampling, every item has a known chance of being selected.
• In non-probability sampling, there is an assumption that there is an even distribution of a characteristic of interest within the population.
60
• This is what makes the researcher believe that any sample would be representative and because of that, results will be accurate.
• For probability sampling, randomness is a feature of the selection process, rather than an assumption about the structure of the population.
61
• In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample.
• Also, no assurance is given that each item has a chance of being included, making it impossible either to estimate sampling variability or to identify possible bias.
62
• Reliability cannot be measured in non-probability sampling; the only way to address data quality is to compare some of the survey results with available information about the population.
• Still, there is no assurance that the estimates will meet an acceptable level of error.
• Researchers are reluctant to use these methods because there is no way to measure the precision of the resulting sample.
63
• Despite these drawbacks, non-probability sampling methods can be useful when descriptive comments about the sample itself are desired.
• Secondly, they are quick, inexpensive and convenient.
• There are also circumstances in research when it is unfeasible or impractical to conduct probability sampling.
64
Common types of non-probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
65
1.4. Scales of measurement
• Measurement: the assignment of numbers or names to objects or events according to a set of rules.
• Clearly not all measurements are the same.
• Measuring an individual’s weight is qualitatively different from measuring their response to some treatment on a three-category scale: “improved”, “stable”, “not improved”.
• Measuring scales are different according to the degree of precision involved.
• There are four types of scales of measurement.
66
Scales…
1. Nominal scale: uses names, labels, or symbols to assign each measurement to one of a limited number of categories that cannot be ordered.
– Examples: Blood type, sex, race, marital status
2. Ordinal scale: assigns each measurement to one of a limited number of categories that are ranked in terms of a graded order.
– Examples: Patient status, cancer stages
67
Scales…
3. Interval scale: assigns each measurement to one of an unlimited number of categories that are equally spaced. It has no true zero point.
– Example: Temperature measured in Celsius or Fahrenheit
4. Ratio scale: measurement begins at a true zero point and the scale has equal spacing.
– E.g. height, weight, blood pressure
68
Scales…
69
1.5. Validity and reliability
Validity and reliability are two major requirements for any measurement.
– Validity pertains to the correctness of the measure; a valid tool measures what it is supposed to measure.
– Reliability pertains to the consistency of the tool across different contexts.
• Validity is often described as internal or external.
70
1.6. Sources and methods of data collection and its handling
Sources
Two major sources
Primary sources: data which are collected by the investigator himself/herself for the purpose of a specific inquiry or study. Such data are original in character and are mostly generated by surveys conducted by individuals or research institutions.
The first-hand information obtained by the investigator is more reliable and accurate, since the investigator can extract the correct information by removing any doubts in the minds of the respondents regarding certain questions. High response rates might be obtained since the answers to various questions are obtained on the spot. It permits explanation of questions concerning difficult subject matter.
71
Secondary data
When an investigator uses data which have already been collected by others, such data are called "Secondary Data". Such data are primary data for the agency that collected them, and become secondary for someone else who uses these data for his/her own purposes.
The secondary data can be obtained from journals, reports of different institutions, government publications, and publications of professionals and research organizations. These data are less expensive and can be collected in a short time.
72
Data collection methods
1. Observation
• is a technique that involves systematically selecting, watching and recording behaviours of people or other phenomena and aspects of the setting in which they occur, for the purpose of getting specified information.
• includes all methods from simple visual observations to the use of high-level machines and measurements, sophisticated equipment or facilities, such as radiographic, biochemical, X-ray machines, microscope, clinical examinations, and microbiological examinations.
73
Observation…
• Advantages: Gives relatively more accurate data on behaviour and activities
• Disadvantages:
– The investigator’s or observer’s own biases, prejudices, desires, etc. may affect the data.
– Needs more resources and skilled human power when high-level machines are used.
74
2. The documentary sources
• Include clinical records and other personal records, published mortality statistics, census publications, etc.
• Advantages:
a) Documents can provide ready-made information relatively easily
b) The best means of studying past events
• Disadvantages:
a) Problems of reliability and validity (because the information is collected by a number of different persons who may have used different definitions or methods of obtaining data).
b) There is a possibility that errors may occur when the information is extracted from the records.
75
3. Interviews and self-administered questionnaire
a) Interviews: may be less or more structured.
A public health worker conducting interviews may be armed with a checklist of topics, but may not decide in advance precisely what questions he/she will ask.
• This approach is flexible; the content, wording and order of the questions are relatively unstructured.
– The content, wording and order of the questions vary from interview to interview.
76
Interviews…
On the other hand, in other situations a more standardized technique may be used, the wording and order of the questions being decided in advance.
This may take the form of a highly structured interview (interviewing using a questionnaire):
• the investigator appoints persons/enumerators, who go to the respondents personally with the questionnaire, ask them questions and record their replies.
– This can be done using telephone or face-to-face interviews.
77
Interviews…
• Questions may take two general forms:
– “open ended” questions, which the subject answers in his/her own words, or
– “closed” questions, which are answered by choosing from a number of fixed alternative responses.
78
Advantage of interview
• A good interviewer can stimulate and maintain the respondent’s interest. This leads to the frank answering of questions.
• If anxiety is aroused (e.g., why am I being asked these questions?) , the interviewer can allay it.
An interviewer:
• can repeat questions which are not understood, and give standardized explanations where necessary.
• can ask “follow-up” or “probing” questions to clarify a response.
• can make observations during the interview, i.e. note is taken not only of what the subject says but also of how he/she says it.
79
b) Self-administered questionnaire
• The respondent reads the questions and fills in the answers by himself/herself (sometimes in the presence of an interviewer who “stands by” to give assistance if necessary).
• The use of self-administered questionnaires is simpler and cheaper;
• they can be administered to many persons simultaneously (e.g. to a class of school children).
• They can be sent by post. However, they demand a certain level of education on the part of the respondent.
80
• Quantitative data are commonly collected using structured interviews (where standard questionnaires are common and the collected data can be processed relatively easily), whereas
• qualitative data are usually collected using unstructured interviews.
• The unstructured interviews are undertaken with the help of checklists, key informant interviews, focus group discussions, etc.
81
Qualitative…
Checklist – a list of questions prepared ahead of time to facilitate the interviews or discussions. It is not an exhaustive one. It helps the facilitator not to miss any of the important topics under consideration.
Key informant interviews – interviews done with influential individuals (such as community elders, priests, etc.).
Focus group discussions – discussions made with a group of respondents.
• The group contains 6 to 12 people who are more or less similar with respect to level of education, marital status, age, sex, etc. (this composition helps each respondent to talk freely without being dominated by the others).
82
Steps in Questionnaire Design
1. Before beginning to construct the questionnaire, make sure that it is the best method of collecting data for your objectives.
– Know beforehand what information is needed and what is going to be done with this information.
2. While drafting the questions, one has to know why each question is asked and what will be done with the information (to prevent wasting resources).
83
Steps in…
3. To get valid and reliable information:
• the wording and sequence of questions should help respondents recall the information and should prevent forgetfulness
• avoid difficult, time-consuming, embarrassing or overly personal questions
• the flow of questions should be from simple to complex, from general to specific, and from impersonal to personal
• care should be taken to protect the confidentiality of the respondent
• include a cover letter (if by mail)
• identify respondents by ID (rather than by name)
84
Data Collection and handling Process
85
Data collection
A plan for data collection can be made in two steps:
1. Listing the tasks that have to be carried out and who should be involved, making a rough estimate of the time needed for the different parts of the study, and identifying the most appropriate period in which to carry out the research
2. Actually scheduling the different activities that have to be carried out each week in a work plan
86
Why should you develop a plan for data collection?
A plan for data collection should be developed so that:
– you will have a clear overview of what tasks have to be carried out, who should perform them, and the duration of these tasks;
– you can organize both human and material resources for data collection in the most efficient way; and
– you can minimize errors and delays which may result from lack of planning (for example, the population not being available or data forms being misplaced).
87
Data collection process
Stages
• Stage 1: Permission to proceed
– Obtaining consent from the relevant authorities, individuals and the community in which the project is to be carried out
88
Data collection process
Stage 2: Data collection
• Logistics
– who will collect what,
– when, and
– with what resources
• Quality control
– Prepare a field work manual
– Select your research assistants
– Train research assistants
– Supervise the field work
– Check the collected data for completeness and accuracy
89
Data collection process
• How long will it take to collect the data for each component of the study?
– Step 1: Consider the time required to reach the study area; to locate the study units; the number of visits required per study unit and for follow-up of non-respondents
– Step 2: Calculate the number of interviews that can be carried out per person per day
– Step 3: Calculate the number of days needed to carry out the interviews.
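The arithmetic in Steps 2 and 3 can be sketched in Python. This is only an illustration: the function name and all of the figures (400 interviews, 5 interviewers, 8 interviews per person per day) are hypothetical and do not come from the lecture.

```python
import math

def fieldwork_days(total_interviews, interviewers, interviews_per_person_per_day):
    """Days needed = total interviews / interviews the whole team completes per day."""
    per_day = interviewers * interviews_per_person_per_day
    # Round up: a partly used day still has to be scheduled.
    return math.ceil(total_interviews / per_day)

# Hypothetical figures chosen for illustration only.
days = fieldwork_days(total_interviews=400, interviewers=5,
                      interviews_per_person_per_day=8)
print(days)
```

Travel time and repeat visits to non-respondents (Step 1) would lower the realistic number of interviews per person per day, so that input should be estimated conservatively.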
90
Ensuring data quality
Measures to help ensure good quality of data:
• Prepare a field work manual for the research team as a whole
• Select your research assistants, if required, with care
• Train research assistants carefully in all topics covered in the field work manual as well as in interview techniques
• Pre-test research instruments and research procedures with the whole research team, including research assistants.
91
Ensuring data quality
• Take care that research assistants are not placed under too much stress
• Arrange for on-going supervision of research assistants, and develop guidelines for supervisory tasks.
• Devise methods to assure the quality of data collected by all members of the research team.
92
Data Collection Process
Stage 3: Data handling
• Once the data have been collected and checked for completeness and accuracy, a clear procedure should be developed for handling and storing them
• Number all questionnaires
• Identify the person responsible for storing the data and the place where they will be stored
• Decide how data should be stored. Record forms should be kept in the sequence in which they have been numbered.
93
Research Assistants
• This includes data collectors, supervisors and possibly local guides
• Selection – during selection one should consider similarity in educational level and possibly sex composition
• Training – all research assistants and team members should be trained together
94
Pre-test and pilot study
A pre-test usually refers to a small-scale trial of particular research components.
A pilot study is the process of carrying out a preliminary study, going through the entire research procedure with a small sample.
Why do we carry out a pre-test or pilot study?
A pre-test or pilot study serves as a trial run that allows us to identify potential problems in the proposed study.
95
Pre-test and pilot study
What aspects of your research methodology can be evaluated during pre-testing?
1. Reactions of the respondents to the research procedures can be observed in the pre-test – availability and willingness
2. The data-collection tools can be pre-tested
3. Sampling procedures can be checked
4. Staffing and activities of the research team can be checked, while all are involved in the pre-test
5. Procedures for data processing and analysis can be evaluated during the pre-test
6. The proposed work plan and budget for research activities can be assessed during the pre-test.
96
Plan for data processing & analysis
• Data processing and analysis should start in the field, with checking for completeness of the data and performing quality control checks, while sorting the data by instrument used and by group of informants
• Data from small samples may even be processed and analyzed as soon as they are collected.
97
Plan for data processing & analysis
• The plan for data processing and analysis must be made after careful consideration of the objectives of the study as well as of the tools developed to meet the objectives.
• The procedures for the analysis of data collected through qualitative and quantitative techniques are quite different.
– For quantitative data the starting point in analysis is usually a description of the data for each variable
– For qualitative data it is more a matter of describing, summarizing and interpreting the data obtained for each study unit
98
Plan for data processing & analysis
• When making a plan for data processing and analysis the following issues should be considered:
– Sorting data,
– Performing quality-control checks,
– Data processing, and
– Data analysis.
99
Data processing and analysis
• Sorting data
– Into groups of different study populations or comparison groups
• Quality control checks
– Check again for completeness and internal consistency
– Missing data – if many, exclude the questionnaire
– Inconsistency – correct, return or exclude
100
Data processing
• Decide whether to process and analyse the data from questionnaires:
– manually, using data master sheets or manual compilation of the questionnaires, or
– by computer, for example, using a microcomputer and existing software or self-written programmes for data analysis.
• Data processing in both cases involves:
– categorising the data,
– coding, and
– summarising the data in data master sheets, manual compilation without master sheets, or data entry and verification by computer.
101
2.Descriptive statistics
(Data summarization)
102
2.Data summarization(Descriptive statistics)
2.1.Describing variables
The methods of describing variables differ depending on the type of data: categorical or numerical.
Sometimes we transform numeric data into categorical data (e.g., age)
– when a lesser degree of detail is required
• This is achieved by dividing the range of values which the numeric variable takes into intervals.
103
Describing…
Categorical variables
• Table of frequency distributions
– Frequency
– Relative frequency
– Cumulative frequencies
• Charts
– Bar charts
– Pie charts
104
Describing …
105
In summary,
• There are three ways we can summarize and present data:
• Tabular representation - summarizing data by making a table of the data called a frequency distribution.
• Graphical representation of data - we can make a graph of the data.
• Numerical representation of data - we can use a single number to represent many numbers.
– Measures of central tendency.
– Measures of variability.
106
2.2. Frequency Distribution
• A frequency distribution shows the number of observations falling into each of several ranges of values.
• There are four different types of frequency distributions.
– Simple frequency distribution (or it can be just called a frequency distribution).
– Cumulative frequency distribution.
– Grouped frequency distribution.
– Cumulative grouped frequency distribution.
• They are portrayed as frequency tables, histograms, or polygons
• They can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution
107
Simple frequency distribution
Consider the following set of data, which are the high temperatures recorded for 30 consecutive days. We wish to summarize these data by creating a frequency distribution of the temperatures.
Data Set - High Temperatures for 30 Days
50 45 49 50 43
49 50 49 45 49
47 47 44 51 51
44 47 46 50 44
51 49 43 43 49
45 46 45 51 46
108
Simple frequency distribution…
To create a frequency distribution from this data proceed as follows:
1. Identify the highest and lowest values in the data set. For our temperatures the highest temperature is 51 and the lowest temperature is 43.
2. Create a column with the title of the variable we are using, in this case temperature. Enter the highest score at the top, and include all values within the range from the highest score to the lowest score.
109
Simple frequency…
3. Create a tally column to keep track of the scores as you enter them into the frequency distribution. Once the frequency distribution is completed you can omit this column
4. Create a frequency column, with the frequency of each value, as shown in the tally column, recorded.
5. At the bottom of the frequency column record the total frequency for the distribution, preceded by N =
6. Enter the name of the frequency distribution at the top of the table.
110
Simple frequency…
If we applied these steps to the temperature data above we would have the following frequency distribution:
Frequency Distribution for High Temperatures
Temperature   Tally     Frequency
51            ////      4
50            ////      4
49            //////    6
48                      0
47            ///       3
46            ///       3
45            ////      4
44            ///       3
43            ///       3
                        N = 30
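The tallying steps can also be carried out programmatically. The following is a minimal Python sketch using the 30 temperatures listed above; the variable names are illustrative only.

```python
from collections import Counter

temps = [50, 45, 49, 50, 43,
         49, 50, 49, 45, 49,
         47, 47, 44, 51, 51,
         44, 47, 46, 50, 44,
         51, 49, 43, 43, 49,
         45, 46, 45, 51, 46]

counts = Counter(temps)

# List every value from the highest to the lowest, including values
# that never occur (frequency 0), as in step 2 of the procedure.
freq_table = [(t, counts[t]) for t in range(max(temps), min(temps) - 1, -1)]

print("Temperature  Tally     Frequency")
for value, freq in freq_table:
    print(f"{value:<12} {'/' * freq:<9} {freq}")
print("N =", sum(freq for _, freq in freq_table))
```

Iterating over the full range rather than over the observed values is what keeps the zero-frequency row for 48 in the table.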
111
Cumulative frequency distribution
To create a cumulative frequency distribution:
• Create a frequency distribution
• Add a column entitled cumulative frequency
• The cumulative frequency for each score is the sum of the frequencies up to and including the frequency for that score
• The highest cumulative frequency should equal N (the total of the frequency column)
112
Cumulative frequency…
113
Cumulative Frequency Distribution for High Temperatures
Temperature   Tally     Frequency   Cumulative Frequency
51            ////      4           30
50            ////      4           26
49            //////    6           22
48                      0           16
47            ///       3           16
46            ///       3           13
45            ////      4           10
44            ///       3           6
43            ///       3           3
                        N = 30
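The cumulative column is a running total of the frequency column, starting from the lowest temperature. A short Python sketch, using the frequencies from the table above (variable names are illustrative):

```python
from itertools import accumulate

values = [43, 44, 45, 46, 47, 48, 49, 50, 51]   # lowest to highest
freqs  = [3, 3, 4, 3, 3, 0, 6, 4, 4]

# Running total: frequency up to and including each value.
cum_freqs = list(accumulate(freqs))

for value, f, cf in zip(values, freqs, cum_freqs):
    print(value, f, cf)

# The highest cumulative frequency must equal N.
print("N =", cum_freqs[-1])
```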
Grouped frequency distribution
To create a grouped frequency distribution:
• select an interval size so that you have 7-20 class intervals; Sturges' rule can also be used to choose the number of intervals
• create a class interval column and list each of the class intervals
• each interval must be the same size, the intervals must not overlap, and there may be no gaps within the range of class intervals
• create a tally column (optional)
• create a midpoint column for interval midpoints
• create a frequency column
• enter N = some value at the bottom of the frequency column
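As an illustration of these steps, the sketch below applies Sturges' rule in its usual form, k = 1 + log2(n) rounded up (equivalently 1 + 3.322 log10 n), to the 30-day temperature data. The rule is not spelled out on the slide, so this form is an assumption. For so few distinct values it suggests fewer than the 7-20 intervals recommended above, so the result is illustrative only.

```python
import math
from collections import Counter

temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

def sturges_classes(n):
    """Suggested number of class intervals: 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

low, high = min(temps), max(temps)
k = sturges_classes(len(temps))
# Equal interval width for integer data covering low..high.
width = math.ceil((high - low + 1) / k)
n_classes = math.ceil((high - low + 1) / width)

# Index of the class each observation falls in (no gaps, no overlap).
counts = Counter((t - low) // width for t in temps)

table = []
for i in range(n_classes):
    lower = low + i * width
    upper = lower + width - 1
    table.append((lower, upper, (lower + upper) / 2, counts[i]))

for lower, upper, midpoint, freq in table:
    print(f"{lower}-{upper}  midpoint {midpoint}  frequency {freq}")
print("N =", sum(row[3] for row in table))
```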
114
Grouped frequency for the temperature data
Grouped Frequency Distribution for High Temperatures
Class Interval   Tally          Interval Midpoint   Frequency
57-59            //////         58                  6
54-56            ///////        55                  7
51-53            ///////////    52                  11
48-50            /////////      49                  9
45-47            ///////        46                  7
42-44            //////         43                  6
39-41            ////           40                  4
                                                    N = 50
115
Cumulative grouped frequency distribution
We just add a cumulative frequency column to the grouped frequency distribution and we have a cumulative grouped frequency distribution:
Cumulative Grouped Frequency Distribution for High Temperatures
Class Interval   Tally          Interval Midpoint   Frequency   Cumulative Frequency
57-59            //////         58                  6           50
54-56            ///////        55                  7           44
51-53            ///////////    52                  11          37
48-50            /////////      49                  9           26
45-47            ///////        46                  7           17
42-44            //////         43                  6           10
39-41            ////           40                  4           4
                                                    N = 50
116
Relative Frequency
• Sometimes it is useful to compute the proportion, or percentage, of observations in each category.
• The relative frequency of a particular category is the proportion (fraction) of observations that fall into that category.
• The cumulative relative frequency is the sum of the relative frequencies of all categories up to and including a particular category.
– It is the relative frequency of items less than or equal to the upper class limit of each class.
• It can be computed for quantitative data and for categorical (qualitative) data (but only if the latter are ordinal)
117
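A short Python sketch of these definitions, reusing the simple frequency distribution of the 30 high temperatures (variable names are illustrative):

```python
from itertools import accumulate

values = [43, 44, 45, 46, 47, 48, 49, 50, 51]
freqs  = [3, 3, 4, 3, 3, 0, 6, 4, 4]
n = sum(freqs)

# Relative frequency: proportion of observations in each category.
rel_freqs = [f / n for f in freqs]

# Cumulative relative frequency: proportion at or below each value.
# Dividing the cumulative counts avoids accumulating rounding error.
cum_rel_freqs = [c / n for c in accumulate(freqs)]

for v, rf, crf in zip(values, rel_freqs, cum_rel_freqs):
    print(f"{v}  {100 * rf:5.1f}%  {100 * crf:5.1f}%")
```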
Characteristics and guidelines of table construction
Characteristics
• A table must be self-explanatory
• The title should describe the content of the table and should answer the questions what, where and when the data were collected
• Percentages in each category should add up to 100
• Footnotes should be placed at the bottom of the table
118
Guidelines
• The shape and size of the table should contain the required number of rows and columns to accommodate the whole data
• If a quantity is zero, it should be entered as zero; leaving a blank space or putting a dash in place of zero is confusing and undesirable
• In case two or more figures are the same, ditto marks should not be used in a table in place of the original numerals
• If any figure in a table has to be specified for a particular purpose, it should be marked with an asterisk
119
2.3. Diagrammatic Representation
2.3.1. Importance of diagrammatic representation:
1.Diagrams have greater attraction than mere figures. They give delight to the eye, add a spark of interest and as such catch the attention as much as the figures dispel it.
2.They help in deriving the required information in less time and without any mental strain.
3.They have greater memorizing value than mere figures. This is so because the impression left by the diagram is of a lasting nature.
4.They facilitate comparison
120
Importance….
Well designed graphs can be an incredibly powerful means of communicating a great deal of information using visual techniques
When graphs are poorly designed, they not only do not effectively convey your message, they often mislead and confuse.
121
2.3.2.Types
1. Bar graph
•Bar diagram is the easiest and most adaptable general purpose chart.
•Though this type of chart can be used for any type of series, it is especially satisfactory for nominal and ordinal data.
•The categories are represented on the base line (X-axis) at regular intervals and the corresponding values of frequencies or relative frequencies are represented on the Y-axis (ordinate) in the case of a vertical bar diagram, and vice versa in the case of a horizontal bar diagram.
122
Method of constructing bar graph
•All bars drawn in any single study should be of the same width
•The different bars should be separated by equal distances
•All the bars should rest on the same line called the base
•It is better to construct a diagram on graph paper
Types of bar graph
• 1.Simple bar graph: It is a one-dimensional diagram in which the bar represents the whole of the magnitude. The height/length of each bar indicates the frequency of the figure represented.
Example: Construct a bar graph for the following data
123
Table__, Distribution of pediatric patients in X hospital ward by type of admitting diagnosis, Jan 2000
Diagnosis          Number of patients   Relative freq (%)
Pneumonia          487                  48.7
Malaria            200                  20.0
Cardiac problems   168                  16.8
Malnutrition       80                   8.0
Others             65                   6.5
Total              1000                 100
124
1. Simple bar graph…
125
2.Sub-divided (component) bar graph
• It is also called a segmented bar graph. If a given magnitude can be split up into subdivisions, or if there are different quantities forming the subdivisions of the totals, simple bars may be subdivided in the ratio of the various subdivisions to exhibit the relationship of the parts to the whole.
• The order in which the components are shown in a "bar" is followed in all bars used in the diagram.
126
2.Sub-divided…
127
3. Multiple bar graph
Multiple Bar diagrams can be used to represent the relationships among more than two variables.
The following figure shows the relationship between children’s reports of breathlessness and cigarette smoking by themselves and their parents.
128
3. Multiple bar graph…
129
3. Multiple bar graph…
• We can quickly see from the graph that the prevalence of the symptom increases both with the child's smoking and with that of their parents.
130
2. Pie chart
Pie chart shows the relative frequency for each category by dividing a circle into sectors, the angles of which are proportional to the relative frequency.
Steps to construct a pie chart
• Construct a frequency table
• Change the frequencies into percentages (P)
• Change the percentages into degrees, where: degrees = (P / 100) × 360°
• Draw a circle and divide it accordingly
131
2. Pie chart…
Example: Distribution of death for females, in England and Wales, 1989.
132
Cause of death            Number of deaths
Circulatory system (C)    100,000
Neoplasm (N)              70,000
Respiratory system (R)    30,000
Injury & poisoning (I)    6,000
Digestive system (D)      10,000
Others (O)                20,000
Total                     236,000
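The conversion from frequencies to percentages to sector angles can be sketched in Python with the figures from the table above. Each angle is the category's share of the total multiplied by 360°; the variable names are illustrative.

```python
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

# Percentage of each category, then the corresponding sector angle.
percentages = {cause: 100 * n / total for cause, n in deaths.items()}
angles = {cause: p / 100 * 360 for cause, p in percentages.items()}

for cause in deaths:
    print(f"{cause:<20} {percentages[cause]:5.1f}%  {angles[cause]:6.1f} degrees")
```

The angles must add up to 360°, which is a quick check on the arithmetic before drawing.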
2. Pie chart…
133
3.Histogram
Histograms are frequency distributions with continuous class interval that have been turned into graphs.
To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line.
Non-overlapping intervals that cover all of the data values must be used.
Bars are then drawn over the intervals in such a way that the areas of the bars are all proportional in the same way to their interval frequencies.
134
Example: Distribution of the RBC cholinesterase values (µmol/min/ml) obtained from 35 workers Exposed to Pesticides
135
RBC cholinesterase (µmol/min/ml)   Frequency, n (%)   Cumulative frequency (%)
5.95-7.95                          1 (2.9)            2.9
7.95-9.95                          8 (22.9)           25.8
9.95-11.95                         14 (40.0)          65.8
11.95-13.95                        9 (25.7)           91.5
13.95-15.95                        2 (5.7)            97.2
15.95-17.95                        1 (2.9)            100
Total                              35 (100)
Source: Knapp RG, Miller MC III: Clinical Epidemiology and biostatistics
3.Histogram…
136
Figure: Histogram of the RBC cholinesterase values of 35 pesticide-exposed workers. X-axis: RBC cholinesterase (µmol/min/ml), class midpoints 6.95 to 16.95; Y-axis: number of pesticide-exposed workers (0 to 16).
4.Frequency polygon
A frequency distribution can be portrayed graphically in yet another way by means of a frequency polygon.
•To draw a frequency polygon we connect the mid-points of the tops of the cells of the histogram by straight lines.
•It can also be drawn without erecting rectangles as follows:
– The scale should be marked in the numerical values of the mid-points of the intervals.
– Erect ordinates on the mid-point of each interval, the length or altitude of an ordinate representing the frequency of the class on whose mid-point it is erected.
– Join the tops of the ordinates and extend the connecting line to the scale of sizes.
137
4.Frequency polygon…
138
5.Cumulative frequency polygon (ogive curve)
Sometimes it may become necessary to know the number of items whose values are more or less than a certain amount.
•We may, for example, be interested in knowing the number of patients whose weight is less than 50 Kg or more than, say, 60 Kg.
•To get this information it is necessary to change the form of the frequency distribution from a 'simple' to a 'cumulative' distribution.
•The ogive curve turns a cumulative frequency distribution into a graph.
139
5.Cumulative frequency polygon (ogive curve)…
Example: Heart rate of patients admitted to Hospital B, 2000
140
Heart rate (beats/min)   No. of patients   Cumulative freq., less than method   Cumulative freq., greater than method
54.5-59.5                1                 1                                    54
59.5-64.5                5                 6                                    53
64.5-69.5                3                 9                                    48
69.5-74.5                5                 14                                   45
74.5-79.5                11                25                                   40
79.5-84.5                16                41                                   29
84.5-89.5                5                 46                                   13
89.5-94.5                5                 51                                   8
94.5-99.5                2                 53                                   3
99.5-104.5               1                 54                                   1
Total                    54
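Both cumulative columns follow from the frequency column by running totals: from the top for the "less than" method and from the bottom for the "greater than" method. A minimal Python sketch using the heart-rate frequencies above (class boundaries of width 5 starting at 54.5 are assumed):

```python
from itertools import accumulate

lower_bounds = [54.5 + 5 * i for i in range(10)]   # 54.5, 59.5, ..., 99.5
upper_bounds = [b + 5 for b in lower_bounds]       # 59.5, 64.5, ..., 104.5
freqs = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# "Less than" method: patients with heart rate below each upper boundary.
less_than = list(accumulate(freqs))

# "Greater than" method: patients with heart rate above each lower boundary.
greater_than = list(accumulate(reversed(freqs)))[::-1]

for lo, hi, f, lt, gt in zip(lower_bounds, upper_bounds, freqs, less_than, greater_than):
    print(f"{lo}-{hi}  {f:>2}  {lt:>2}  {gt:>2}")
```

For the ogive, the "less than" values are plotted against the upper class boundaries and the "greater than" values against the lower class boundaries.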
5.Cumulative frequency polygon (ogive curve)…
141
6.Box-and-whisker plot
It is another way to display information when the objective is to illustrate certain locations in the distribution.
A box is drawn with the top of the box at the third quartile and the bottom at the first quartile.
The location of the midpoint of the distribution (the median) is indicated with a horizontal line in the box.
Finally, straight lines, or whiskers, are drawn from the center of the top of the box to the largest observation and from the center of the bottom of the box to the smallest observation.
It is useful when one of the characteristics is qualitative and the other is quantitative.
142
Eg: percentage supersaturation of bile by sex of patients
               Men                                   Women
Subject   Age   % Supersaturation      Subject   Age   % Supersaturation
1         23    40                     1         40    65
2         31    86                     2         33    86
3         58    11                     3         49    76
4         25    86                     4         44    89
5         63    106                    5         63    142
6         43    66                     6         27    58
7         67    123                    7         23    98
8         48    90                     8         56    146
9         29    112                    9         41    80
10        26    52                     10        30    66
11        64    88                     11        38    52
12        55    137                    12        23    35
13        31    88                     13        35    55
14        20    80                     14        50    127
15        23    65                     15        47    77
16        43    79                     16        36    91
17        27    87                     17        74    128
18        63    56                     18        53    75
19        59    110                    19        41    82
20        53    106                    20        25    89
21        66    110                    21        57    84
22        48    78                     22        42    116
23        27    80                     23        49    73
24        32    47                     24        60    87
25        62    74                     25        23    76
26        36    58                     26        48    107
27        29    88                     27        44    84
28        27    73                     28        37    120
29        65    118                    29        57    123
30        42    67
31        60    57
143
Box-and-whisker plot…
144
Box-and-whisker plot
• The graphs indicate the similarity of the distributions of the percentage saturation of bile in men and women.
•Again, we see that percentage saturation of bile is a bit more spread out among women, with range 35 to 146, but we see also that the mid-points of the distributions are almost the same and that most of the spread in values in women occurs in the upper half of the distribution.
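The five values a box-and-whisker plot displays (minimum, first quartile, median, third quartile, maximum) can be computed directly. The sketch below uses the 29 values recorded for women in the table above. It relies on `statistics.quantiles` (Python 3.8+), whose default "exclusive" method interpolates at the (n+1)p-th position; other software may use slightly different quartile conventions.

```python
import statistics

women = [65, 86, 76, 89, 142, 58, 98, 146, 80, 66,
         52, 35, 55, 127, 77, 91, 128, 75, 82, 89,
         84, 116, 73, 87, 76, 107, 84, 120, 123]

# Quartiles at the (n+1)/4, 2(n+1)/4 and 3(n+1)/4 positions.
q1, median, q3 = statistics.quantiles(women, n=4)

five_number_summary = (min(women), q1, median, q3, max(women))
print(five_number_summary)
```

The distance from the median to the third quartile is larger than the distance from the first quartile to the median, which matches the remark that most of the spread among women lies in the upper half.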
145
7.Scatter plot
Most studies in medicine involve measuring more than one characteristic, and graphs displaying the relationship between two characteristics are common in the literature.
• To illustrate the relationship between two characteristics when both are quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams).
• A scatter diagram is constructed by drawing X- and Y-axes.
•Each observation is represented by a point or dot (•).
•In the same study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures, and the following plot was displayed.
146
7.Scatter plot…
147
The graph suggests the possibility of a positive relationship between age and percentage saturation of bile in women.
8.Line graph
In this type of graph, we have two variables under consideration, as in a scatter diagram.
•One variable is taken along the X-axis and the other along the Y-axis.
•The points are plotted and joined by line segments in order.
•These graphs depict the trend or variability occurring in the data.
•Sometimes two or more graphs are drawn on the same graph paper using the same scale so that the plotted graphs are comparable.
Example: The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, with normal fat absorption and with fat malabsorption.
148
Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999.
149
Data Summarization (Numerical Summary)
150
Measures of central tendency
On the scale of values of a variable there is a certain stage at which the largest number of items tend to cluster.
Since this stage is usually in the centre of distribution, the tendency of the statistical data to get concentrated at certain values is called “central tendency”
The various methods of determining the actual value at which the data tends to concentrate are called measures of central tendency.
151
Measures of central tendency…
The most important objective of calculating a measure of central tendency is to determine a single figure which may be used to represent a whole series involving magnitudes of the same variable.
In that sense it is an even more compact description of the statistical data than the frequency distribution.
•Since a measure of central tendency represents the entire data, it facilitates comparison within one group or between groups of data.
152
Measures of central tendency…
Characteristics of a good measure of central tendency
A measure of central tendency is good or satisfactory if it possesses the following characteristics.
1.It should be based on all the observations
2.It should not be affected by the extreme values
3.It should be as close to the maximum number of values as possible
4.It should have a definite value
5.It should not be subjected to complicated and tedious calculations
6.It should be capable of further algebraic treatment
7.It should be stable with regard to sampling
153
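Arithmetic mean ($\bar{x}$)
The most familiar measure of central tendency (MCT) is the arithmetic mean (AM). It is also popularly known as the average.
a) Ungrouped data
If $x_1, x_2, \ldots, x_n$ are n observed values, then:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$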
154
Arithmetic mean…
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:
$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
155
where, k = the number of class intervals, $m_i$ = the mid-point of the ith class interval, $f_i$ = the frequency of the ith class interval
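As a sketch of the grouped-data calculation, the Python snippet below uses the midpoints and frequencies of the grouped frequency distribution for high temperatures shown earlier (N = 50). The function name is illustrative.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    total = sum(m * f for m, f in zip(midpoints, freqs))
    return total / sum(freqs)

midpoints = [40, 43, 46, 49, 52, 55, 58]   # m_i, lowest class first
freqs     = [4, 6, 7, 9, 11, 7, 6]         # f_i

mean = grouped_mean(midpoints, freqs)
print(sum(freqs), mean)
```

Because every observation is treated as if it sat at its class mid-point, the result is an approximation to the mean of the raw data.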
Arithmetic mean…Example.
156
Mean = 2630/100 = 26.3
Arithmetic mean…
• The arithmetic mean possesses the following properties.
• Uniqueness: For a given set of data there is one and only one arithmetic mean.
• Simplicity: The arithmetic mean is easily understood and easy to compute.
• Center of gravity: The algebraic sum of the deviations of the given values from their arithmetic mean is always zero.
• Sensitivity: The arithmetic mean possesses all the characteristics of a central value, except No.2 (it is greatly affected by the extreme values).
• In the case of grouped data, if any class interval is open, the arithmetic mean cannot be calculated
157
The Median ($\tilde{x}$)
a) Ungrouped data
•The median of a finite set of values is that value which divides the set of values into two equal parts such that the number of values greater than the median is equal to the number of values less than the median.
•If the number of values is odd, the median will be the middle value when all values have been arranged in order of magnitude.
•When the number of observations is even, there is no single middle observation but two middle observations.
•In this case the median is taken to be the mean of these two middle observations, when all observations have been arranged in order of magnitude
158
The Median…
b) Grouped data
• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.
• The first step is to locate the class interval in which the median lies. We use the following procedure.
• Find n/2 and identify the first class interval whose cumulative frequency is at least n/2.
• To find a unique median value, use the following interpolation formula.
159
Median…
160
$\text{Median} = L_m + \left(\frac{n/2 - F_c}{f_m}\right) W$
where,
$L_m$ = lower true class boundary of the interval containing the median
$F_c$ = cumulative frequency of the interval immediately preceding the median class interval
$f_m$ = frequency of the interval containing the median
W = class interval width
n = total number of observations
Median…..Example
161
n/2 = 75/2 = 37.5
Median class interval = 35-44
Lm = 34.5, Fc = 35, W = 10, n = 75, fm = 22
•Median = 34.5 + ((37.5 − 35)/22) × 10 = 35.64
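The interpolation can be checked with a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative, and the numbers are those of the example above.

```python
def grouped_median(lower_boundary, cum_freq_before, class_freq, width, n):
    """Median = Lm + ((n/2 - Fc) / fm) * W for grouped data."""
    return lower_boundary + (n / 2 - cum_freq_before) / class_freq * width

median = grouped_median(lower_boundary=34.5,   # Lm
                        cum_freq_before=35,    # Fc
                        class_freq=22,         # fm
                        width=10,              # W
                        n=75)
print(round(median, 2))
```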
Properties of the median
• There is only one median for a given set of data
• The median is easy to calculate
• The median is a positional average and hence it is not drastically affected by extreme values
• The median can be calculated even in the case of open-end intervals
• It is not a good representative of data if the number of items is small
162
Mode ($\hat{x}$)
a) Ungrouped data
•It is the value which occurs most frequently in a set of values.
•If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.
b) Grouped data
• In designating the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
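For ungrouped data, the three measures of central tendency can be computed with Python's standard `statistics` module. The sketch below reuses the 30 high temperatures from the frequency distribution section. `multimode` (Python 3.8+) returns every most-frequent value, which is useful because a data set may have more than one mode.

```python
import statistics

temps = [50, 45, 49, 50, 43, 49, 50, 49, 45, 49,
         47, 47, 44, 51, 51, 44, 47, 46, 50, 44,
         51, 49, 43, 43, 49, 45, 46, 45, 51, 46]

mean = statistics.mean(temps)        # sum of values / number of values
median = statistics.median(temps)    # average of the 15th and 16th ordered values
modes = statistics.multimode(temps)  # all values with the highest frequency

print(round(mean, 2), median, modes)
```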
163
Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with open-end classes
• Often its value is not unique
• The main drawback of the mode is that often it does not exist
164
MEASURES OF POSITIONS
Quartiles
• Divide the distribution into four equal parts. The 25th percentile demarcates the first quartile (Q1),
• the median or 50th percentile demarcates the second quartile (Q2),
• the 75th percentile demarcates the third quartile (Q3),
• and the 100th percentile demarcates the fourth quartile (Q4), which is the maximum observation.
Q1 is the ¼(n+1)th measurement, i.e., 25% of all the ranked observations are less than Q1.
Q2 is the 2/4(n+1)th = ((n+1)/2)th measurement, i.e., 50% of all the ranked observations are less than Q2. Its position is twice that of Q1.
Q3 is the ¾(n+1)th measurement, i.e., 75% of all the ranked observations are less than Q3. Its position is three times that of Q1.
165
Percentile
• Percentiles simply divide the data into 100 parts.
• The maximum is the value in a set of data that has 100% of the observations at or below it. When we consider it in this way, we call it the 100th percentile.
• From this same perspective, the median, which has 50% of the observations at or below it, is the 50th percentile.
• The pth percentile of a distribution is the value such that p percent of the observations are less than or equal to it.
The pth percentile value depends on whether np/100 is an integer or not:
• If np/100 is not an integer, it is the (k+1)th observation in ascending order, where k is the largest integer less than np/100.
• If np/100 is an integer, it is the average of the (np/100)th and (np/100 + 1)th observations in ascending order.
166
Percentiles…
Example: The following data is the sample of birth weights (grams) of live births at a hospital during a week period.
3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759, 3609, 2069, 3260,
3314, 3541, 3649, 3484, 2834, 2841, 3031.
Calculate the 10th and 90th percentiles.
Solution: n = 20; p = 10 and 90. First put the data in ascending order:
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146.
For the 10th percentile, np/100 = 20 × 10/100 = 2, which is an integer. So the 10th percentile is the average of the 2nd and 3rd ordered observations: (2581 + 2759)/2 = 2670 grams.
For the 90th percentile, np/100 = 20 × 90/100 = 18, which is an integer. So the 90th percentile is the average of the 18th and 19th ordered observations: (3609 + 3649)/2 = 3629 grams.
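The percentile rule used in this example can be written as a short Python function. This is a sketch of the rule as stated in these notes (statistical packages often use other interpolation conventions); the function name is illustrative and p is restricted to whole numbers between 1 and 99.

```python
def percentile(data, p):
    """pth percentile using the np/100 rule (p an integer, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    xs = sorted(data)
    k, remainder = divmod(len(xs) * p, 100)   # integer arithmetic, no rounding error
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th ordered values.
        return (xs[k - 1] + xs[k]) / 2
    # np/100 is not an integer: take the (k+1)-th ordered value.
    return xs[k]

weights = [3265, 3248, 2838, 3323, 3245, 3101, 2581, 3200, 4146, 2759,
           3609, 2069, 3260, 3314, 3541, 3649, 3484, 2834, 2841, 3031]

p10 = percentile(weights, 10)
p90 = percentile(weights, 90)
print(p10, p90)
```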
167
Percentiles…
• Therefore, we would say that 80 percent of the birth weights fall between 2670 g and 3629 g, which gives us an overall feel for the spread of the distribution.
• The most commonly used percentiles other than
the median (50th percentile) are the 25th percentile and the 75th percentile.
168
Measures of variability
• The measure of central tendency alone is not enough to have a clear idea about the distribution of the data.
• Moreover, two or more sets may have the same mean and/or median but they may be quite different.
• Thus, to have a clear picture of the data, one needs a measure of dispersion or variability (scatteredness) amongst the observations in the set.
169
Range (R)
$R = X_L - X_S$,
where
• $X_L$ is the largest value and $X_S$ is the smallest value.
• Properties
• It is the simplest measure and can be easily understood
• It takes into account only two values, which causes it to be a poor measure of dispersion
170
Interquartile range (IQR)
$IQR = Q_3 - Q_1$,
where $Q_3$ is the third quartile and $Q_1$ is the first quartile.
Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 Kg and 10.2 Kg respectively. The interquartile range is therefore
IQR = 10.2 Kg – 8.8 Kg = 1.4 Kg, i.e., the middle 50% of infant girls at 12 months weigh between 8.8 and 10.2 Kg.
171
Interquartile range…
172
Interquartile range…
• Generally, we use the interquartile range to describe variability when we use the median as the measure of central location. We use the standard deviation, which is described later, when we use the mean.
Properties
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two specific values
• It is important in selecting cut-off points in the formulation of clinical standards
• Since it excludes the lowest and highest 25% of values, it is not affected by extreme values
• It is not capable of further algebraic treatment
173
Quartile deviation (QD)
$QD = \frac{Q_3 - Q_1}{2}$
174
Coefficient of quartile deviation (CQD)
$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
CQD is a pure number (unitless) and is useful for comparing the variability among the middle 50% of observations.
Mean deviation (MD)
• Mean deviation is the average of the absolute deviations taken from a central value, generally the mean or median.
• Consider a set of n observations $x_1, x_2, ..., x_n$. Then,
$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$,
where A is a central value (arithmetic mean or median).
175
Mean deviation…
Properties
• MD removes one main objection to the earlier measures, in that it involves every value
• It is not affected much by extreme values
• Its main drawback is that the algebraic negative signs of the deviations are ignored, which is mathematically unsound
• MD is at its minimum when the deviations are taken from the median.
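As a rough illustration of the mean deviation and of the last property, the sketch below computes MD about the mean and about the median. The data values and the function name `mean_deviation` are hypothetical, chosen only for this example.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(x - center) for x in values) / len(values)


# Hypothetical observations (not from the slides)
observations = [2, 3, 5, 6, 14]

md_about_mean = mean_deviation(observations, mean(observations))      # center = 6
md_about_median = mean_deviation(observations, median(observations))  # center = 5

print(md_about_mean, md_about_median)  # 3.2 3.0
```

Here the deviation about the median (3.0) is smaller than the deviation about the mean (3.2), in line with the property stated above.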
176
The Variance ($\sigma^2$, $S^2$)
• The main objection of mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean.
• The variance is the average of the squares of the deviations taken from the mean.
177
Variance…
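a) Ungrouped data
Let $X_1, X_2, ..., X_N$ be the measurements on N population units. Then the population variance is
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$, where $\mu$ is the population mean.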
178
Variance…
The sample variance of the set $x_1, x_2, ..., x_n$ of n observations is:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean.
179
Variance…
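b) Grouped data
For a frequency distribution with class midpoints $m_i$ and frequencies $f_i$, the sample variance is
$S^2 = \frac{\sum_i f_i (m_i - \bar{x})^2}{\sum_i f_i - 1}$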
180
Variance…
Properties
• The main demerit of the variance is that its unit is the square of the unit of measurement of the variate values
• The variance gives more weight to extreme values than to values near the mean, because the differences are squared
• The drawbacks of the variance are overcome by the standard deviation.
181
Standard deviation ($\sigma$, $S$)
It is the positive square root of the variance.
182
Properties
• Standard deviation is considered to be the best measure of dispersion and is used widely because of the properties of the theoretical normal curve.
• There is, however, one difficulty with it. If the units of measurement of the variables in two series are not the same, then their variability cannot be compared by comparing the values of the standard deviation.
Coefficient of variation
• When we wish to compare the variability in two sets of data, the standard deviation, which measures absolute variation, may lead to false results.
• The coefficient of variation (CV) gives relative variation and is the best measure for comparing the variability in two sets of data. Avoid using the SD to compare variability between groups measured in different units or with very different means.
• CV = standard deviation / mean (often multiplied by 100 and expressed as a percentage)
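The following sketch shows how the variance, standard deviation and coefficient of variation relate, using Python's standard `statistics` module. The two data sets are hypothetical, and expressing CV as a percentage is an assumption of this example.

```python
from statistics import mean, pvariance, stdev, variance


def coefficient_of_variation(values):
    """Sample SD divided by the mean, expressed as a percentage (assumed convention)."""
    return stdev(values) / mean(values) * 100


# Hypothetical data sets in different units (not from the slides)
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

print(pvariance(heights_cm))  # population variance (divides by N): 200
print(variance(heights_cm))   # sample variance (divides by n - 1): 250
print(stdev(heights_cm), stdev(weights_kg))  # identical SDs, about 15.81

cv_height = coefficient_of_variation(heights_cm)  # about 9.3 %
cv_weight = coefficient_of_variation(weights_kg)  # about 22.6 %
print(cv_height, cv_weight)
```

Both sets have the same standard deviation, yet the weights vary much more relative to their mean, which is the situation the CV is meant to reveal.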
183
4.Basic Probability and probability distributions
• Probability is a mathematical technique for predicting outcomes. It predicts how likely it is that specific events will occur.
• An understanding of probability is fundamental for
quantifying the uncertainty that is inherent in the
decision-making process
• Probability theory also allows us to draw conclusions
about a population of patients based on known
information about a sample of patients drawn from
that population.
184
Basic Probability…
• Mutually exclusive events: Events that cannot occur together
– For example, event A = “Male” and B = “Pregnant” are two mutually exclusive events (as no males can be pregnant).
• Independent events: The presence or absence of one does not alter the chance of the other being present.
– One event happens regardless of the other, and its outcome is not related to the other.
• Probability: If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a characteristic E, the probability of the occurrence of E is P(E) = m/N.
185
4.1.Properties of probability
1. A probability value must lie between 0 and 1: 0 ≤ P(E) ≤ 1. A probability can never be more than 1.0, nor can it be negative.
• A value of 0 means the event cannot occur
• A value of 1 means the event definitely will occur
• A value of 0.5 means that the probability that the event will occur is the same as the probability that it will not occur.
• Probability is measured on a scale from 0 to 1.0, as shown in the following figure of the probability scale.
186
Properties…
187
Fig.___ Probability scale from 0 to 1.0
Properties…
2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1: P(E1) + P(E2) + ... + P(En) = 1
3. For any two events A and B, P(A or B) = P(A) + P(B) - P(A and B) (Addition rule). For two mutually exclusive events A and B, P(A or B) = P(A) + P(B).
4. For any two independent events A and B, P(A and B) = P(A) P(B) (Multiplication rule).
188
Properties…
• To calculate the probability of event A and event B both happening (independent events): for example, if you have two identical packs of cards (pack A and pack B), what is the probability of drawing the ace of spades from both packs?
• Formula: P(A) × P(B)
P(pack A) = 1 card, from a pack of 52 cards = 1/52 = 0.0192
P(pack B) = 1 card, from a pack of 52 cards = 1/52 = 0.0192
P(A) × P(B) = 0.0192 × 0.0192 = 0.00037
5. If A’ is the complementary event of the event A, then P(A’) = 1 - P(A).
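The multiplication, addition and complement rules can be checked with exact fractions. This is a minimal sketch using the two-pack card example; the variable names are illustrative.

```python
from fractions import Fraction

# Probability of drawing the ace of spades from each (independent) pack
p_ace_a = Fraction(1, 52)
p_ace_b = Fraction(1, 52)

# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_both = p_ace_a * p_ace_b

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_ace_a + p_ace_b - p_both

# Complement rule: P(A') = 1 - P(A)
p_not_ace_a = 1 - p_ace_a

print(p_both, float(p_both))  # 1/2704, about 0.00037
print(p_either)               # 103/2704
print(p_not_ace_a)            # 51/52
```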
189
Example
• A study investigated the effect of prolonged exposure to bright light on retinal damage in premature infants. Eighteen of 21 premature infants exposed to bright light developed retinopathy, while 21 of 39 premature infants exposed to a reduced light level developed retinopathy. For this sample, the probability of developing retinopathy is:
P(Retinopathy) = (No. of infants with retinopathy) / (Total no. of infants) = (18 + 21) / (21 + 39) = 39/60 = 0.65
190
Example…
• The following data are the results of electrocardiograms (ECGs) and radionuclide angiocardiograms (RAs) for 19 patients with post-traumatic myocardial contusions. A “+” indicates abnormal results and a “-” indicates normal results.
• 1. Calculate the probability that both the ECG and the RA are abnormal
• 2. Calculate the probability that either the ECG or the RA is abnormal
191
Example
192
Example
Solutions
1. P(ECG abnormal and RA abnormal) = 7/19 = 0.37
2. P(ECG abnormal or RA abnormal) = P(ECG abnormal) + P(RA abnormal) – P(both ECG and RA abnormal) = 17/19 + 9/19 – 7/19 = 19/19 = 1
• NB: We cannot calculate the above probability by simply adding the number of patients with abnormal ECGs to the number with abnormal RAs, i.e. (17 + 9)/19 = 1.37, which exceeds 1.
• The problem is that the 7 patients whose ECGs and RAs are both abnormal are counted twice.
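The double-counting problem can be made concrete with sets. In the sketch below the patient IDs are hypothetical; they are chosen only so that the counts match the example (17 abnormal ECGs, 9 abnormal RAs, 7 with both, 19 patients in total).

```python
n_patients = 19

# Hypothetical patient IDs consistent with the counts in the example
ecg_abnormal = set(range(1, 18))   # patients 1..17  (17 patients)
ra_abnormal = set(range(11, 20))   # patients 11..19 (9 patients)

both = ecg_abnormal & ra_abnormal    # intersection: counted once
either = ecg_abnormal | ra_abnormal  # union: each patient counted once

p_both = len(both) / n_patients
p_either = len(either) / n_patients

# Addition rule gives the same answer as counting the union directly
p_either_rule = (len(ecg_abnormal) + len(ra_abnormal) - len(both)) / n_patients

# Naive sum counts the overlapping patients twice and exceeds 1
naive = (len(ecg_abnormal) + len(ra_abnormal)) / n_patients

print(round(p_both, 2), p_either, round(naive, 2))  # 0.37 1.0 1.37
```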
193
4.2.Conditional probability
• Conditional probabilities are probabilities based on the knowledge that some event has occurred.
• In the retinopathy study described in the above example, the primary concern is the comparison of the bright-light infants with the reduced-light infants. We want to know whether the probability of retinopathy for the bright-light infants differs from the probability of retinopathy for the reduced-light infants.
• We want to compare the probability of retinopathy given that the infant was exposed to bright light with the probability given that the infant was exposed to reduced light.
– Exposure to bright light and exposure to reduced light are conditioning events: events we want to take into account when calculating conditional probabilities.
194
Conditional…
• Conditional probabilities are denoted by P(A | B), sometimes written P(A/B) and read as “probability of A given B”, or P(Event | Conditioning event). The formula for calculating a sample conditional probability is:
P(Event | Conditioning event) = (No. of observations for which event and conditioning event both occur) / (No. of observations for which conditioning event occurs)
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, if $P(B) > 0$
195
Conditional…
• Example: For the retinopathy data, the conditional probability of retinopathy, given exposure to bright light, is
P(Retinopathy | exposure to bright light) = (No. of infants with retinopathy exposed to bright light) / (No. of infants exposed to bright light) = 18/21 = 0.86
Similarly,
P(Retinopathy | exposure to reduced light) = (No. of infants with retinopathy exposed to reduced light) / (No. of infants exposed to reduced light) = 21/39 = 0.54
• The conditional probabilities suggest that premature infants exposed to bright light have a higher risk of retinopathy than premature infants exposed to reduced light.
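A small sketch of the same calculation, using the counts from the retinopathy example. It also checks that the count-based formula agrees with P(A ∩ B) / P(B). The dictionary layout and names are illustrative.

```python
from fractions import Fraction

# Counts from the retinopathy example: (infants with retinopathy, infants exposed)
counts = {"bright": (18, 21), "reduced": (21, 39)}
n_total = sum(exposed for _, exposed in counts.values())  # 60 infants


def p_retinopathy_given(exposure):
    """P(retinopathy | exposure) from the counts."""
    cases, exposed = counts[exposure]
    return Fraction(cases, exposed)


p_bright = p_retinopathy_given("bright")    # 18/21
p_reduced = p_retinopathy_given("reduced")  # 21/39

# Same result via P(A and B) / P(B)
p_ret_and_bright = Fraction(18, n_total)
p_bright_exposure = Fraction(21, n_total)
p_bright_via_formula = p_ret_and_bright / p_bright_exposure

print(round(float(p_bright), 2), round(float(p_reduced), 2))  # 0.86 0.54
```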
196
Summary of formulas for calculating probability
More exercises
197
Calculating probability of an event
Table --- shows the frequency of cocaine use by gender among adult cocaine users
_______________________________________________________
Lifetime frequency
of cocaine use            Male    Female    Total
_______________________________________________________
1-19 times                 32        7        39
20-99 times                18       20        38
More than 100 times        25        9        34
-------------------------------------------------------
Total                      75       36       111
-------------------------------------------------------
198
Questions
1. What is the probability that a randomly picked person is male?
2. What is the probability that a randomly picked person has used cocaine more than 100 times?
3. Given that the selected person is male, what is the probability that he has used cocaine more than 100 times?
4. Given that the person has used cocaine fewer than 100 times, what is the probability that the person is female?
5. What is the probability that a randomly picked person is male and has used cocaine more than 100 times?
199
Answers
1. Pr(m) = Total adult males / Total adult cocaine users = 75/111 = 0.68.
2. Pr(c>100) = All adult cocaine users with more than 100 uses / Total adult cocaine users = 34/111 = 0.31.
3. Pr(c>100 | m) = 25/75 = 0.33.
4. Pr(f | c<100) = (7 + 20)/(39 + 38) = 27/77 = 0.35 (the conditioning event is the 77 people who used cocaine fewer than 100 times).
5. Pr(m ∩ c>100) = Pr(m) × Pr(c>100 | m) = 75/111 × 25/75 = 25/111 = 0.23.
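The five probabilities can be recomputed directly from the table. This is a sketch with illustrative names; note that conditioning on "fewer than 100 times" puts the 39 + 38 = 77 such users in the denominator.

```python
from fractions import Fraction

# Counts from the cocaine-use table
table = {
    "1-19": {"male": 32, "female": 7},
    "20-99": {"male": 18, "female": 20},
    ">100": {"male": 25, "female": 9},
}

total = sum(sum(row.values()) for row in table.values())  # 111
males = sum(row["male"] for row in table.values())        # 75
heavy = sum(table[">100"].values())                       # 34
under_100 = total - heavy                                 # 77
females_under_100 = table["1-19"]["female"] + table["20-99"]["female"]  # 27

p_male = Fraction(males, total)
p_heavy = Fraction(heavy, total)
p_heavy_given_male = Fraction(table[">100"]["male"], males)
p_female_given_under_100 = Fraction(females_under_100, under_100)
p_male_and_heavy = Fraction(table[">100"]["male"], total)

for p in (p_male, p_heavy, p_heavy_given_male,
          p_female_given_under_100, p_male_and_heavy):
    print(p, round(float(p), 2))
```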
200
4.3.Normal distribution
• If we take a large sample of men or women, measure their heights, and plot them on a frequency distribution, the distribution will almost certainly follow a symmetrical bell-shaped pattern known as the normal distribution (also called the Gaussian distribution). See the following figure.
• The least frequently recorded heights lie at the two extremes of the curve. From the figure, it can be seen that very few women are extremely short or extremely tall.
201
Normal distribution…
Figure 3.4.1. Distribution of a sample of values of women's heights.
202
Normal distribution…
• In practice, many biological measurements follow this pattern, making it possible to use the normal distribution to describe many features of a population.
• It must be emphasized that some measurements do not follow the symmetrical shape of the normal distribution, and can be positively skewed or negatively skewed.
• For example, more of the populations of developed Western countries are becoming obese. If a large sample of such a population's weights were plotted on a graph similar to that in Figure 3.4.1 above, there would be an excess of heavier weights, which might form a similar shape to the 'negatively skewed' example in Figure 3.4.2 below.
• The distribution will therefore not fit the symmetrical pattern of the normal distribution.
203
Normal distribution…
Figure 3.4.2. Examples of positive and negative skew.
204
Normal distribution…
Figure 3.4.3. The normal distribution
205
Normal distribution…
From the normal distribution shown in Figure 3.4.3 above, you can see that it is split into two equal and identically shaped halves by the mean.
• The standard deviation indicates the size of the spread of the data. It can also help us to determine how likely it is that a given value will be observed in the population being studied. We know this because the proportion of the population that is covered by any number of standard deviations can be calculated.
• The proportions of values below and above a specified value (e.g. the mean) can be calculated, and are known as tails.
• The normal distribution is useful in a number of applications, including confidence intervals and hypothesis testing.
206
Properties of the normal distribution
1. It is symmetrical about its mean, μ.
2. The mean, the median and the mode are all equal.
3. The total area under the curve above the x-axis is one square unit.
4. The curve never touches the x-axis.
5. As the value of σ increases, the curve becomes more and more flat, and vice versa.
6. About 68% of the values of X fall within one standard deviation of the mean, about 95% within two standard deviations of the mean, and about 99.7% within three standard deviations of the mean.
7. The distribution is completely determined by the parameters μ and σ.
8. The mean is μ and the variance is σ².
207
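The 68%, 95% and 99.7% figures in property 6 can be checked numerically. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
# Prints approximately 0.6827, 0.9545 and 0.9973
```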
Standard normal distribution
• It is a normal distribution that has a mean equal to 0 and a standard deviation equal to 1.
• Z-transformation: If a random variable X ~ N(μ, σ), then we can transform it to a standard normal variable with the Z-transformation
$Z = \frac{X - \mu}{\sigma}$
208
209
Example 1
• In 1932, scores on the Stanford-Binet IQ test were roughly normally distributed with μ = 100 and σ = 15.
• Over time IQs have increased (better nutrition or more experience taking tests?), so the average IQ for present-day American children taking the 1932 test would be 120, but with the same σ.
• “Very Superior” is an IQ above 130.
– (a) What % of 1932 children were “very superior”?
– (b) What % of present-day children would be “very superior” on the 1932 test?
210
Solution
Let X be the 1932 IQ scores and let Y be the scores of present-day children on the 1932 test.
X ~ N(100, 15) and Y ~ N(120, 15)
(a) P(X > 130) = P(Z > (130 - 100)/15) = P(Z > 2.0) = 0.0228 (from the Z table). So 2.28% of 1932 children were “very superior”.
(b) P(Y > 130) = P(Z > (130 - 120)/15) = P(Z > 0.67) = 0.2514. So 25.14% of present-day children would be “very superior”.
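The two tail probabilities can be reproduced without a Z table. The sketch below uses `NormalDist` from Python's standard `statistics` module. It also shows that rounding z to 0.67, as a printed table requires, gives 0.2514, while the unrounded z = 0.6667 gives roughly 0.2525.

```python
from statistics import NormalDist

iq_1932 = NormalDist(mu=100, sigma=15)   # X ~ N(100, 15)
iq_today = NormalDist(mu=120, sigma=15)  # Y ~ N(120, 15)

# (a) P(X > 130): upper tail beyond z = 2.0
p_a = 1 - iq_1932.cdf(130)

# (b) P(Y > 130): upper tail beyond z = (130 - 120) / 15
p_b_exact = 1 - iq_today.cdf(130)

# Same tail with z rounded to 0.67, as when reading a Z table
p_b_table = 1 - NormalDist().cdf(0.67)

print(round(p_a, 4), round(p_b_table, 4))
print(p_b_exact)
```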
211
Example 2
• Data collected on systolic blood pressure in normal healthy individuals are normally distributed with μ = 120 and σ = 10 mm Hg.
1)What proportion of normal healthy individuals have a systolic blood pressure above 130 mm Hg?
2)What proportion of normal healthy individuals have a systolic blood pressure between 100 and 140 mm Hg?
3)What level of systolic blood pressure cuts off the lower 95% of normal healthy individuals?
212
Solutions
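The worked solutions for this example are not reproduced in this transcript. As a hedged sketch, the three quantities can be computed with `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

sbp = NormalDist(mu=120, sigma=10)  # systolic blood pressure, mm Hg

# 1) P(X > 130) = P(Z > 1)
p_above_130 = 1 - sbp.cdf(130)

# 2) P(100 < X < 140) = P(-2 < Z < 2)
p_between_100_140 = sbp.cdf(140) - sbp.cdf(100)

# 3) Value below which 95% of individuals fall: mu + z(0.95) * sigma
cutoff_lower_95 = sbp.inv_cdf(0.95)

print(round(p_above_130, 4))        # about 0.1587
print(round(p_between_100_140, 4))  # about 0.9545
print(round(cutoff_lower_95, 2))    # about 136.45 mm Hg
```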
213
214
215
4.4.The Binomial distribution
It is one of the most widely encountered discrete distributions.
• The origin of the binomial distribution lies in Bernoulli trials. When a single trial of some experiment can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well; male or female), the trial is called a Bernoulli trial.
• Suppose an event can have only binary outcomes A and B. Let the probability of A be π and that of B be 1 - π. The probability π stays the same each time the event occurs.
• If an experiment is repeated n times and the outcome is independent from one trial to another, the probability that outcome A occurs exactly x times is:
$P(X = x) = \binom{n}{x} \pi^{x} (1 - \pi)^{n - x}$, for x = 0, 1, ..., n
216
The Binomial distribution…
217
Characteristics of a Binomial Distribution
• The experiment consists of n identical trials.
• There are only two possible outcomes on each trial.
• The probability of A remains the same from trial to trial. This probability is denoted by π (also written p), and the probability of B is 1 - π (also written q, so that q = 1 - p).
• The trials are independent.
• The binomial random variable X is the number of A’s in n trials.
• n and π are the parameters of the binomial distribution.
• The mean is nπ and the variance is nπ(1 - π)
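To make these characteristics concrete, the sketch below evaluates the standard binomial probability function and checks that the probabilities sum to 1, with mean nπ and variance nπ(1 - π). The values n = 10 and π = 0.3 are hypothetical, and the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """P(X = x) for a binomial random variable with parameters n and pi."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


# Hypothetical parameters (not from the slides)
n, pi = 10, 0.3

pmf = [binomial_pmf(x, n, pi) for x in range(n + 1)]

total_probability = sum(pmf)
mean_x = sum(x * p for x, p in enumerate(pmf))
var_x = sum((x - mean_x) ** 2 * p for x, p in enumerate(pmf))

print(round(total_probability, 6))  # 1.0
print(round(mean_x, 6))             # n * pi = 3.0
print(round(var_x, 6))              # n * pi * (1 - pi) = 2.1
```

"At most" and "at least" probabilities follow by summing the relevant entries of `pmf`.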
218
Exercise(Home work)
• Each child born to a particular set of parents has a probability of 0.25 of having blood type O. Suppose these parents have 5 children.
• What is the probability that:
a. Exactly two of them have blood type O
b. At most 2 have blood type O
c. At least 4 have blood type O
d. 2 do not have blood type O
219