The Role of Statistical Analysis within a Broader Research Methodology Simon French...

Post on 31-Mar-2015

The Role of Statistical Analysis within a Broader Research Methodology

Simon French
simon.french@warwick.ac.uk

Research projects differ

• Some seek to explore issues
  – field studies vs laboratory studies
• Some seek to confirm or disprove hypotheses
  – field studies vs laboratory studies
• Some seek to critically evaluate an area
• Some seek to solve a problem and implement a solution
• Some design and implement new systems
• Some seek to develop new theory or algorithms

(Social) Sciences

Engineering & mathematics

Statistics

There are lies, damn lies, and overused quotations

A statistician is someone who wanted to be an accountant but did not have the charisma.
Anon

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
H.G. Wells

It is the function of the statistical method to emphasise that precise conclusions cannot be drawn from inadequate data.
E.S. Pearson and H.O. Hartley

A witty statesman once said that you might prove anything by figures (but) a judicious man looks at statistics not to get knowledge but to save himself from having ignorance foisted upon him.
Thomas Carlyle

He uses statistics as a drunk uses a street lamp, for support rather than illumination.
Andrew Lang

Statistics ...

• ... is the analytic heart of scientific research and inference

• It is not a numerical add-on; nor should it be seen as a hurdle to publication

So off to the Welsh Valleys!

Cynefin: a Welsh habitat

• Known: the realm of Scientific Knowledge
  – Cause and effect understood and predictable
• Knowable: the realm of Scientific Inquiry
  – Cause and effect can be determined with sufficient data
• Complex: the realm of Social Systems
  – Cause and effect may be determined after the event
• Chaotic
  – Cause and effect not discernible

D. Snowden (2002). "Complex acts of knowing - paradox and descriptive self-awareness." Journal of Knowledge Management 6 pp. 100-11.

Cynefin:
• physical environment
• cultural environment
• social environment
• historical environment
• …..

Learning and knowledge

(Cynefin diagram, as above)

Knowledge Management and Nonaka’s SECI

(SECI cycle diagram: Socialisation, Externalisation, Combination and Internalisation, linking tacit knowledge and explicit knowledge)

The practice of Science and research

Sense-making and articulation is as important to Science and research

Cynefin and Knowledge Management

(Cynefin diagram, as above, annotated:)
• Tacit Knowledge: judgement/expertise
• Explicit Knowledge: e.g. scientific models

Applications of Cynefin

• Emergency Management
• Categorisation of DSS and OR/DA techniques
• Human Reliability Analysis
• High Reliability Organisations
• Knowledge Management
• Sensemaking
• Research Methodology

S. French (2012) ‘Cynefin, Statistics and Decision Analysis’. Journal of the Operational Research Society. In press http://www.palgrave-journals.com/jors/journal/vaop/ncurrent/full/jors201223a.html

Cynefin: learning, repeatability

(Cynefin diagram, as above, annotated: repeatability and increasing familiarity)

Cynefin and data collection

(Cynefin diagram, as above, annotated:)
• Experiments and trials
• Case studies, interviews, and surveys

Cynefin and statistics

(Cynefin diagram, as above, annotated:)
• Repeatable events
• Unique events
• Events?
• Estimation and confirmatory analysis
• Exploratory analyses

Cynefin and statistics

(Cynefin diagram, as above, annotated: repeatable events; unique events; events?)

Actually you need exploratory statistics here to check that you really are in the known or knowable space

Exploratory analyses

• Look at the data
  – In any, repeat any, analysis, look at the data
  – It is too easy for data to pass from web questionnaire to Excel to SPSS to analysis without your looking at the data.
• Simple plots and tables
  – Tables: do not think them ‘simple’ to construct!
  – Histograms, boxplots, scatterplots, …
• Useful in presenting results too
• Generally easy to produce with Excel or SPSS
  – If you know what you are trying to achieve
  – Data mining and data visualisation
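As a minimal sketch of "looking at the data" before any analysis, the following uses only the Python standard library. The scores and the helper names `five_number_summary` and `text_histogram` are illustrative assumptions, not part of the talk; in practice a plotting package would normally draw the boxplot and histogram.

```python
import statistics

def five_number_summary(data):
    """Return (min, lower quartile, median, upper quartile, max): the numbers behind a boxplot."""
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, q2, q3, max(data)

def text_histogram(data, bin_width):
    """Return a crude text histogram: one line per non-empty bin, one '*' per observation."""
    counts = {}
    for x in data:
        start = (x // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    lines = []
    for start in sorted(counts):
        lines.append(f"{start:>6} to {start + bin_width:<6} {'*' * counts[start]}")
    return lines

# Hypothetical questionnaire scores; 250 may well be a typing error for 25.
scores = [21, 25, 19, 30, 27, 250, 22, 24, 26, 23]

print(five_number_summary(scores))
print("\n".join(text_histogram(scores, 10)))
```

Even this crude output makes the stray value of 250 obvious, which a mean computed blindly in a spreadsheet would not.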

Estimation and Confirmatory Analyses

• Based on statistical models
  – “If your experiment needs statistics you should have done a better experiment”: WRONG!
• Estimation
  – Point estimates
  – Confidence intervals
• Hypothesis tests

Data collection protocol

• You need one!!!
• Formal theory of experimental design
  – How many and which data to collect
  – Mix of theoretical requirements for accuracy and pragmatism
• But wider than that you need to plan in advance many things about how you will gather your data, be it qualitative or quantitative.
• It is vital that you record your planning and your reasoning.
  – You will not remember when you come to write your thesis/paper
• You also need a data storage protocol
  – Keep original data not summaries if you can
    • Sufficient statistics are for the theoreticians
  – Keep a geographically separate copy for security purposes
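One textbook instance of "how many data to collect" is sketched below. It assumes (these are assumptions, not statements from the talk) independent normal data with a known standard deviation `sigma`, and asks for a confidence interval for the mean of half-width at most `half_width`, which gives n ≥ (z·sigma/half_width)². The function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, half_width, confidence=0.95):
    """Smallest n for which the interval mean +/- z*sigma/sqrt(n) has the requested half-width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    return math.ceil((z * sigma / half_width) ** 2)

# Assumed values: sd of 10 units, mean wanted to within +/- 2 units
print(sample_size_for_mean(sigma=10.0, half_width=2.0))  # 97
```

The theoretical answer is only a starting point; the pragmatic side (cost, access, attrition) still has to be weighed and recorded in the protocol.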

Check assumptions

• Independence. Usual to assume that the data points are sampled independently so that
  – x1, x2, …, xn are independent and identically distributed (iid)
• Think about distributional assumptions
  – Parameters known?
  – Normal??? Maybe as approximation but check!
  – Do not make assumptions on the grounds that the text book gives a statistical test for those assumptions
• Ideally repeat analysis under different assumptions
  – Sensitivity analysis
• Outliers
  – Some recommend removing data that is ‘clearly an outlier’
  – My view: a bad scientist blames his data, so discard data at your peril
  – If you must remove outliers, document reasons and make sure they are good.
• If you cannot see the result in the data (simple plots) and/or it does not make qualitative sense, question it!
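A small sensitivity analysis along these lines might look as follows. This is only a sketch: the measurements are invented, the suspect value 9.8 is an assumption, and the three estimators are simply convenient examples of "different assumptions".

```python
import statistics

def location_estimates(data):
    """Several estimates of the typical value, each resting on different assumptions."""
    ordered = sorted(data)
    trimmed = ordered[1:-1]  # drop the single smallest and single largest value
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "trimmed mean": statistics.mean(trimmed),
    }

# Hypothetical measurements; 9.8 is the point someone might call 'clearly an outlier'.
measurements = [4.9, 5.1, 5.0, 5.3, 4.8, 9.8, 5.2]
with_suspect = location_estimates(measurements)
without_suspect = location_estimates([x for x in measurements if x != 9.8])

for name in with_suspect:
    print(f"{name:>12}: {with_suspect[name]:.2f} with, {without_suspect[name]:.2f} without")
```

If the conclusion changes materially depending on whether the suspect point is kept, that sensitivity belongs in the write-up alongside the reasons for any removal.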

Value focused thinking

• “Values are what we care about. As such, values should be the driving force for our decision making. They should be the basis for the time and effort we spend thinking about decisions. But this is not the way it is. It is not even close to the way it is.”
  Keeney (1992)
• Define objectives, research questions, hypotheses at outset
  – (probably modify pragmatically as research progresses!)
  – More creative in research design
• Focuses attention on what matters
• Helps identify the ‘right’ research/problem solving methodology

Note: whether we talk of objectives, research questions, hypotheses depends on type of research project

Thank you

Back up Slides


Tables and Charts

• Clarify in titles and notes
  – What the data are and where they come from
  – Units
• 2 or 3 ideas can be shown/explored in a table or chart … no more
  – Do not make over ‘busy’
    • x’s not dustbins for data on waste!
  – Do not introduce spurious features
    • E.g. number the data and accidentally introduce a ranking
• Watch for cognitive aspects
  – Appropriate scales
  – Appropriate number of significant figures
  – In tables: put important variation down the columns
  – Use of colour
    • red-green bad (‘stop’) and good (‘go’) or just colour blind

Regression and Factor Analysis as exploratory analyses

• Often (usually!!!) data is multi-dimensional
• It is difficult to see the key trends and variations by eye
• Regression and factor analyses reduce dimensions to the ‘significant’ ones

Regression Analysis

(Scatter plot: a cloud of data points, each marked x)

• Describe the cloud of data points
• 16 (x,y) points = 32 numbers

Regression Analysis

(Scatter plot: the same cloud of data points with a fitted line)

• Describe the cloud of data points
• Regression line: y = mx + c
• Plus standard deviation
• 3 numbers …
• Trend, base case, and spread
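The reduction from 32 numbers to 3 can be sketched with an ordinary least squares fit. The sixteen points below are invented for illustration, and `fit_line` is a hypothetical helper name; the spread is taken here to be the residual standard deviation, which is one reasonable reading of "plus standard deviation".

```python
import math

def fit_line(points):
    """Ordinary least squares fit of y = m*x + c, plus the residual standard deviation."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    c = mean_y - m * mean_x
    residuals = [y - (m * x + c) for x, y in points]
    # n - 2 degrees of freedom because two parameters (m and c) were estimated
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return m, c, sd

# Sixteen hypothetical (x, y) points scattered about the line y = 2x + 1
points = [(x, 2 * x + 1 + noise) for x, noise in zip(range(16), [0.5, -0.5] * 8)]
m, c, sd = fit_line(points)
print(f"trend m = {m:.3f}, base case c = {c:.3f}, spread sd = {sd:.3f}")
```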

Factor Analysis

(Scatter plot: a cloud of data points, each marked x)

• Describe the cloud of data points
• 16 (x,y) points = 32 numbers

Factor Analysis

(Scatter plot: the same cloud of data points, each projected onto the fitted line)

• Describe the cloud of data points
• Project each point onto regression line
• 16 numbers
• Keeps each item separate in summary
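The geometric picture on this slide, projecting each point onto a line so that every item keeps its own one-dimensional score, can be sketched as below. This illustrates only the projection step; it is not a full factor analysis, and the line parameters `m` and `c` and the sample points are assumed values.

```python
import math

def project_onto_line(points, m, c):
    """Return each point's signed position along the line y = m*x + c.

    Position is measured from the point (0, c) in the direction of increasing x.
    """
    length = math.hypot(1.0, m)
    ux, uy = 1.0 / length, m / length  # unit vector along the line
    return [(x * ux) + ((y - c) * uy) for x, y in points]

# Hypothetical points near the assumed line y = 2x + 1
points = [(0, 1.0), (1, 3.5), (2, 4.5), (3, 7.0)]
scores = project_onto_line(points, m=2.0, c=1.0)
print([round(s, 3) for s in scores])
```

Each (x, y) pair is replaced by one number, so the summary halves the data while still distinguishing the individual items.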

Regression and Factor Analysis

• Here we have reduced 16 points in 2 dimensions onto 1 dimension (a line)

• Generally reduce a lot of points in high dimension onto many fewer dimensions

• More general methods known as multivariate analysis
  – Regression analysis
  – ANOVA
  – Factor Analysis
  – Principal components
  – Multi-dimensional scaling
  – ….

Ordinal and Interval Data

• Sometimes data only contains ranking information
  – Such data is called ordinal data
• Other times the data is measured against a scale with an origin and a unit
  – Such data is called interval data (or cardinal)
• Most of the methods of multivariate analysis assume interval data
  – But they work with ordinal data if you take them with a pinch of salt! (and do not believe or quote significance levels, etc.)
  – Read the assumptions behind the methods when using SPSS or similar.

Estimation

Try to find a function of the data that is tightly distributed about the quantity of interest.

(Figure, two panels. Distribution of data: data point, quantity of interest. Distribution of mean: data mean, quantity of interest.)
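The contrast between the two distributions can be checked by simulation. The sketch below assumes normally distributed data with an invented true value and standard deviation; the function name and all parameter values are illustrative.

```python
import random
import statistics

def spread_of_points_and_means(true_value, sd, n, n_datasets, seed=0):
    """Simulate many data sets; return the spread of one data point and of the data mean."""
    rng = random.Random(seed)
    single_points = []
    means = []
    for _ in range(n_datasets):
        data = [rng.gauss(true_value, sd) for _ in range(n)]
        single_points.append(data[0])        # one observation from each data set
        means.append(statistics.mean(data))  # the mean of each data set
    return statistics.stdev(single_points), statistics.stdev(means)

sd_point, sd_mean = spread_of_points_and_means(true_value=10.0, sd=2.0, n=25, n_datasets=2000)
print(f"spread of a single data point: {sd_point:.2f}")
print(f"spread of the data mean:       {sd_mean:.2f}")
```

With 25 observations per data set the mean should be roughly five times more tightly distributed about the quantity of interest than a single observation, since its standard deviation is sd/sqrt(n).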

Confidence intervals

• Intervals defined from the data
• 95% confidence intervals: calculate the interval for each of 100 data sets and about 95 will contain the quantity of interest.
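The "about 95 in 100" interpretation can be illustrated by simulation. This sketch assumes normal data with a known standard deviation, so that the interval is the sample mean plus or minus z·sigma/sqrt(n); the function name and the numerical settings are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, mean

def coverage_count(true_mean, sigma, n, n_datasets, confidence=0.95, seed=1):
    """Count how many intervals mean +/- z*sigma/sqrt(n) contain true_mean (sigma assumed known)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_datasets):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1
    return hits

# Number of the 100 simulated intervals that contain the true mean
print(coverage_count(true_mean=50.0, sigma=5.0, n=30, n_datasets=100))
```

The count varies from run to run of the experiment (here it is fixed only by the seed); it is the long-run proportion that is 95%, not the outcome for any single interval.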

Hypothesis testing

Hypothesis test: general
– Compare a null hypothesis H0 and an alternative H1
– Type 1 error: reject H0 when H0 is true
– Type 2 error: do not reject H0 when H1 is true
  • Note: never say you accept a hypothesis! Best phrasing is “there is/is not significant evidence against H0”
– Significance level is the probability of a type 1 error
– Power is the probability of rejecting H0 when H1 is true, i.e. 1 minus the probability of a type 2 error
– Conventionally the significance level is set at 5% (significant) or 1% (highly significant)
– Define g(x) and a critical region such that the probability that g(x) lies in the region is less than the significance level if H0 is true
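One concrete choice of g(x) and critical region is the two-sided z-test, sketched here under the assumption of iid normal data with a known standard deviation. The data values, `mu0` and `sigma` are invented, and the function name is hypothetical.

```python
import math
from statistics import NormalDist, mean

def z_test(data, mu0, sigma, alpha=0.05):
    """Two-sided test of H0: mean = mu0, assuming iid normal data with known sigma.

    g(x) is the standardised sample mean; the critical region is |g(x)| > z_crit.
    """
    g = (mean(data) - mu0) / (sigma / math.sqrt(len(data)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(g)))
    return g, p_value, abs(g) > z_crit

data = [5.4, 5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6]  # hypothetical measurements
g, p, significant = z_test(data, mu0=5.0, sigma=1.0)
verdict = "significant" if significant else "no significant"
print(f"g(x) = {g:.2f}, p = {p:.4f}: {verdict} evidence against H0")
```

Note that the verdict is phrased as evidence against H0, never as acceptance of either hypothesis.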

Hypothesis testing

• Note that a 5% significance level means that, on average, 1 in 20 tests of a true H0 will result in a type 1 error and reject it.

• Thus if you perform lots of tests in your research you will necessarily make lots of mistakes!!!!!

• There are theories of multiple testing to help avoid misinterpretation in such cases
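The arithmetic behind this warning is short. The sketch below assumes the tests are independent and every null hypothesis is true, and shows the Bonferroni adjustment as one simple example of a multiple testing correction (the talk does not name a specific method).

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one type 1 error in n_tests independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_level(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

for k in (1, 5, 20, 100):
    print(k, round(familywise_error_rate(0.05, k), 3), bonferroni_level(0.05, k))
```

With 20 tests at the 5% level the chance of at least one false rejection is already about 64%.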

Meta-Analysis

• Often there are several related studies in the literature
  – Datasets collected under ‘similar’ conditions
  – Analysis of ‘similar’ research questions
• How do we combine their results and conclusions?
• Key point: literature bias
  – Insignificant results not published
  – Some authors cited more often and easier to find than others
• Assumptions of analysis often not fully clear
  – Data collection procedure
  – Outliers? Raw data or outliers discarded?

Meta Analysis: key points

• Plan it and define a protocol before beginning
  – Just as you would define any other data collection procedure
• Define criteria for inclusion of studies a priori and use these to guide a deep and detailed literature search.
• Plot the different data sets on the same scales and ‘eyeball’ them
  – Explore these data just as you would an experimental data set
• No ‘right’ method for combining analyses so try several if possible and look for common conclusions (or explain differences in terms of different assumptions)
• Check sensitivity and robustness of your combined conclusion as in any other analysis
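In the spirit of trying several combination methods, the sketch below compares two common ones: a plain unweighted mean of the study estimates and a fixed-effect inverse-variance pool. Both are offered as examples rather than as the recommended approach, and the three study estimates and standard errors are invented.

```python
import math

def unweighted_mean(estimates):
    """Treat every study as equally informative."""
    return sum(estimates) / len(estimates)

def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect pooled estimate and its standard error, weighting each study by 1/se^2."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

# Hypothetical effect estimates and standard errors from three studies
estimates = [0.30, 0.10, 0.25]
std_errors = [0.10, 0.20, 0.05]

print(f"unweighted mean:       {unweighted_mean(estimates):.3f}")
pooled, se = inverse_variance_pool(estimates, std_errors)
print(f"inverse-variance pool: {pooled:.3f} (se {se:.3f})")
```

If the two answers point the same way, the combined conclusion is more robust; if they differ, the difference traces back to how much weight the imprecise study is given.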