
Original Article

Using Signal Processing Diagnostics to Improve Public Sector Evaluations

Mark Matthews*

Abstract

False positive test results that overstate intervention impacts can distort and constrain the capability to learn and adapt in governance, and are therefore best avoided. This article considers the benefits of using the Bayesian techniques used in signal processing and machine learning to identify cases of these false positive test results in public sector evaluations. These approaches are increasingly used in medical diagnosis—a context in which (like public policy) avoiding false positive and false negative test results in the evidence base is very important. The findings from a UK National Audit Office review of evaluation quality are used to illustrate how a Bayesian diagnostic framework for use in public sector evaluations could be developed.

Key words: evaluation, Bayesian, governance, capacity-building, signal processing

1. Introduction

This article addresses the public policy capacity-building challenges faced in Asia and the Pacific and suggests one new approach that may be useful to practitioners. Asia and the Pacific is a region of growing dynamism in public policy as national governments seek to adapt to rapid rates of economic growth and resulting social changes and environmental challenges. These major transformations focus attention on the capacity of governments to learn and adapt effectively under conditions of such rapid change.

At a global scale, one of the assumed differences between governance in developed and developing countries is the demonstrated capacity to abandon policy interventions that are not working well. Indeed, in diagnostic terms, failures to learn and adjust in policy implementation are a key indicator of low capability levels. Efforts to foster this adaptive capacity in governance are driven by a mix of political will and the technical capability of governments to monitor, evaluate and learn from past and current interventions. It therefore makes sense for governments in Asia and the Pacific to consider the extent to which their capacity to learn and adapt is as good as it could be—and how this capacity could be improved. This requires diagnostic tools that are ‘fit for purpose’ and the ability to use such tools effectively within the public sector.

In the countries of the Organisation for Economic Cooperation and Development (OECD), the formal monitoring and evaluation

* Australian Centre for Biosecurity and Environmental Economics, Crawford School of Public Policy, The Australian National University, Australian Capital Territory 0200, Australia; email [email protected].


Asia & the Pacific Policy Studies, vol. ••, no. ••, pp. ••–••
doi: 10.1002/app5.110

© 2015 The Author. Asia and the Pacific Policy Studies published by Crawford School of Public Policy at The Australian National University and Wiley Publishing Asia Pty Ltd. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.


(M&E) frameworks that are, in theory, a key driver of policy learning can be excessively ‘administrative’ and focused on compliance with funding contracts and/or standards and guidelines rather than on maximising opportunities to learn-by-doing in policy implementation.1 Indeed, there is currently a notable disconnect between the factors that drive policy forward and the administrative practices and procedures that are used to manage and learn how to deliver policy more effectively.

This disconnect means that the ways in which OECD governments handle evaluation (and use evidence more generally) in the context of their broader public sector reform agendas are not necessarily a capability that Asia and the Pacific nations are wise to emulate. It would be wiser to explore capacity-building approaches with the potential to jump beyond current ‘best practices’. A focus on using Bayesian inference, framed in signal processing terms, is one potential approach to articulating this sort of capability ‘leapfrog’ strategy. As such, the approach outlined in this article could contribute to broader efforts to strengthen ‘developmental evaluation’—a stance that facilitates learning and evolution in an uncertain and changing world (Patton 2010).

2. Defining the Problem

Arguably, one reason for the disconnect between the factors that drive policy forward and the administrative practices and procedures that are used to manage the delivery of policy is that policy-makers do not currently have access to a generic and comprehensive analytical model of the policy learning and adaptation process specifically designed to facilitate capacity-building via a focus on identifying inaccurate test results. While there are idealised models of the policy formulation and delivery cycle, most practitioners recognise that these are very much ‘ideal types’ that guide and inform real practices—but may not directly assist with those real practices.2

The reasons for this are a combination of political expediency, urgency in responding to unexpected surprises and other factors. The result is that these models of the policy process, like the M&E function that appears in different guises as a stage in this cycle, tend to play a ritualistic and referential role at some distance from actual practice.

Arguably, however, the complexity and requirements for detail that practitioners tend to encounter when attempting to use M&E frameworks are an impediment to effective policy learning. M&E is a specialist activity with its own distinctive jargon and methods (log frames, theories of change etc.). This complexity can be even harder to cope with in situations in which technically sophisticated statistical analyses are used (including econometric studies). The methodologies applied can themselves be a source of ambiguity and confusion to officials without specific technical expertise and experience.

Current approaches tend to result in long and complex lists of evaluation parameters that make it hard to take a top-down view that gets to the heart of what has really happened. It is not unusual for key conclusions from evaluations to be watered down for political expediency simply by editing reports and via subtle changes in wording. This aspect can increase cynicism among generalists in government, and as a result get in the way of identifying key lessons for moving forward. This is especially the case when an M&E framework is excessively focused on complying with a range of specific contractual performance indicators—leading to the familiar problem of ‘managing to contract’ vis-à-vis ‘managing for results’.

These characteristics tend to result in a de-coupling of the pragmatic (and often chaotic) process of doing real public policy and the frameworks that are taught to new government officials but only used ideally and/or when people have the chance. This de-coupling is not helpful for M&E activities because it limits the feedback from non-specialist policy officers—a feedback loop that, if configured appropriately, has the potential to force a simplification and standardisation of methodologies via learning-by-doing. By implication, a standardised and simplified approach that avoids keeping M&E methods in a silo and ‘mainstreams’ general learning and adaptation in the policy process could be of benefit to the practice of public policy (see Matthews 2015).

1. Sabel and Zeitlin (2012) consider the potential of ‘experimentalist’ governance as an explicit approach to learning-by-doing that allows for very general outcomes to be set at the outset of an intervention followed by refinement via implementation.

2. See Althaus et al. (2007) for a discussion of the policy cycle.

The purpose of this article is to propose a possible solution to the challenge of developing an evaluation diagnostic able to cope better with the real policy conditions of sparse, ambiguous and possibly confusing information. To this end, methodological insights from signal processing and machine learning based on adopting a binary (true or false) characterisation of hypothesis-based evaluation findings are explored. This binary framework (widely used in clinical diagnosis) is framed in terms of the estimated likelihoods of obtaining false positive and false negative test results in evaluation studies—based on the combination of specific evaluation test results and what is known about the more general prevalence of test inaccuracies. The suggestion is that this approach could also, potentially, simplify and improve the utility of M&E frameworks by integrating that specific activity more fully into the broader policy learning process—including linking more effectively with the challenge of how uncertainty and risk are managed (uncertainties and risks can stem from the use of inaccurate test results).

3. Policy Learning and Information Theory

In an uncertain and risky world, governments should seek to maximise the decision-support information they have access to. If they do not, then there is a risk that a better decision could have been made—given what is currently understood and assumed to be the case. While information will always be imperfect, there are, of course, degrees of imperfection that can have a major impact on policy decisions.

In this context, it is useful to consider the implications of Shannon’s seminal work on the mathematics of information (Shannon 1948).3 Shannon distinguished between information and uncertainty (defined as entropy) and expressed the value of new information that might be received in terms of the assumed likelihood of an event happening and being observed. In that framework, which has been incredibly useful in information technology,4 the less likely an event is assumed to be, the greater the information gain if it is observed. Both Claude Shannon and Alan Turing drew on Bayesian thinking in their WW2 code-breaking work (see McGrane 2011).5 Shannon’s method for calculating information entropy is effectively a measure of the potential for surprise when analysing information streams. In a machine learning context, when new information is a surprise (unexpected), then this can be used to trigger adaptive responses in models of reality aimed at reducing that potential for surprise in the future. The parallels with desirable characteristics in public sector evaluation are clear. Turing used betting odds to allocate the next day’s code-breaking efforts, in what amounted to a Bayesian approach, because this had mathematical advantages over conditional probability formulations (Good 1979; Gillies 1990).
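Shannon’s link between likelihood and information gain can be sketched in a few lines of Python. The probabilities below are purely illustrative, and the functions simply restate the standard definitions of surprisal and entropy rather than anything specific to this article.

```python
import math

def surprisal(p: float) -> float:
    """Information gained (in bits) from observing an event of probability p.
    Rarer events carry more information: -log2(p) grows as p shrinks."""
    return -math.log2(p)

def entropy(probs: list[float]) -> float:
    """Shannon entropy: the expected surprisal over a distribution,
    i.e. the average potential for surprise in an information stream."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# An unlikely event yields more information when it is actually observed.
print(surprisal(0.5))   # 1.0 bit
print(surprisal(0.01))  # ~6.64 bits

# A uniform (maximally uncertain) source has maximal entropy; a source that
# almost always emits the same symbol offers little potential for surprise.
print(entropy([0.25] * 4))                # 2.0 bits
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits
```

In the adaptive-governance reading of this, a high-surprisal observation is exactly the signal that should trigger a revision of the model that generated the expectations.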

Information theory can be applied to missed opportunities to learn in public policy. From this perspective, governments should aim to minimise the extent to which the assumptions they hold over the range of wanted and unwanted events (i.e. risks and opportunities) are distorted by failures to use all available information to calculate the odds they are working with. This is not to argue that this available information will always be adequate from the perspective of idealised models and frameworks (not least the neoclassical economic thinking that tends to pervade governance): it will always be limited and often subjective. But it is to argue that ignoring or overlooking available information can result in a mix of lost opportunities to achieve wanted outcomes and misjudged risks of avoiding unwanted outcomes—the calculated odds diverge from the data upon which these odds should be based. In information theory terms, governments are wise to seek to avoid situations in which they are surprised by events simply because information that would have led them to calculate relevant odds was ignored.

3. A non-technical exposition of Shannon’s seminal contribution to the development of the mathematics of information can be found in Pierce (1961).

4. The mathematics of information provides the basis for cryptography, data compression, communication error detection and many other aspects of information and communication technology (ICT)—plus, most recently, the machine learning algorithms used in artificial intelligence (AI).

5. A useful review of McGrane’s book can be obtained online in Dale (2012).

4. The Current Situation

We use the United Kingdom’s experience as an illustration of these generic challenges because there is an established (and influential) history of promoting evidence-based policy-making, matched with significant attention paid to the methodological challenges involved in actually delivering this approach in practice.

The British Government currently uses the ROAMEF definition of the policy learning cycle, illustrated in Figure 1. This comprises the following distinct functions: rationale, objectives, appraisal, monitoring, evaluation and feedback (HM Treasury 2011).

That is the theory. Actual practice is rather different. In 2013 the UK National Audit Office (NAO) reported on a major assessment of the adequacy of evaluations of UK government programs. The findings were striking in what they revealed about what is not being done to assist policy learning via robust evaluations. While the UK NAO findings had a noticeable impact on public sector evaluation practices in the United Kingdom (leading to greater attention being paid to using more robust methodologies), these findings, and their implications, are of more general relevance outside of the United Kingdom as regards highlighting problems and framing where to search for solutions.

The NAO reviewed nearly 6,000 analytical and research documents published between 2006 and 2012 on 17 main government department websites. They found that only 305 of these were impact evaluations, and that of these 305 only 70 made an assessment of cost-effectiveness. The NAO were able to identify £12.3 billion of overall program expenditure (in cash terms) evaluated by 41 of those evaluations (NAO 2013). Furthermore, the NAO found that ‘only 14 evaluations were of a sufficient standard to give confidence in the effects attributed to policy because they had a robust counterfactual’ (NAO 2013, p. 7).

The main message to emerge from the NAO’s assessment was that the evaluation function is not delivering what it should deliver—robust conclusions on what is working well and what is working less well, or even badly—and why this is the case.

One noteworthy aspect of UK experience is the strong emphasis now placed on approaches to evidence-based policy-making (like the concept itself) that originated in health and medicine. This is resulting in the growing use of randomised control trials (RCTs) and the promotion of a hierarchy of evidence. This hierarchical approach is explicitly reflected in the NAO’s use of the Maryland Scientific Methods Scale; see Madelano and Waights (2015) for a discussion of this scale in an evaluation context. This framework is expressed graphically (by the author) in Figure 2 and detailed in Table 1, and is based upon identifying five categories of scientific evidence—the most powerful of which are RCT studies.

Figure 1 Diagrammatic Representation of the UK ROAMEF Policy Learning Framework

Source: Author’s adaptation from HM Treasury (2011).

The Maryland scale has been used in the United Kingdom to assess the quality of evidence used in public policy. Of 33 published evaluations in four policy areas examined by the NAO (spatial policy, labour markets, business support and education), three (8.8 per cent) conformed to Tier 5 quality evidence, eight (23.5 per cent) conformed to Tier 4, and three (8.8 per cent) to Tier 3. Thirteen conformed to Tier 2 (38.2 per cent) and six to Tier 1 (5.8 per cent). In other words, 44 per cent fell within the ‘weaker/riskier research designs’ category.

In the NAO’s view, this skewed diversity in evaluation quality is a matter of concern because it indicates that UK government departments should make a greater effort to conduct evaluations using high-confidence methods (they noted that weaker methods tended to be associated with more positive conclusions).

Two observations can be made about this NAO assessment. First, the growing emphasis on drawing attention to the robustness of evaluation methods is a significant development—especially as regards capacity-building. If policy is to be informed by robust evidence, then it will take a significant effort (and possibly cost) to achieve high evidence quality thresholds. Second, the type of approach being put in place in the United Kingdom may be effective in retrospective assessments and in driving robust (RCT style) assessments for future use—but is poorly positioned to assist policy-makers forced to make decisions (as is often the case) when available information is sparse and/or ambiguous and therefore confusing.

While systematic reviews of multiple evaluations in public policy could provide the robust pattern-based evidence required, such comprehensive assessments are rarely carried out. In part, this may be because practitioners lack a compelling and robust analytical framework for doing systematic reviews of evaluations. In any case, Maryland scale-based approaches are of limited relevance to fast-moving and high-uncertainty challenges (e.g. crisis management)—a particular issue given government’s distinctive role as uncertainty and risk manager ‘of last resort’ (i.e. handling the uncertainties and risks that markets, businesses and civil society cannot cope with).6

Figure 2 Diagrammatic Illustration of the Maryland Scientific Methods Scale

Source: Author’s interpretation of the Maryland scale.

Consequently, it makes sense to investigate whether an explicitly Bayesian ‘risk aware’ approach based on testing competing hypotheses might provide a solution to the challenge of developing an approach to evaluation that provides a clearer picture of success and failure, and also incorporates uncertainty and risk into that assessment.

5. Public Policy as a Bayesian Hypothesis Testing Process

Conceptually, any government policy intervention can be treated as a hypothesis being tested through learning-by-doing, whether this experimental aspect is implicit or explicit (e.g. as a pilot project). From this perspective, all policies are in effect hypotheses because actual outcomes are uncertain (for a range of reasons), and they are therefore ‘tested by experience’ (see Matthews 2016, forthcoming). This hypothesis-testing approach aligns well with the ‘risk-aware’ perspective that views public policy as a matter of investing in improving the odds of obtaining outcomes that we want and reducing the odds of outcomes that we do not want. This focus on hypotheses as an articulation of policy stances also highlights the importance of creativity in the policy process—the source of useful ideas that are then framed as hypotheses to be tested via experimentation and implementation.

6. See Matthews and Kompas (2015) for a discussion of the use of simplified Bayesian analysis in public sector risk management.

Table 1 Definitions in the Maryland Scale

Strong research designs in the measurement of attribution

Level 5—Random assignment and analysis of comparable units to program and comparison groups.
Government guidance (QIPE): Random allocation/experimental design. Individuals or groups are randomly assigned to either the policy intervention or non-intervention (control) group and the outcomes of interest are compared. There are many methods of randomisation, from field experiments to randomised control trials.

Level 4—Use of statistical techniques to ensure that the program and comparison group were similar, so that a fair comparison can be made.
Government guidance (QIPE): Quasi-experimental designs. Intervention group vs well-matched counterfactual. Outcomes of interest are compared between the intervention group and a comparison group directly matched to the intervention group on factors known to be relevant to the outcome.

Level 3—Comparison between two or more comparable groups/areas, one which receives the intervention and one which does not.
Government guidance (QIPE): Strong difference-in-difference design. Before and after study which compares two groups where there is strong evidence that outcomes for the groups have historically moved in parallel over time.

Weaker/riskier research designs in the measurement of attribution

Level 2—Evaluation compares outcomes before and after an intervention, or makes a comparison of outcomes between groups or areas that are not matched.
Government guidance (QIPE): Intervention vs unmatched comparison group. Outcomes compared between the intervention group and a comparison group.

Level 1—Evaluation assesses outcomes after an intervention, but only for those affected; no comparison groups used.
Government guidance (QIPE): Predicted vs actual—outcomes of interest for people or areas affected by policy are monitored and compared with expected or predicted outcomes. No comparison group—a relationship is identified between intervention and outcome measures in the intervention group alone.

Source: Maryland Scientific Methods Scale.

It follows that the greatest clarity possible in designing policy is to express it in terms of hypotheses—and to use data gained from practical experience to test those hypotheses. The most robust approach analytically is to set up competing hypotheses because this reduces the risk of misdiagnosis (competing hypotheses should be eliminated via analytical progress).7

If data could have been used to update calculations of the relative odds that different competing hypotheses are correct but were not used, then those odds are subject to unnecessary bias. Similarly, if data are analysed in a manner that results in a false positive or false negative test result, then the decisions made on the basis of biased odds in hypothesis tests are less robust than decisions made on the basis of unbiased odds. The resulting bias in the odds of competing hypotheses being true impacts upon the more general ability of governments to find ways of improving the odds of wanted outcomes and reducing the odds of unwanted outcomes.

The existence of substantive uncertainty, in itself, is inevitable and can be factored into calculated odds in hypothesis tests using Bayesian methods (increased uncertainty simply decreases the odds that a hypothesis is true). The principle in Bayesian analysis (upon which information theory is based) is that we are able to update the assumed odds of something happening when new information is obtained. The new information may confirm that the initially assumed odds should be retained, may lead us to revise these odds, or, importantly, may lead us to develop updated hypotheses that are more likely to explain the data.
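This odds-updating principle can be sketched in a few lines. The prior odds and likelihood ratios below are invented for illustration, and the helper functions are not from any standard library; the sketch simply applies Bayes rule in its odds form, which is close in spirit to the Turing-style betting-odds approach mentioned earlier.

```python
def update_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes rule in odds form: posterior odds = prior odds * likelihood ratio.
    The likelihood ratio is P(evidence | H1) / P(evidence | H2)."""
    return prior_odds * likelihood_ratio

def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of H1 into a probability."""
    return odds / (1.0 + odds)

# Two competing hypotheses about an intervention:
# H1: 'the intervention works' vs H2: 'it does not'.
# Start agnostic: even (1:1) prior odds.
odds = 1.0

# Each piece of monitoring evidence carries a likelihood ratio:
# >1 favours H1, <1 favours H2. The values below are purely illustrative.
for lr in [3.0, 0.5, 2.0]:
    odds = update_odds(odds, lr)

print(odds)                       # 3.0 (3:1 in favour of H1)
print(odds_to_probability(odds))  # 0.75
```

Note that an uninformative piece of evidence (likelihood ratio of 1) leaves the odds unchanged, which matches the point in the text that new information may simply confirm the initially assumed odds.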

From this perspective, governments’ M&E activities will have the greatest utility when the information obtained as a result of experience in implementing a policy intervention is related back to an initial (uncertainty and risk based) assumption of the odds of success assumed for the intervention and expressed in testable hypotheses. If M&E measures are not based on an explicit recognition of uncertainty and risk (i.e. the odds of success are not made explicit at the outset), then it is unlikely that useful learning relating to uncertainties and risks will be captured and used even if useful learning takes place. This is because in many current approaches, uncertainty and risk are marginalised rather than central to the analysis: these key factors are treated as impediments to achieving well-defined objectives rather than being themselves the focus of investment in achieving beneficial change (as is evident in standards such as ISO 31000: 2009).8

6. Using Natural Frequencies to Simplify the Use of Bayes Rule9

The broad principle behind Bayesian inference (that the odds of different hypotheses being true can be updated when new information is received) is easily grasped. However, the expression of this test result update via the usual approach of conditional probabilities is needlessly complex. As Gigerenzer (2002) has stressed, if we need to calculate the impact of false positive and false negative test results (which is especially important in areas like clinical diagnosis and also, as stressed here, in public policy), then it is both simpler and more intuitive to use ‘natural frequencies’. These are the raw counts of observations that reflect scale (how large the numbers of observations are relative to each other) and are not normalised as probabilities.10

Highly trained professionals involved in diagnostic activities, including experienced clinicians, can make basic statistical errors when conditional probabilities are used separately from these raw data mappings. For many people, probabilities are not intuitive; human cognition works on a different (more relational) basis better suited to natural frequencies. Most commonly, these errors result in overestimating the likelihood that a positive test result means that a disease (or other condition) is present. It is also evident from experimental work carried out by Gigerenzer (2002) that standalone probability coefficients as currently used in clinical diagnoses and intervention risk management decision-making can result in avoidable errors (sometimes with dire consequences for health and well-being).

7. This is a key aspect of clinical diagnosis that is more rarely found in public policy—where there can be a leap to particular assumed diagnoses and solutions without this process of elimination.

8. In ISO 31000: 2009, the ways in which more or less demanding objectives can be set according to uncertainty and risk management capabilities are not a major focus of attention. In reality, an organisation devotes great attention to setting objectives on the basis of assumed capability levels (this is a familiar feature of industrial innovation).

9. This section draws upon the discussion of the natural frequency-based implementation of Bayes rule in Matthews and Kompas (2015).

10. As noted earlier, this is mathematically close to Turing’s odds-based approach to code-breaking.

The major advantage of a Bayesian approach is that it considers the broader prevalence of a given condition and the accuracy of the test(s) used when calculating the likelihood that a particular positive test result actually means that the condition may be present.

To use a clinical example, if data available to date indicate that 30 out of every 10,000 people have a particular disease, then 9,970 do not have that disease. The specific likelihood that a positive test result for one person means that the disease is present must weigh up both the likelihood of a true positive test result and the likelihood of a false positive test result. In this case, a positive test result only has a 4.8 per cent likelihood that the disease is actually present.

Following Gigerenzer’s recommendations, these circumstances are best expressed in a ‘frequency tree’—see Figure 3. Expressing data in this raw form communicates prevalence and the relative scale of different possible combinations of true positive, false positive, true negative and false negative test results.
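The frequency-tree arithmetic can be sketched in a few lines of Python. The test sensitivity (50 per cent) and false positive rate (3 per cent) used here are hypothetical figures, not stated in the text; they are chosen so that the natural-frequency counts reproduce the 4.8 per cent result quoted above.

```python
# Natural-frequency sketch of the clinical example above.
# The sensitivity (0.50) and false positive rate (0.03) are
# hypothetical illustrative figures, not taken from the article.

population = 10_000
with_disease = 30                               # prevalence: 30 in 10,000
without_disease = population - with_disease     # 9,970

sensitivity = 0.50                # P(test positive | disease present)
false_positive_rate = 0.03        # P(test positive | disease absent)

# Branches of the frequency tree (raw counts, not probabilities)
true_positives = round(with_disease * sensitivity)              # 15
false_negatives = with_disease - true_positives                 # 15
false_positives = round(without_disease * false_positive_rate)  # 299
true_negatives = without_disease - false_positives              # 9,671

# Of everyone who tests positive, how many actually have the disease?
ppv = true_positives / (true_positives + false_positives)
print(f"Positive predictive value: {ppv:.1%}")
# → Positive predictive value: 4.8%
```

Working with the raw branch counts (15 true positives against 299 false positives) makes it immediately visible why a positive result is still unlikely to indicate disease at this prevalence.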

7. The Application of Signal Processing Methods to Public Policy

A natural frequency-based implementation of Bayes rule lends itself to situations in which the incidence of false positive and false negative test results is important. Conceptually, therefore, it can be useful to adopt a binary perspective (especially for systematic reviews of evaluation results and their implications). A binary approach cuts through the complexity evident in many evaluation studies by asking the key question: to what extent did (or does) the experience gained from implementation support the validity of the hypotheses that defined that intervention (or emerged as a result of that intervention)? Even if an intervention was not explicitly framed as a set of hypotheses to be tested, it is often possible to derive these hypotheses from the ‘theory of change’ and other documented aspects of the intervention design.11 When matters are framed as crisp and succinct hypothesis tests it is logical to focus, as clinicians and ICT specialists do, on the likelihood of a false positive and false negative test result. In a public policy context, this dimension introduces a useful and clear basis for expressing whether evaluation studies contribute to, or detract from, the ability to learn and adapt in governance.

Figure 3 Illustration of a Frequency Tree

Source: Matthews and Kompas (2015) using data from Gigerenzer (2002).

This binary approach can be implemented by classifying test results using the Bayesian framework developed in signal processing via a diagnostic framework as illustrated in Figure 4 (which is sometimes referred to in signal processing and machine learning circles as a ‘confusion matrix’ because it characterises how test results can be ambiguous and therefore restrict learning and adaptation).12

Figure 5 lists some of the key diagnostic metrics that can be derived from the confusion matrix. It is not hard to grasp the point that this signal processing method provides the basis for a more robust and coherent approach to evaluation than tends to be the case at present (outside of RCT tests per se).

This approach, which has its origins in areas like the analysis of radar signals in WW2, started to be used in clinical diagnosis in the 1970s and can be invaluable in bringing rigour to diagnostic assessments.13

In a public policy context, this signal processing implementation of Bayesian inference has some major advantages, especially in regard to the design and use of evaluation methods and also in risk management. In both cases, it is important to strive (via cumulative experience) to minimise the incidence of false positives (in particular)—especially false positives for tests of high positive impacts. We learn best when things do not work as intended, so it is useful to use evaluation methods that maximise the likelihood of obtaining true negative test results for low impact hypotheses.

11. This ‘retro-fitting’ has been demonstrated in experimental work carried out in partnership with a government department—for a brief overview, see Matthews and White (2013).

12. A useful technical discussion of how this simple ‘confusion matrix’ can be used to develop sophisticated approaches in machine learning that are pertinent to public sector evaluation can be found in Powers (2011).

13. One plausible explanation for the uptake of this signal processing methodology in clinical diagnosis is that the growing use of electronic imaging instruments (CT scans etc.), which required stated machine-specific test sensitivity and specificity parameters to be considered when results are interpreted, encouraged the adoption of these Bayesian principles and diagnostic practices.

Figure 4 Signal Processing Matrix

                      Condition assessment
Test result      Yes                         No
Positive         True positive rate (TP)     False positive rate (FP)
Negative         False negative rate (FN)    True negative rate (TN)

This emphasis on identifying potential true positives and false positives can be applied to the NAO assessment of evaluation adequacy discussed earlier. The NAO report classified individual evaluation studies by both the apparent strength of the impact achieved and the robustness of the evaluation method used. The robustness of the evaluation method was assessed using the Maryland Scientific Methods Scale, as noted earlier, a useful measure of the extent to which an evaluation adjusts for different potential biases—and an approach linked to the emphasis on RCTs as the most robust test methodology. The NAO selected 33 evaluations for this closer scrutiny of the relationship between the robustness of the methodology and the evaluation findings. The results are summarised in Table 2.

It is not possible to calculate definitive likelihoods of true and false positives using the NAO’s data without rerunning evaluations that used a low robustness evaluation method with the most robust evaluation method. However, it is possible to come up with some plausible estimates of the potential for generating true and false positive test results in such circumstances if we make the reasonable assumption (justified by the data) that a low robustness score evaluation methodology with a high impact score is more likely to be a false positive than a true positive test result. Such estimates are useful as a means of introducing practitioners within government (and the consultants who assist them) to the potential advantages of adopting a simplified Bayesian articulation of the role of evaluation in the policy learning cycle. More definitive assessments would, of course, require a better developed diagnostic tool than currently exists in the public sector.

Figure 5 List of Diagnostic Metrics Associated with the Confusion Matrix

TP = True positive
FP = False positive
TN = True negative
FN = False negative

(1) Test Sensitivity = TP / (TP + FN)

The proportion of cases in which the condition is present that the test correctly identifies with a positive result.

(2) Test Specificity = TN / (TN + FP)

The proportion of cases in which the condition is absent that the test correctly identifies with a negative result.

(3) Positive Likelihood = Sensitivity / (1 – Specificity)

The ratio of the likelihood of a positive test result given the presence of a disease to the likelihood of a positive test result if the disease is absent.

(4) Negative Likelihood = (1 – Sensitivity) / Specificity

The ratio of the likelihood of a negative test result given the presence of a disease to the likelihood of a negative test result if the disease is absent.

(5) Positive Predictive Value = TP / (TP + FP)

The proportion of positive test results that are true positives; this depends on the wider prevalence of a condition (e.g. disease).

(6) Negative Predictive Value = TN / (TN + FN)

The proportion of negative test results that are true negatives; this depends on the wider prevalence of a condition (e.g. disease).

(7) Accuracy = (TP + TN) / (TP + FP + FN + TN)

The overall accuracy of a diagnostic test.

Source: Adapted from Powers (2011), where information on additional diagnostic metrics based on the confusion matrix can be found.
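The metrics listed in Figure 5 can be computed directly from raw confusion-matrix counts. The sketch below is illustrative: the `diagnostics` function name is an assumption made for this example, and the counts echo the earlier clinical example (30 in 10,000 prevalence) rather than any data from the NAO review.

```python
# A direct implementation of the confusion-matrix metrics in Figure 5.
# Assumes non-degenerate counts (no division by zero).

def diagnostics(tp, fp, tn, fn):
    """Return the seven Figure 5 metrics from raw counts."""
    sensitivity = tp / (tp + fn)          # (1) share of present cases caught
    specificity = tn / (tn + fp)          # (2) share of absent cases cleared
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # (3) how much a positive result raises the odds of the condition
        "positive_likelihood": sensitivity / (1 - specificity),
        # (4) how much a negative result lowers those odds
        "negative_likelihood": (1 - sensitivity) / specificity,
        # (5) share of positive results that are true positives
        "ppv": tp / (tp + fp),
        # (6) share of negative results that are true negatives
        "npv": tn / (tn + fn),
        # (7) overall share of correct test results
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Illustrative counts from the earlier clinical example
m = diagnostics(tp=15, fp=299, tn=9_671, fn=15)
print(f"sensitivity={m['sensitivity']:.2f}  ppv={m['ppv']:.3f}")
# → sensitivity=0.50  ppv=0.048
```

Note how a test can be reasonably sensitive yet still have a very low positive predictive value when the condition is rare, which is the core of the false positive problem discussed in this section.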

What follows is an explanation of the key analytical principles involved. The concluding section of this article sketches out how such a comparative diagnostic framework might be developed in the future via the coordinated efforts of governments.

Table 3 contains four different partitions of the NAO evaluation quality assessment dataset, starting with the most stringent partition as regards evaluation methodology robustness (in which everything below ‘5’ on the Maryland scale is treated as unreliable) but with a more lenient partition as regards the impact dimension. Scenario B is probably the most pragmatic partition for drawing a quick generalised conclusion. Table 4 gives the counts of evaluation studies that result from each of these partitions. For Scenario B, 16 out of 33 (i.e. just under half) of the evaluation studies can be assumed to be at risk of generating false positive test results for a high impact finding. In the most permissive scenario (D) as regards evaluation robustness, 21 out of 33 studies (i.e. just under 64 per cent) are classed as likely to be true positives—but in the low impact partition.
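The Table 4 counts can be reproduced from the Table 2 grid. The text states the partition thresholds only in part, so the (Maryland scale, impact score) thresholds used below are inferred from the published counts and should be read as a reconstruction, not as the NAO's own specification.

```python
# Reconstructing the Table 4 counts from the Table 2 grid.
# counts[impact] gives the number of evaluations at each Maryland
# score 1..5 for that impact rating (4 = high ... 1 = low).
counts = {
    4: [3, 4, 1, 1, 0],
    3: [2, 6, 0, 1, 0],
    2: [0, 2, 1, 6, 2],
    1: [1, 1, 1, 0, 1],
}

# (minimum Maryland score treated as robust, minimum impact score
# treated as 'high'). Inferred from Table 4, not stated in the text.
scenarios = {"A": (5, 2), "B": (4, 3), "C": (3, 4), "D": (2, 4)}

def partition(min_robust, min_high):
    """Return {band: (likely TP, likely FP)} for one scenario."""
    result = {}
    for band in ("High", "Low"):
        impacts = [i for i in counts
                   if (i >= min_high) == (band == "High")]
        # Robust method + this impact band -> likely true positive;
        # non-robust method + this impact band -> likely false positive.
        tp = sum(counts[i][m - 1] for i in impacts
                 for m in range(min_robust, 6))
        fp = sum(counts[i][m - 1] for i in impacts
                 for m in range(1, min_robust))
        result[band] = (tp, fp)
    return result

for name, thresholds in scenarios.items():
    print(f"Scenario {name}:", partition(*thresholds))
# Scenario B gives {'High': (2, 16), 'Low': (9, 6)}, matching Table 4.
```

Under these inferred thresholds the script reproduces all four rows of Table 4, including the 16 likely false positives for high impact findings in Scenario B discussed above.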

These categorisations confirm the impression (noted in the NAO’s own conclusions) that evaluations indicating that high impacts have been achieved are unreliable due to weaknesses in the methodologies used (i.e. classed as likely to be false positives in this binary partition approach). Aside from the 33 selected evaluation studies assessed more thoroughly by the NAO (as analysed above), their more general assessment was that only 14 out of 70 evaluations that considered cost-effectiveness were of sufficient quality to be judged reliable. This means that 56 out of 70 (80 per cent) were judged not to be reliable—or at risk of being false positive test results using the binary characterisation method used here.

Table 2 Summary of NAO Findings on the Relationship between the Robustness of the Evaluation Method and Magnitude of the Impact Achieved

Count of evaluation projects in each class, by methodology rating (1 = low, 5 = high):

                            Methodology rating
                         1     2     3     4     5
Impact     4 (High)      3     4     1     1     0
findings   3             2     6     0     1     0
           2             0     2     1     6     2
           1 (Low)       1     1     1     0     1

N = 33

Notes: The NAO used the following impact categories. 1 (low): small or insignificant effects. 2: mixed effects, positive for some, negative or insignificant for others. 3: positive effects, with some caveats or uncertainties noted. 4 (high): significant positive impacts, no or only minor caveats or uncertainties noted.

Source: National Audit Office (NAO 2013).

Table 3 Range of Partitions Applied to the NAO Evaluation Quality Data

                  Maryland scale score
Impact score    1     2     3     4     5

Scenario A
4               3     4     1     1     0
3               2     6     0     1     0
2               0     2     1     6     2
1               1     1     1     0     1

Scenario B
4               3     4     1     1     0
3               2     6     0     1     0
2               0     2     1     6     2
1               1     1     1     0     1

Scenario C
4               3     4     1     1     0
3               2     6     0     1     0
2               0     2     1     6     2
1               1     1     1     0     1

Scenario D
4               3     4     1     1     0
3               2     6     0     1     0
2               0     2     1     6     2
1               1     1     1     0     1

Source: Calculations by the author using data from National Audit Office (NAO 2013).

The implications for policy learning are that the ability to learn and adapt effectively is being constrained by a high incidence of false positive results relating to high impact hypothesis tests—the more robust the methodology, the less likely a true positive test for high impact.

These low test sensitivities do not generate risks of distorting future decisions if evaluation is mainly treated as a compliance exercise. If, however, we seek to get better at learning and adaptation, then these low test sensitivities will start to matter in ways that they have not to date.

Given that there is fairly widespread scepticism over the reliability of many evaluation assessments in the public sector, it is plausible that this awareness reinforces the tendency to avoid taking learning and adaptation seriously—and to focus instead on policy design and delivery with little effort to close the learning loop (recall that only 305 out of 6,000, i.e. just 5 per cent, of the reports studied by the NAO were concerned with evaluation issues). The use of contractual key performance indicators (KPIs) and other performance measures further reinforces this tendency not to prioritise learning and adaptation. Indeed, from the perspective sketched out in this article, situations in which a contractor to government, or grant recipient, meets all their contracted KPIs but fails to actually achieve something useful via the intervention should be classed as a false positive test outcome. This compounds the test sensitivity problem because neither complying fully with contractual KPIs nor any subsequent evaluation is (from a Bayesian angle) likely to count as an accurate test result from a robust value-for-money perspective. Of course, if the recommended approach of defining outcomes on the basis of Bayesian tests of competing hypotheses is adopted, then the severity of such problems is reduced.

8. Implications for Moving Forward

The implication for evaluation methodologies, and especially systematic reviews of evaluation findings, is that it can be useful to use binary diagnostic analyses of this type, when combined with gradings of evaluation quality, to generate an overview of the reliability of findings—framed in terms of the likely incidence of false and true positives. Such overviews will focus attention on the test accuracy challenge and help to stimulate advances in evaluation methods that aim to reduce the incidence of false positives. At present, it can be all too easy to get lost in project- and program-specific details in a way that makes it hard to step back and assess evaluation findings more comprehensively in terms of the incidence of false positive test results. As stressed earlier, every false positive or false negative represents a missed opportunity to learn in public policy; hence, pervasive incidences of false positives and false negatives represent a systemic problem in governance—not least in driving up the cost of governing above levels that would otherwise be the case. The recommended approach is well suited to assessing just how pervasive the problem of false positive evaluation test results is.

Table 4 Plausible True and False Positive Scenarios for the NAO Data

N = 33 in each case          Likely to be a     Likely to be a
                             true positive      false positive
Scenario A   High impact            2                 27
             Low impact             1                  3
Scenario B   High impact            2                 16
             Low impact             9                  6
Scenario C   High impact            2                  7
             Low impact            12                 12
Scenario D   High impact            6                  3
             Low impact            21                  3

Source: Calculations by the author using data from National Audit Office (NAO 2013).

Consequently, given the potential utility for facilitating learning and adaptation in governance, it is worth investigating whether an established tool in signal processing and machine learning—the receiver operating characteristic curve (known as the ROC curve)—may provide a suitable diagnostic method in evidence-based policy analysis in general, and evaluation studies in particular. The ROC curve plots the false positive rate (on the X axis) against the true positive rate (on the Y axis) and was originally developed to assess the abilities of radar operators in WW2.14 ROC curves provide a useful means of measuring the accuracy of test results in a robust and coherent manner. As such, ROC plots reflect the principles behind the use of RCTs in public policy—but in a more generally applicable framework (indeed, ROC plots are used in medicine to assess the adequacy of RCT results). For a useful overview of the use of ROC plots in a range of contexts, see Swets et al. (2000).

Figure 6 contains an illustration of how ROC plots can be used in a diagnostic context relevant to public sector evaluations. The best possible performing hypothesis test lies in the top left-hand corner (a test that is 100 per cent sensitive and has a zero false positive rate).15

Random test results lie on the diagonal (e.g. someone guessing the toss result of a coin would expect to eventually end up at the 0.5, 0.5 point in the middle of the diagonal). Test results that are worse than random lie below that diagonal (thus providing a particularly useful diagnostic). In a public policy context, the potential to waste public funds increases the further that capabilities lie from the ideal diagnostic point in the top left-hand corner of the ROC space. Shifts in capability over time can be reflected as shifts in the locus and spread of test accuracy performance. In this public sector evaluation context, it is useful to refer to these plots as ‘diagnostic capability plots (DCPs)’.16

14. As illustrated in Figure 6, some versions of this plot add the true negative and the false negative rates to the top and the right-hand side of the diagram in order to produce plots that align fully with the parameters of the confusion matrix.

15. Care needs to be taken with terminology because there are two uses of the term ‘sensitivity’ in the Bayesian and signal processing literature (as the true positive rate and as the ratio of the true positive to the sum of the true positive and the false negative rate).

Figure 6 Principles of the Receiver Operating Characteristic Curve

Source: Modified version of figure 8 in Matthews and Kompas (2015).
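The placement of a test in ROC space can be sketched as follows. The counts are illustrative assumptions, not data from the article, and the `roc_point` and `classify` helpers are names introduced for this example only.

```python
# Placing a hypothesis test in ROC space, as in Figure 6.
# A test is summarised by its (false positive rate, true positive
# rate) point; a perfect test sits at (0.0, 1.0), the top left corner.

def roc_point(tp, fp, tn, fn):
    tpr = tp / (tp + fn)   # Y axis: true positive rate (sensitivity)
    fpr = fp / (fp + tn)   # X axis: false positive rate
    return fpr, tpr

def classify(fpr, tpr):
    """Relate a test's ROC point to the random-guessing diagonal."""
    if tpr > fpr:
        return "better than random (above the diagonal)"
    if tpr < fpr:
        return "worse than random (below the diagonal)"
    return "no better than guessing (on the diagonal)"

# Illustrative counts for one hypothetical evaluation test
fpr, tpr = roc_point(tp=40, fp=10, tn=90, fn=10)
print(f"({fpr:.2f}, {tpr:.2f}): {classify(fpr, tpr)}")
# → (0.10, 0.80): better than random (above the diagonal)
```

Tracking how this point moves over successive evaluation rounds is, in essence, what the diagnostic capability plots proposed above would measure.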

The suggestion is that adopting a binary approach to evaluation in public policy (i.e. developing categorisations using the matrix in Figure 4) would be useful because it allows DCPs to be plotted. This binary approach (the test for a specific hypothesis is positive or negative) can be applied to different levels of thresholds and sensitivity—as plotted in a DCP. Consequently, the framework can be used to generate useful practical measurements of capability. These DCPs would be very useful diagnostic metrics for both program and contract management and organisational capability development. In both cases, DCPs would provide a basis for both measuring test accuracy performance over discrete time periods and tracking changes in this performance over time.

From this perspective, learning and adaptation in public policy would be facilitated by using DCPs to determine the effectiveness of current evaluation capabilities and to move on to develop the strategies, tactics and guidelines/tools necessary to shift an organisation’s DCP performance towards the target region, and in particular away from the area of worse than random performance below the diagonal line.17 Indeed, this use of DCPs is compatible with the capability maturity model (CMM) originally developed to certify software developers but now being used more generally to assess performance in the public sector.18

9. Conclusions

The main aim of this article has been to draw the attention of the policy community to the potential utility of a simplified and standardised Bayesian framework based on the use of natural frequencies and betting odds as a means of focusing attention on the importance of learning effectively by assessing and then reducing the incidence of false positive and false negative test results in evaluations. Weak analytical capacity, reflected in a high incidence of false positive test results for high impact hypotheses, can distort decision-making and reduce the capacity to learn and adapt. In general terms, ‘evidence-based policy-making’ would be more useful if (just as in clinical diagnosis) it moved forward by avoiding legalistic notions of ‘proof’ in preference for exploiting the powerful insights from the pioneering information theory work of Claude Shannon and Alan Turing.

The next step in this capacity-building agenda would be to implement this simplified and standardised Bayesian approach in systematic reviews (meta-studies) of how effectively policy learning has been taking place in similar policy areas in different countries. The use of DCPs would provide a useful way of summarising differences in diagnostic capability between countries and policy domains, and would provide a practical yet theoretically robust means of tracking the success of future capacity-building efforts in evidence-based evaluations.

16. There is also a related diagnostic plot in signal processing and machine learning known as the detection error trade-off (DET) plot that focuses on the relationship between false negatives and false positives in signal detection. Potentially, DET plots are useful as a risk management tool (e.g. for assessing performance in detecting threats in airline security). See Martin, Doddington, Kamm, Ordowski, and Przybrocki (2015) for a discussion of the DET methodology in speech recognition technology.

17. DCPs in public sector evaluation can stray into this worse than random domain. In the original radar signal interpretation ROC curve context, capabilities in this ‘worse than random’ domain mean that the classifier’s answer is correlated with the actual answer but is negatively correlated with the true answer. More work needs to be done on what this means in an evaluation and a public policy context—but it is suggestive of a ‘looking glass’ situation in which verification and falsification of hypotheses are reversed (with consequent cost implications).

18. The CMM classifies different levels of organisational capability into five bands based on the likelihood of repeated (good) performance—hence it is linked to and therefore compatible with ROC curve-based diagnostics. The author is not aware of any existing frameworks that express CMM capability bands as regions in an ROC plot; however, this could be a useful avenue to explore in future work in this area.

As regards the key technical challenge of calculating likelihoods of test inaccuracies, it is important to recall that, in clinical diagnosis, the estimated likelihood of true and false positives and true and false negatives in test results is a function of the overall prevalence of measured test accuracies (as the context against which the likely validity of a specific test result is judged). Consequently, the collective international systematic review process flagged above would provide a means of calculating the overall prevalence of test inaccuracies for particular types of evaluation, in particular public sector activities. In other words (and just like in clinical diagnosis), we should seek to develop an integrated dataset that allows practitioners in government, and their professional advisors, to calibrate specific test results against the overall prevalence of test inaccuracies. With sufficient international cooperation, perhaps facilitated by the OECD, such a global dataset could be developed by tracking the relationships between sequences of programs and project-specific evaluations (the best way of measuring false positive and false negative test results is by using successive tests as these can reveal shortcomings in previous tests).

August 2015. The author would like to acknowledge support for the research upon which this article draws provided by (a) the Australian Centre for Biosecurity and Environmental Economics (ACBEE), and (b) the HC Coombs Policy Forum at The Australian National University (which received Australian Government funding under the Enhancing Public Policy Initiative). Particular thanks to Geoff White both for a series of useful discussions over recent years on how to develop ‘risk-aware’ evaluation methods and for useful comments on an earlier draft of this article.

References

Althaus C, Bridgman P, Davis G (2007) The Australian Policy Handbook, 4th edn. Allen & Unwin, Sydney.

Dale A (2012) Book Review: The Theory That Would Not Die. Notices of the American Mathematical Society 59(5), 658–60.

Gigerenzer G (2002) Reckoning with Risk: Learning to Live with Uncertainty. Penguin, London.

Gillies D (1990) The Turing-Good Weight of Evidence Function and Popper’s Measure of the Severity of a Test. British Journal of the Philosophy of Science 41, 143–6.

Good IJ (1979) A. M. Turing’s Statistical Work in World War II. Biometrika 66(2), 393–6.

HM Treasury (2011) The Green Book: Appraisal and Evaluation in Central Government. Treasury Guidance. TSO, London.

Madelano M, Waights S (2015) Guide to Scoring Methods Using the Maryland Scientific Methods Scale. What Works Centre for Local Economic Growth, London.

Martin A, Doddington G, Kamm T, Ordowski M, Przybrocki M (2015) The DET Curve in Assessment of Detection Task Performance. National Institute of Standards and Technology (NIST), Gaithersburg, MD.

Matthews M (2015) Chapter 11: How Better Methods for Coping with Uncertainty and Ambiguity Can Strengthen Government–Civil Society Collaboration. In: Carey G, Landvogt K, Barraket J (eds) Designing and Implementing Public Policy: Cross-sectoral Debates, pp. 75–89. Routledge, London.

Matthews M (2016) Transformational Public Policy. Routledge, London, forthcoming.

Matthews M, Kompas T (2015) Coping with Nasty Surprises: Improving Risk Management in the Public Sector Using Simplified Bayesian Methods. Asia & the Pacific Policy Studies 2(3), 452–66.

Matthews M, White G (2013) Faster, Smarter and Cheaper: Hypothesis-Testing in Policy and Program Evaluation. Evaluation Connections (European Evaluation Society Newsletter), 13–4.

McGrane SB (2011) The Theory That Would Not Die. Yale University Press, New Haven, CT.

National Audit Office (NAO) (2013) Evaluation in Government. NAO, London.

Patton MQ (2010) Developmental Evaluation: Applying Complexity Concepts to Enhance Innovation and Use. Guilford Press, New York, NY.

Pierce JR (1961) Symbols, Signals and Noise: The Nature and Process of Communication. Harper & Brothers, New York.

Powers DMW (2011) Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2(1), 37–63.

Sabel C, Zeitlin J (2012) Experimentalist Governance. In: Levi-Faur D (ed) The Oxford Handbook of Governance, pp. 169–83. Oxford University Press, Oxford.

Shannon CE (1948) A Mathematical Theory of Communication. Bell System Technical Journal XXVII, 379–423.

Swets JA, Dawes RM, Monahan J (2000) Psychological Science Can Improve Diagnostic Decisions. Psychological Science in the Public Interest 1(1), 1–26.
