
The influence of interpretable machine learning on human accuracy

A study on the increased accuracy from a LIME explainer on a classification test

Rens Sturm

Master Thesis
MSc Marketing Intelligence
June 11th, 2020


The influence of interpretable machine learning on human accuracy

A study on the increased accuracy from a LIME explainer on a classification test

By: Rens Sturm
University of Groningen
Faculty of Economics and Business (FEB)
Department: Marketing
Master: Marketing Intelligence
June 2020

First supervisor: K. Dehmamy
Second supervisor: J. Wierenga

Saffierstraat 22, 9743 LH Groningen
Phone: 06-83773819
[email protected]
Student number: S3856593


Management Summary

The field of machine learning is growing at an unprecedented rate, increasing its applications in everyday life. A prominent new role of machine learning is the automation or support of decision making for businesses, courts and governments, helping them make faster and better decisions. The rapid expansion of machine learning decision making has caused unease among academics, consumer groups and legal experts. While the reasons for this unease vary, a major one is that machine learning models make or support decisions without providing an explanation or supporting arguments. Combined with the fact that machine learning models are, like all systems, fallible, this has led to an increasing demand for interpretable machine learning models.

This research tests whether an interpretation mechanism included in a machine learning model increases trust in the model and whether an explanatory mechanism makes the decision-maker more accurate. To do this, we trained a neural network, a type of machine learning model, on a dataset of passengers of the Titanic, which sank in 1912. After training, the model was made interpretable using Local Interpretable Model-agnostic Explanations (LIME). The interpretable model was then used by 145 participants to estimate whether certain individuals had survived the Titanic disaster.

We found that an explanation mechanism significantly and positively influences the accuracy of participants in estimating survival (B = 0.1037, p = 0.0003). Second, an explanatory mechanism also increases the trust of the participant in the model (B = 0.8955, p = 8.8e-08). Women were slightly better than men at predicting survival, although this difference may have more to do with the methodology. We found no correlation between expertise on the Titanic and accuracy, nor between experience with machine learning models and accuracy.

Managers can use these insights in critical areas where a boost in accuracy yields large benefits. Companies that already use machine learning, for recommendations or automated decision making, can use explanatory mechanisms to increase consumer trust in the model. Further research is needed to generalize interpretable machine learning to areas other than decision-making support and classification problems.

Keywords: Machine learning, interpretable machine learning, neural networks, LIME


Preface

The writing of a thesis is a rite of passage undergone by all students. For me the journey was thoroughly enjoyable: being immersed in the field of machine learning and writing about it. Combining the new field of machine learning with old history like the Titanic was especially satisfying; I am happy I was able to sneak in a little history at last. Throughout the writing I was helped by a few people I would like to thank.

First of all my supervisor Keyvan Dehmamy. While help from a supervisor can be expected, Keyvan went truly above and beyond my expectations, something I am very grateful for. Second I would like to thank my fellow student Darius Lamochi, who helped me from the start. Lastly I want to thank my friends and family who helped me, especially Daan Romp, who was kind enough to check my thesis for spelling.

I hope you will enjoy reading my thesis.

Rens Sturm
June 2020, Groningen


Table of contents

Chapter 1: Introduction
1.1: Relevancy of the problem
1.2: Problem analysis
1.3: Academic and practical relevance
1.4: Originality of the research question
1.5: Outline of the thesis
Chapter 2: Theoretical framework
2.1: Growth in the application of machine learning
2.2: Interpretable and non-interpretable models
2.3: Techniques to make uninterpretable models interpretable
2.4: Application of interpretable black box modelling in decision support
2.5: Conclusion
Chapter 3: Research design & Methodology
3.1: Introduction to neural networks
3.2: Model development & dataset description
3.3: Research method
3.4: Data collection
3.5: Plan of analysis
3.6: Conclusion
Chapter 4: Data analysis
4.1: Sample analysis
4.2: Reliability, validity, representativity
4.3: Hypotheses and statistical tests
4.4: Interpretation
4.5: Conclusion
Chapter 5: Discussion, limitations and recommendations
5.1: Reflective discussion on the results
5.2: Limitations
5.3: Academic and managerial conclusions
5.4: Conclusion
References


Chapter 1: Introduction

In this chapter the research problem is introduced and its theoretical and managerial relevance discussed. We will show that the research problem is urgent, important and original.

1.1: Relevancy of the problem

Machine learning, an approach in which algorithms improve with experience (Langley & Simon, 1995), has achieved widespread application. It is used in, amongst others, defense, research, production, air and traffic control, portfolio selection and decision support (Zuo, 2019). The market for machine learning is predicted to grow 186% annually until 2024 (Zion Market Research, 2017). An example that shows the potential of machine learning decision support is a study in which US police chiefs used machine learning to predict whether officers were at risk of overreacting towards civilians. The final decision to intervene was left to the chief, but with machine learning support accuracy increased by 12% (Carton et al., 2016). A Bank of England survey showed that two-thirds of finance companies in the UK use machine learning as decision support (Bank of England, 2019), making machine learning highly relevant in the 21st century.

The increased use of machine learning has evoked feelings of unease in consumers (Cio Summits, 2019). Among the reasons given are that machine learning models often provide no explanation and do not decide perfectly (Guidotti et al., 2018). In response to this growing unease, the European Union has passed legislation that gives consumers a "right to explanation" after having been subjected to a decision by an automated process (Regulation, 2016). Consequently, if a human agent uses a machine learning model as input for a decision, the model must be explained. Models for which an explanation can be given are defined as interpretable or white box; models that cannot be explained are defined as uninterpretable or black box. Due to the growing concern and unease, the need for interpretable machine learning models has thus become urgent.

1.2: Problem analysis

Uninterpretable models prevent organizational learning and carry the risk of a systematic error in the data used for development, i.e. a training bias (Guidotti et al., 2018). The model may be sacrificing organizational goals for the proxy goal it has been given (Doshi-Velez & Kim, 2017), or it may have "cheated" during training by using the wrong features (input data points), such as the background of an image, or by reproducing a racial bias captured in the historical data (Guidotti et al., 2018).


A further complication is the absence of consensus about the definition of interpretability. For some academics and managers, mechanical knowledge of the model is sufficient; others want to know what input was decisive in the decision-making process (Guidotti et al., 2018). Second, there is an ongoing discussion on how to measure interpretability (Doshi-Velez & Kim, 2017). Some researchers have stipulated that increased human accuracy is an important metric to keep in mind (Doshi-Velez & Kim, 2017), and in some cases even the most important one.

1.3: Academic and practical relevance

The challenge of making machine learning interpretable is almost as old as the field of machine learning itself. In 1983, the first methods for explaining why a model predicted a certain output were proposed, laying the foundation for later innovations (Swartout, 1983). Since then, research has mainly focused on explaining machine learning models: removing part of the model, changing the input to see how the output changes, or calculating certain values have all been proposed in order to make uninterpretable black box models understandable (Guidotti et al., 2018). Absent is research on the effect of interpretation methods on the human decision maker and his or her accuracy. Given how widely machine learning is used to support human decisions, the relationship between interpretable machine learning and human accuracy is an original and relevant research question.

Establishing a correlation between interpretable machine learning and human accuracy would help establish increased human accuracy as an important metric for interpretable machine learning and help settle the discussion on how to measure interpretability.

Due to the EU legislation, companies have to use interpretable machine learning in their procedures, making the topic all the more relevant. Since companies use machine learning for decision support (Bank of England, 2019), it is relevant to know whether interpretability helps in this process. If accuracy increases, managers could use it to make better decisions in critical situations, helping the organization reach its goals.

1.4: Originality of the research question

In this report we address the research question: "What is the influence of an explanation of the output of a model on the accuracy of the human agent?" Answering this question adds to the existing literature in the following way: much research has been done on interpretable machine learning and on machine learning accuracy in autonomous tasks, but little research exists on human accuracy in non-autonomous tasks, like the early-warning example described at the beginning of this chapter. A second contribution is that while increased accuracy has been proposed as a metric on which to judge a machine learning model, it has (to the best of our knowledge) not yet been tested in an experimental set-up.

1.5: Outline of the thesis

This thesis builds on earlier research and experiments. In chapter two we discuss the theoretical framework of the thesis and draw a number of hypotheses from the existing literature on what seems likely but is not yet proven. In the third chapter we explain how we set up an experiment to test these hypotheses; the experiment ensures that we have accurate, unbiased data to work with. The fourth chapter analyses the data collected in the experiment using various statistical methods, with which we can accept or reject each hypothesis. Lastly, in chapter five, we discuss the findings and how to move on from there, as well as the limitations of our research.


Chapter 2: Theoretical framework

In this research we analyze the role of machine learning recommendations in behavioral decision making. The research uses insights from other researchers as a foundation, framework and stepping stone. In this chapter we briefly describe relevant literature about machine learning, interpretable and non-interpretable machine learning, and decision making. The existing literature is used to formulate hypotheses that feed into the conceptual framework at the end of this chapter. These hypotheses will be tested in the chapters that follow.

2.1: Growth in the application of machine learning

To understand the relevance of interpretable machine learning we first have to discuss the context of machine learning. In this paragraph we look at applications of machine learning, why it is popular, and the distinction between situations where a human agent is the final decision-maker and situations where the machine learning algorithm is autonomous. This distinction demarcates the theoretical reach of the research.

In the past few years, machine learning has greatly improved in performance. In 1997 IBM's Deep Blue defeated world champion Kasparov in chess, a structured game with rigid rules (Greenemeier, 2020). Only fourteen years later, IBM's Watson defeated two champions at Jeopardy, a far more unstructured game (Cbsnews.com, 2020). Explanations for the improved performance are that computers have become more powerful, engineers use more complex models, and there is more data to work with (Hodjat, 2015; Hao, 2019). The application of machine learning has grown rapidly, ranging from automation (national defense, research, production automation, air and traffic control) to support (portfolio selection and human decision-making support) (Zuo, 2019). In an automated role, the machine learning model takes over the task from a human planner; in a support role, the model supports the human but does not make decisions for him or her. Machine learning is used in a wide range of decision-making support: healthcare organizations rely on neural networks as decision support (Shahid, Rappon & Berta, 2019), and investors use neural networks, a popular machine learning method, as decision support during portfolio selection and risk management (Al-Qaheri & Hasan, 2010). As mentioned earlier, two-thirds of finance firms in the UK use machine learning (Bank of England, 2019). The increase in application is not without reason. Trained neural networks are better at predicting research and development (R&D) costs than humans (Bode, 1998), and a 2019 study showed that neural networks combined with human coaches performed much better than humans alone (Zuo, 2019), providing clear benefits for adopters of machine learning.


Proposed explanations for this superior performance point at the limited human cognitive capacity for processing new information (Shahid, Rappon & Berta, 2019). Machine learning models can handle large quantities of complicated information, while humans find handling many data points difficult. Humans often work with small samples of personal experience, whereas machine learning models work with larger samples (Bode, 1998). Humans also have a limited capacity to mentally calculate interactions, something neural networks in particular have less of a problem with (Shahid, Rappon & Berta, 2019). Machines can work through large numbers of scenarios, while humans have a tendency to lock in early (Nickerson, 1998). The basic explanation seems to be that the computational power of computers exceeds human processing power.

Daniel Kahneman, Nobel laureate and professor of human decision making, has famously stated that he thinks simple algorithms consistently outperform human decision making due to their lack of sensitivity to noise (Kahneman, Rosenfield, Gandhi & Blaser, 2016). A literature overview from 1979 showed a consistent superiority of algorithmic decision making over human decision making (Dawes, 1979), confirming the earlier stated theories about machine learning superiority. However, critics point out that historical data may not be representative. Before the 2008 crisis, house prices steadily went up, causing algorithms to assume this would continue forever; instead, prices decreased sharply at an unprecedented rate, contributing to a financial crash (Trejos, 2007). Other critics point out that humans are better than machine learning models at some tasks (Thiel & Masters, 2015); young children have no difficulty distinguishing dogs from cats, while machines do.

Thus, machine learning is getting better at many tasks humans have previously performed. Automation and decision support are two ways machine learning can add value. Models are better at handling and reasoning through vast quantities of information, while humans outperform computers in other areas. Given these benefits, we expect that humans supported by machine learning will be more correct and precise in their predictions (human accuracy). We therefore conclude that when machine learning is used, and a human agent can understand its output, human accuracy increases.

2.2: Interpretable and non-interpretable models

While designing a machine learning model, an important decision is whether to make the model interpretable or not. This decision influences interpretability and trust, and we are interested in whether it also influences human accuracy.


In the introduction we stated that the definition of interpretability is open for discussion (Miller, 2019). In this research we define the interpretability of a model as the degree to which a human can understand the cause of a decision (Biran & Cotton, 2017; Miller, 2019). Guidotti et al. (2018) defined two components of uninterpretable models: 1) opaqueness and 2) the number of parts. A model is opaque when its internal workings cannot be observed. Neural networks are mostly uninterpretable because their internal workings are unobservable and because they contain many parts.

Models that provide no explanation (also called black boxes) lead users to see the model as less trustworthy (Ribeiro et al., 2016). Providing an explanation increases the acceptance of movie recommendations (Herlocker, Konstan & Riedl, 2000). In a 2003 study, participants rated an automated decision aid as much less trustworthy when no explanation was given and used it significantly less (Dzindolet et al., 2003). Users report feelings of violated privacy when black box models make recommendations (Van Doorn & Hoekstra, 2013), as well as a feeling of unfairness (Zhang & Bareinboim, 2018).

Second, a black box model might not work as well as one thinks. It can happen that a model 'cheats' by focusing on an artificial feature. For example, a complicated neural network could accurately predict whether a tank belonged to the United States Army by recognizing whether the photo contained clouds (Guidotti et al., 2018): owned tanks had been photographed in good weather, enemy tanks in bad weather. Another model could recognize whether an animal was a wolf or a husky by spotting snow in the picture (Guidotti et al., 2018). The reason for this 'cheating' is simple: the model does not know these features are 'off limits'. If the model could give an explanation, these kinds of mistakes could be spotted and fixed before they cause damage. Statistics on these kinds of errors are missing, but the risk is there.

Lastly, black box models trained on historical data can take over undesirable patterns. A model trained in 2016 to predict the risk of criminal recidivism showed a large racial bias against people of color (Guidotti et al., 2018). Also in 2016, Amazon's decision not to offer same-day delivery in minority neighborhoods was largely influenced by a predictive machine learning model (Letzter, 2020). These are outcomes companies want to avoid, since the Universal Declaration of Human Rights specifies that treatment should not depend on race ("Universal Declaration of Human Rights", 2020).

While uninterpretable models carry disadvantages, so do interpretable models. A major disadvantage of interpretable machine learning models is that they are often simpler and therefore less accurate (Fong & Vedaldi, 2017). Second, interpretable models are easy to replicate, making them unable to serve as a sustainable competitive advantage (Guidotti et al., 2018).


While there have been few opponents of explainable models, they have been used little in practice (Ribeiro et al., 2016). Many agree that explainability is good to have, but few are willing to give up accuracy and a competitive advantage in order to obtain it. The 2016 EU legislation has made it mandatory to provide an explanation when an automated decision is made about a person; what counts as an explanation, however, is open for interpretation. This ruling makes the need for accurate explainable models, and for a consensus on what an explanation is and how it can be measured, all the more urgent.

In conclusion, uninterpretable models carry some major disadvantages but are still preferred in practice due to their higher accuracy. The EU ruling has made it, at least in the European Union and its member states, necessary by law to use explainable models in decision-making scenarios. Due to the opaqueness and multitude of parts, we conclude that the more complex a machine learning model is, the less interpretable a human agent will find the model.

2.3: Techniques to make uninterpretable models interpretable

Previously, the need for accurate interpretable models was established. Popular solutions try to explain the model in a way a human can understand instead of simplifying it, so that accuracy and interpretability are both high. We will look at popular methods and explore the influence of explanation mechanisms on trust and understanding. Later, this relation will be used in the conceptual framework.

In the previous paragraphs the distinction between black box and white box models was made. There is a third option: explainable black box models. These models are opaque and contain many parts, but as an additional step they give an explanation of what they did in order to enhance interpretability (Guidotti et al., 2018). Such explanations do not come at the expense of accuracy.

Black box models can be explained in many ways. We will look at three popular methods: ablation, simplification and saliency.

A basic method of understanding black box models is ablation, where parts of the model are removed to see how the accuracy changes (Xu, Crammer & Schuurmans, 2016). Important parts of the machine learning model can thus be identified. A major downside of ablation is that it is computationally expensive and does not scale well (Sundararajan, 2017).

Simplification methods try to recreate a simpler version of the original model (Yang et al., 2018). A well-known method is pruning, where ineffective parts of the model are cut (Guidotti et al., 2018). By decreasing the number of parts in a model, interpretability goes up, since adding more parts decreases interpretability (see the previous paragraph).


Lastly, saliency methods increase and decrease parts of the input to see when the model reaches certain tipping points (Sundararajan, 2017; Guidotti et al., 2018). A popular method is Integrated Gradients, which scales a pixel from 0% strength to 100% to determine the slope and the optimal point on that slope. By turning on only the pixels that are necessary for the model to understand the picture, it becomes clearer to a human why the model recognizes the picture as such. An example is shown in figures 1 and 2 below.

Figure 1 and 2: a picture of a turtle analyzed with Integrated Gradients (Sundararajan, 2017).

A second popular saliency method, which we will use in our research, is the Local Interpretable Model-agnostic Explanations (LIME) approach (Guidotti et al., 2018). The LIME technique changes the input of the model marginally and observes how the output of the model changes, thereby deducing the influence of the input variables. An advantage of the LIME technique is that it can be used with multiple machine learning methods; the method is so-called model agnostic and is therefore flexible in use.
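To make this intuition concrete, the following minimal sketch illustrates the core idea behind LIME rather than the lime library itself: perturb a single instance, weight the perturbed samples by how close they stay to that instance, and fit a weighted linear surrogate whose coefficients act as the local explanation. The function and variable names (explain_locally, predict_fn) are our own illustrative choices, not part of any package.

```python
# Illustrative sketch of the core idea behind LIME (not the lime library itself).
# "predict_fn" is a hypothetical stand-in for any trained black box model.
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(predict_fn, instance, n_samples=500, scale=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Sample points in the neighbourhood of the instance of interest
    perturbed = instance + rng.normal(0.0, scale, size=(n_samples, instance.shape[0]))
    # 2. Query the black box model for its predictions on those points
    preds = predict_fn(perturbed)
    # 3. Weight each sample by its proximity to the original instance
    distances = np.sum((perturbed - instance) ** 2, axis=1)
    weights = np.exp(-distances / (2 * scale ** 2))
    # 4. Fit a weighted linear model; its coefficients approximate the local
    #    influence of each input variable on the prediction
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbed, preds, sample_weight=weights)
    return surrogate.coef_
```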

Saliency methods do not explain the inner workings of the machine learning model (Guidotti et al., 2018); they merely explain the relationship between the input of the model and the output (Ribeiro, Singh & Guestrin, 2016). This means that some aspects remain unobservable, and the problems associated with them remain as well.

As described earlier, withholding an explanation when one is expected invokes feelings of mistrust, violated privacy and unfairness (Van Doorn & Hoekstra, 2013; Zhang & Bareinboim, 2018). Second, it has been observed in the social sciences that giving an explanation increases compliance and trust even if the explanation is not logically sound (Cialdini, 2014). A multitude of studies find that giving an explanation increases trust and understanding (Kim et al., 2016; Gkatzia et al., 2016; Biran & McKeown, 2017). Showing which features were most important in making a prediction can help a human understand why the decision was made, thus increasing interpretability. It should be noted, however, that a 2019 study found that giving a good explanation does indeed increase trust and understanding, but giving a bad explanation decreases trust (Papenmeier, Englebienne & Seifert, 2019). The sample size of this study (N = 327) was large enough to be reliable; had the sample been small, the contradictory finding could have been a false negative. It should also be noted that this finding has, to the best of our knowledge, not yet been replicated by the authors themselves or by third parties, so a false negative remains possible. If the study is replicable, then a difference in study set-up or an underlying nuance is likely.

Why is an interpretable model more trustworthy than an uninterpretable one? A very basic explanation can be found in evolutionary psychology: the unknown carries risk, and we are risk-averse (Zhang, Brennan & Lo, 2014). If we delve deeper, we find additional theories. Providing an explanation gives the participant the option to assess the fitness of the machine for that particular case. In a 2016 study, researchers found that participants rate models as more trustworthy, even when they make faults, if the human has the possibility to change the model (Ribeiro, Singh & Guestrin, 2016). It should be noted that the sample size for that particular study was rather small (N = 100). Other studies do find that explanations increase trustworthiness if the human has the possibility to deviate from the final recommendation (Kim et al., 2016; Gkatzia et al., 2016; Biran & McKeown, 2017). A second basic explanation is that an explanation provides the human actor with the possibility to examine the internal logic. It is well established that computer programs do not follow 'common sense' and make 'silly mistakes' (Ribeiro, Singh & Guestrin, 2016).

Providing an explanation thus seems to increase interpretability and trustworthiness, even though there are some contrary findings. When an explanation is given, it becomes clearer why the model gave a certain prediction, even though neither the internal workings of the machine nor the interactions between its parts are explained. Hence we draw the following hypotheses:

H1: When a machine learning model contains an explanatory mechanism, the human agent increases in accuracy.

H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy.

H3: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more interpretable.


2.4: Application of interpretable black box modelling in decision support

Next we discuss the potential of interpretable black box modelling and describe the expected relation between interpretable models and human accuracy.

As we have seen, machine learning can often predict better than humans. Judges often have to decide whether a suspected criminal can await trial at home (often by posting bail) or must do so in jail because he or she is a flight risk or expected to commit crimes while at home. A 2017 study developed an algorithm that could make this decision better and thereby reduce jail rates by 42% without increasing crime; human judges were too overwhelmed by 'noise', unrelated information (Kleinberg et al., 2017). The difference raises the question: if a machine learning model judges criminals better than human judges, shouldn't we leave the decision to the machines? Opponents stipulate that this would be unwise and even unethical (Bostrom & Yudkowsky, 2011): algorithms do not understand values like fairness and liberty, nor should we subject ourselves to computers we do not understand. Even the researchers themselves were hesitant to replace all human judges by computers, since their study had not been replicated and could contain errors (Kleinberg et al., 2017), though they were optimistic. Kleinberg et al., however, do not specify their recommendation for scenarios where a model (a) does not have sufficient data to work with, (b) does not understand important variables (like certain values), or (c) where humans are better at part of the job. Hence there are limits to when machine learning is better than a human. In situations where neither the human nor the machine learning model is superior in all the features used for prediction, there is an opportunity for synergy.

A third strand of the argument proposes a middle way: do not replace humans with algorithms but augment them (Rubin, 2019). This way the human and the machine both work on what they are best at. In 2017, Doshi-Velez and Kim proposed that we should judge decision-aiding models on whether they help the decision-maker, since this is closest to the goal of the model. Using this middle way would allow the human agent to control for variables the model does not understand or does not work well with. It would also simplify who is responsible if the decision does not pan out. These benefits support our first hypothesis that interpretable machine learning increases the accuracy of the human agent.

The characteristics of the human agent influence which of these explanations, if any, will be found to hold. Certain biological aspects like age and gender matter during the decision-making process, and previous experiences or early exposure to certain variables can influence it as well. We will look at these two variables in greater detail.

Age has been found to be relevant to decision making in multiple studies. A 1975 study found that the age of a manager positively and significantly influences the accuracy of the decision-making process (Taylor, 1975). While the claim has been made that age has a negative effect on the ability to deal with new technologies, like machine learning, that study found no empirical evidence to back this claim up (Taylor, 1975). Other studies found elderly participants to be more risk-averse than younger participants (Cauffman et al., 2010); the researchers found this effect to be consistent both in cases where it was advisable to be risk-averse and in cases where it was not in their interest (Cauffman et al., 2010). Older people, however, have been found to be more rigid in their thinking due to a decline in fluid cognitive ability, i.e. the ability to reason and solve problems without relying on previous experiences (de Bruin et al., 2010). Based on these findings we expect that, if one classifies machine learning as a new technology, older participants are likely to find it harder to work with. This expectation has yet to be tested.

Hence we draw the following hypothesis: H4: the age of the human agent negatively influences the causal link between machine learning support and human agent accuracy.

The influence of gender on decision-making, especially in managerial positions, has been a controversial topic for centuries. While women are, on average, underrepresented in politics ("Women in Parliaments", 2020) and in business executive positions ("Female Business Leaders", 2020), there does not seem to be a convincing biological factor affecting decision making. Women do seem to be more risk-averse than men, regardless of the level of ambiguity (Powell & Ansic, 1997). Second, women score higher on the personality trait of agreeableness (Chapman et al., 2007), which may mean that women are more likely to follow the advice of the machine learning model.

Hence we draw the following hypothesis: H5: the gender of the human agent influences the causal link between machine learning support and human agent accuracy.

Age and experience are somewhat correlated, but not the same thing. A sixty-year-old manager using a machine learning model for the first time is aged, but not experienced (in machine learning models). While it seems common knowledge that extended experience causes superior performance, this may not be the case. In a study among physicians, researchers found no correlation between age and performance (Ericsson, 2006). The same researchers found cases where experience led to worse performance due to overconfidence (Ericsson, 2004). Other studies have likewise found no link between experience and performance (Ericsson & Lehmann, 1996). Thus we expect to find no link between experience and decision accuracy.

2.5: Conclusion

From these hypotheses we draw the following conceptual model. Complexity (caused by opaqueness and the number of parts) causes the model to be less interpretable. Interpretability positively influences accuracy. For the human agent to work well with the machine learning model, he or she must trust the model to be accurate and to capture the whole question, and the level of trust positively influences the accuracy of the human agent. Adding an explanatory mechanism influences two relationships in the conceptual framework. First, it softens the impact of the complexity of the model on the interpretability of the model, making that relationship less negative. An interpretable model causes the human agent to be more accurate because he or she can fill in the gaps the algorithm does not understand, and adding an explanatory mechanism moderates this effect. Second, the explanatory mechanism increases trust: by understanding the model better, the human agent can judge the model more accurately, which increases trust. For an overview, see figure 3.

Figure 3: conceptual framework

The conceptual framework crystallizes five hypotheses:

H1: When a machine learning model contains an explanatory mechanism, the human agent increases in accuracy.

H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy.

H3: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more interpretable.

H4: The age of the human agent negatively influences the causal link between machine learning support and human agent accuracy.

H5: The gender of the human agent influences the causal link between machine learning support and human agent accuracy.

In the next chapter, "Research design & Methodology", an experiment is set up to test each of these hypotheses. Chapter four uses this set-up to analyze the collected data and to conclude whether these hypotheses hold.


Chapter 3: Research design & Methodology

To research whether interpretability influences accuracy we set up an experiment, since this has (to the best of our knowledge) not been researched before. The experiment takes place in three steps:

1. The development of a neural network that can predict the survival of the passengers
2. A survey in which the availability of an explanatory mechanism is manipulated
3. Analysis of the data

Before elaborating on the design of the experiment, we give a brief summary of the nature of a neural network, the type of machine learning algorithm chosen for the experiment. This is not covered in the literature overview since it is not part of the conceptual framework, though a basic understanding helps in understanding the research design. Readers already familiar with neural networks can skip this part.

3.1: Introduction to neural networks

A neural network is a type of machine learning model. The task of a neural network is to translate the input/independent variables (IVs) into an output/dependent variable (DV) (Goodfellow et al., 2017). The neural network does this by sending the input to a hidden layer (see figure 4) that increases or decreases the input values by a certain amount. The new values are then forwarded to another hidden layer, or to an output layer once the model performs satisfactorily. In the beginning, the model does not know exactly by how much to increase or decrease the values; this is learned during the training stage. The designers of the model give the neural network a long list of IVs together with the corresponding DVs. The model adjusts itself a little each time it gets a question wrong, which is called back-propagation (Goodfellow et al., 2017). This is repeated until the model cannot improve any further. When the model is trained adequately, it is tested on a new list of questions for which the answers are not provided.
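As a toy illustration of this mechanism (with made-up numbers, not the thesis network), the sketch below shows how a single hidden layer transforms four input values into the values that are passed on to the next layer.

```python
# A toy illustration of one hidden layer: four input values are multiplied by
# learned weights, a bias is added, and an activation passes the result on.
import numpy as np

inputs = np.array([3.0, 1.0, 0.0, 1.0])                  # e.g. class, gender, SibSp, ParCh
weights = np.random.default_rng(0).normal(size=(4, 3))   # values learned during training
biases = np.zeros(3)

hidden = np.maximum(0.0, inputs @ weights + biases)      # relu: negative values become zero
print(hidden)                                            # three values passed to the next layer
```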

3.2: Model development & dataset description

First we developed a neural network (NN) that predicts whether a passenger of the Titanic survived, based on cabin class, gender and the presence of family. This NN was trained on a Kaggle dataset. After development we made the model interpretable by adding a LIME explainer.

In order to develop a good model we need good data. The dataset used is provided by Kaggle, an online learning community for data science and machine learning (TechCrunch, 2020). The Titanic dataset is based on historical data; it has been used by scholars, for example in Chatterjee (2018), and has been verified by a non-profit Titanic remembrance group (Titanic Survivors, 2020). The training dataset contains 891 observations of 12 variables.

Due to the completeness of the dataset no observations were removed. In order to predict survival, a few variables were removed: the ticket code, name, cabin name, age, fare and embarkation location showed little correspondence with survival or were incomplete. Only the independent variables class, sex (gender), the presence of siblings or a spouse, and the presence of a parent or child were kept, along with the dependent variable, whether the individual survived. The presence of siblings, spouses, parents or children was transformed into a binary variable (present or not present) in order to keep the model simple for the participants as well. See table 1 for the data types used after the transformation of the data.

Variable | Explanation | Data type
Class | The class of the cabin in which the passenger in question stayed during his/her voyage on the Titanic. | Factor with 3 levels: First class, Second class, Third class
Gender | The gender of the passenger. | Factor with 2 levels: Male, Female
SibSp | The presence of a sibling or spouse on the ship at the moment of sinking. | Factor with 2 levels: present or not present
ParCh | The presence of a parent or a child on the ship. | Factor with 2 levels: present or not present
Survived | Whether the passenger survived. This is the dependent variable of the machine learning model. | Factor with 2 levels: did survive or did not survive

Table 1: variables of the neural network and their data types
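For illustration, the sketch below shows one way to perform this preprocessing with pandas, assuming the standard Kaggle file train.csv and its column names (Pclass, Sex, SibSp, Parch, Survived); the exact code used for the thesis is in appendix one and may differ.

```python
# A minimal preprocessing sketch under the assumptions stated above.
import pandas as pd

titanic = pd.read_csv("train.csv")

# Keep only the variables used by the neural network
data = titanic[["Pclass", "Sex", "SibSp", "Parch", "Survived"]].copy()

# Recode gender to a binary indicator and collapse the family counts into
# "present or not present", as described in the text
data["Sex"] = (data["Sex"] == "female").astype(int)
data["SibSp"] = (data["SibSp"] > 0).astype(int)
data["Parch"] = (data["Parch"] > 0).astype(int)

X = data[["Pclass", "Sex", "SibSp", "Parch"]].values.astype(float)
y = data["Survived"].values
```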

The neural network itself was kept simple, with five layers: an input layer, three hidden layers of 300 nodes each, and an output layer with a single node (survival). The model was kept simple because it already produced adequate results. The model was trained in 15 rounds; after each round (also called an epoch) the model used back-propagation to adjust its weights for more accuracy. After training, the model showed an in-sample training accuracy of 82.24 percent and an out-of-sample accuracy of 81.25%. This was rounded down to 80% when the model's accuracy was presented in the survey.

The input layer contained four nodes, one for each of the input variables. These values were forwarded to the three hidden layers.

The hidden layers used 300 nodes each, together with a 'uniform' weight initialization and a 'relu' activation function. The weight initialization determines the starting values of the weights and helps the model converge faster, and thus reach its optimal accuracy sooner. The 'uniform' initialization draws all initial weights from the same (uniform) distribution (Thimm & Fiesler, 1997); since this is a neutral starting point, it is hard to end up with a badly initialized model. Seeing that the accuracy of the neural network was fairly high, the initializer was judged to work satisfactorily. The 'relu' activation works very simply: if a node receives a value of less than zero, it passes forward a value of zero. It is known to be computationally cheap (Alrefaei & Andradóttir, 2005) and is widely used in other scientific papers (Jiang et al., 2018).

Lastly, the output layer transformed the values of the hidden layers into a single value, the probability of that passenger surviving. The output layer again used a 'uniform' initialization, together with a 'sigmoid' activation that translates the value into a probability which can be used by the human agent. The model was compiled using the 'Adam' optimizer (Kingma & Ba, 2014), with binary cross-entropy as the loss function.
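The sketch below shows what such an architecture can look like in Keras, the library used in the Datacamp courses mentioned later in this section; layer sizes, initializers, activations, optimizer and loss follow the description above, but the exact implementation in appendix one may differ.

```python
# A sketch of the described architecture, assuming Keras (see appendix one
# for the code actually used in the thesis).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# Three hidden layers of 300 nodes, relu activations, uniform weight initialization;
# the input shape of 4 corresponds to class, gender, SibSp and ParCh
model.add(Dense(300, activation="relu", kernel_initializer="uniform", input_shape=(4,)))
model.add(Dense(300, activation="relu", kernel_initializer="uniform"))
model.add(Dense(300, activation="relu", kernel_initializer="uniform"))
# Output layer: one sigmoid node giving the survival probability
model.add(Dense(1, activation="sigmoid", kernel_initializer="uniform"))

# Adam optimizer and binary cross-entropy loss, as described above
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training for 15 epochs on the preprocessed data (X, y from the earlier sketch):
# model.fit(X, y, epochs=15, validation_split=0.2)
```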

Finally, the LIME explanatory mechanism was added, using the variables class, gender, the presence of siblings or a spouse, and the presence of a parent or child to explain why the neural network makes a certain prediction. To make the labels more interpretable, they were later manually changed to a more readable version. The whole process was executed in Google Colaboratory. To code the LIME explanatory mechanism we followed a guide on Medium (Dataman, 2020). To code the neural network we drew heavily on the lessons in the Datacamp courses "Introduction to deep learning with Keras" and "Advanced deep learning with Keras"; Datacamp is a digital learning environment paid for by the University of Groningen. For the full code used to program the model, please consult appendix one.
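As an illustration, a minimal sketch of attaching a LIME explainer to such a model with the lime package is given below; it assumes the model and feature matrix X from the earlier sketches, and the Medium guide the author followed (Dataman, 2020) may structure the code differently.

```python
# A hedged sketch of wrapping the trained network in a LIME tabular explainer;
# "model" and "X" are assumed from the preceding sketches.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["Class", "Gender", "SibSp", "ParCh"]

# LIME expects class probabilities for every class, so the single sigmoid
# output is expanded to [P(did not survive), P(survived)]
def predict_proba(rows):
    p = model.predict(rows)
    return np.hstack([1.0 - p, p])

explainer = LimeTabularExplainer(
    X,                                   # training data used to sample perturbations
    feature_names=feature_names,
    class_names=["Did not survive", "Survived"],
    categorical_features=[0, 1, 2, 3],
    mode="classification",
)

# Explain one passenger: which feature values pushed the prediction up or down
explanation = explainer.explain_instance(X[0], predict_proba, num_features=4)
print(explanation.as_list())
```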


3.3: Research method

To test the hypotheses we conducted a survey. First, participants received a general introduction together with some base statistics about the survival rate on the Titanic. The general accuracy of the machine learning model was also presented so that participants could make a fully informed decision.

Second, the participants were randomized between two groups: the control group simply received a machine learning recommendation, and the treatment group received a machine learning recommendation with a LIME explanation. The treatment group received an additional paragraph explaining the LIME mechanism and how to interpret it. Each group saw eight historical passengers together with their features and a prediction of the neural network on whether they had survived the Titanic disaster. The participants then had to decide whether they thought these passengers had survived. These questions could not be skipped and there was no neutral option, in order to make sure we also collected data on fringe cases where there is no clear answer.

For an example of a question for the control group, see figure 5.

Figure 5: example question control group

For an example of a question for the treatment group, see figure 6.

Figure 6: example question treatment group

Both groups received the same passengers with the same attributes. We made sure that the questions were balanced and uncorrelated with each other using the guidelines set out for a conjoint analysis: we balanced the attributes of the passengers, like class and children on board, so that no spurious correlations could arise. The treatment group received, alongside the neural network prediction, an interpretation of the model from the LIME explanatory mechanism. After each prediction, participants of both groups were asked to judge the interpretability of the model.

The survey closed with some general demographic questions that participants were allowed to skip if they preferred. These general questions contained an attention check, which tests whether participants are actually reading the questions or mindlessly filling in the blanks; those failing the attention check were excluded from the results. No reward was given (or offered) to those who completed the survey. For the full survey, see appendix three.

3.4: Data collection

During the survey, information was gathered on five variables that correspond with the conceptual model. Table 2 describes these five variables, how they were asked, and how they were measured.

Variable | Survey question | Measurement
Human accuracy | "Has the passenger survived the Titanic?" | Percentage correct
Interpretability | "Do you understand why the neural network has given this prediction?" | Likert scale
Trust | "How much would you trust this model to make decisions in real life?" | Likert scale
Explanatory mechanism | Whether there is an explanation (1) or not (0) | NA
Expertise | "Have you read or watched any non-fiction books/movies about the Titanic, except for the movie Titanic made in 1997, starring Leonardo DiCaprio?" | Yes/no

Table 2: variables of the conceptual framework and their measurement method

For the results to be reliable, the sample size needs to be large enough to compensate for outliers (Fischer & Julsing, 2019). We aimed to collect 150 responses to our survey; however, we only managed to collect 122 usable responses, which limited our analysis somewhat. Second, the sample needs to be representative, with people coming from all layers of the population. Age, gender and education were recorded in order to monitor this. Some over-representation of particular groups is expected, since even randomized data shows patterns.

Participants were recruited from the personal network of the author and his acquaintances. A secondary source of participants were Facebook and Reddit groups for survey collection. There was no monetary incentive for participating in the research.

3.5: Plan of analysis

After executing the experiment we analyze the data. For the first three hypotheses we compare the treatment sample to the control sample; for this we use a multiple regression. For the influence of participant characteristics we also use a multiple regression. In order to calculate the accuracy of the participants we first calculate a test score by comparing the given answers to the correct answers. For an overview, see table 3.

Hypothesis | Statistical method
H1: When a machine learning model contains an explanatory mechanism, the human agent increases in accuracy. | Multiple regression
H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy. | Multiple regression
H3: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more interpretable. | Multiple regression
H4: The age of the human agent negatively influences the causal link between machine learning support and human agent accuracy. | Multiple regression
H5: The gender of the human agent influences the causal link between machine learning support and human agent accuracy. | Multiple regression

Table 3: hypotheses and statistical methods
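To illustrate the planned analyses, the sketch below runs the two main regressions with statsmodels, assuming the per-participant results have been exported to a hypothetical survey_results.csv with hypothetical column names (accuracy, trust, lime, age, gender, titanic_knowledge, ml_experience); the analysis code actually used is in appendix two and may differ.

```python
# A sketch of the multiple regressions for the hypotheses, under the assumptions above.
import pandas as pd
import statsmodels.formula.api as smf

responses = pd.read_csv("survey_results.csv")   # hypothetical export: one row per participant

# H1 (with H4/H5 as controls): accuracy regressed on the treatment dummy
# plus participant characteristics
accuracy_model = smf.ols(
    "accuracy ~ lime + age + gender + titanic_knowledge + ml_experience",
    data=responses,
).fit()
print(accuracy_model.summary())

# H2: the same specification with trust as the dependent variable
trust_model = smf.ols(
    "trust ~ lime + age + gender + titanic_knowledge + ml_experience",
    data=responses,
).fit()
print(trust_model.summary())
```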

3.6: Conclusion

In this chapter we set up an experiment to collect data on the five variables described above. Furthermore, a description of the model parameters was provided, together with a description of the process by which the model was developed. With the experiment we collect the data needed to test the hypotheses formulated in the theoretical framework. In the next chapter we analyze the collected data in order to conclude whether the hypotheses can be accepted or rejected.


Chapter 4: Data analysis

In this chapter we analyze the data collected from the experiment described in the previous chapter to confirm or reject the hypotheses stated before. For this we use various statistical methods. We explain how the analysis was executed for maximum transparency and end with an interpretation of the results. For the full code used for the analysis, please consult appendix two.

4.1: Sample analysis

Data collection started on the 5th of May and ended on the 15th of May, yielding 145 responses. The data was exported to a comma-separated values (CSV) file. After rejecting participants who failed the attention check (including those who quit halfway), 122 respondents remained.

To calculate accuracy, each estimate was checked against the correct answer and marked TRUE (correct) or FALSE (incorrect). The mean of these answers, with TRUE = 1 and FALSE = 0, gave each participant a score. The highest score was 87.5%, the lowest 12.5%.
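A minimal sketch of this scoring step, using hypothetical answer vectors for one participant (1 = survived, 0 = did not survive):

```python
# Score one participant: the share of the eight estimates that match the
# historical outcomes (the answers below are made up for illustration).
import numpy as np

def participant_score(given_answers, correct_answers):
    return np.mean(np.array(given_answers) == np.array(correct_answers))

print(participant_score([1, 0, 1, 1, 0, 1, 0, 1],
                        [1, 0, 1, 0, 0, 1, 1, 1]))   # -> 0.75
```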

Of the 122 participants, 55 (45%) were female and 67 (55%) were male. The average age was 29 years. Most (107 participants) were Dutch, with 7 Germans, 3 Belgians and 5 participants of other nationalities. 73 participants (59%) stated they had never read or watched anything about the Titanic other than the popular 1997 movie; 49 participants (41%) had.

There were some technical problems. Due to an editing error, a question shown to the control group (which did not receive a LIME explanation) referred to the LIME model that was not there; one participant complained about this. A second participant found the question vague, asking: "Do you mean whether I understand the model or whether I agree with it?" No other participants complained about this, so the confusion was probably limited. Lastly, we collected fewer responses than expected.

4.2: Reliability, validity, representativity

For a good analysis, the sample of the population who did the experiment needs to be representative, valid and reliable (Fischer & Julsing, 2019). As shown above, the gender division in the sample is roughly equal; there are a few more men than women, but not enough to significantly skew the results.

As shown in figure 7, the sample does have a clear distortion in favor of younger participants. The bulk of the participants is between twenty and forty years old, with only a few older than forty. There are also few participants younger than fifteen, but since they are not expected to use machine learning models for decision support, this does not distort the sample. The skew towards younger participants, however, is a drawback. Looking at nationality we see another distortion: the vast majority of participants is Dutch, which means not all nationalities are well represented. This is a limitation of the study.

Figure 7: age distribution of the sample

Because the real answers were not shown to the participants, no maturation or learning effect could take place, ensuring internal validity (Nair, 2009). The chance that the findings are an outlier, and that more testing would lead to less extreme scores (also called regression to the mean), is a possibility, though unlikely given the high significance levels described in the next paragraph. Participants were divided over the groups at random, ensuring no systematic group bias occurred. Lastly, the drop-out rate of the participants is worrying. It cannot be established whether the twenty-three participants who dropped out did so because of a systematic flaw or at random. This may prove to be a threat to the validity of the research and needs to be studied in any replication study.

4.3: Hypotheses and statistical tests

The first hypothesis states that participants with a LIME-explanator are more accurate (when a machine learning model contains an explanatory mechanism, the human agent increases in accuracy). The average accuracy for the treatment group with LIME is higher (M = 0.64, SD = 0.115) than for the control group (M = 0.566, SD = 0.172). To control for the effect of participant characteristics (see hypotheses four and five), we performed a multiple regression. We found that the LIME-explanatory mechanism had a significant, positive effect (B = 0.1037, p = 0.0003). We can therefore conclude that having an explanatory mechanism increases accuracy significantly. For the full results, please consult table 4.


Variable                            Estimate      Std. Error   t value   Pr(>|t|)
(Intercept)                          6.215e-01    4.109e-02    15.125    < 2e-16 ***
LIME                                 1.037e-01    2.793e-02     3.712     0.0003 ***
Age                                 -7.766e-05    1.104e-03    -0.070     0.9440
Gender                              -6.903e-02    2.832e-02    -2.438     0.0167 *
Knowledge on the Titanic            -7.027e-02    3.505e-02    -2.005     0.0473 *
Experience with machine learning    -2.572e-03    3.940e-02    -0.065     0.9481

Table 4: the influence of the LIME-explanatory mechanism on accuracy when controlling for consumer characteristics
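The coefficient tables in this paragraph come from ordinary least squares regressions. As a minimal sketch, the regression behind table 4 could be specified in R as below; the data frame d and its column names are hypothetical and do not necessarily match the code in appendix two.

# Minimal sketch of the hypothesis-one regression. Assumes a hypothetical data
# frame d with one row per participant and columns: accuracy (share of correct
# estimates), lime (0 = control, 1 = treatment), age, gender, titanic_knowledge
# and ml_experience.
h1_model <- lm(accuracy ~ lime + age + gender + titanic_knowledge + ml_experience,
               data = d)
summary(h1_model)  # prints the Estimate, Std. Error, t value and Pr(>|t|) columns

The trust and interpretability models reported in tables 5 and 6 follow the same specification with a different dependent variable.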

The second hypothesis states that participants with a LIME-explanator trust the model more (when a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy). Again we performed a multiple regression to control for consumer characteristics. We found that the treatment group (M = 3.81, SD = 0.661) sees the machine learning model as significantly more trustworthy (B = 0.8955, p = 8.8e-08) than the control group (M = 2.78, SD = 0.917). Hence we can conclude that having a LIME-explanatory mechanism increases

trust significantly. There is no relation between the age of the participant and the trust they place in the machine learning model; a Pearson correlation showed no significant association (r(120) = -0.093, p = 0.352).

Variable                            Estimate   Standard Error   t-value   p-value
Intercept                            2.5885    0.2307           11.222    < 2e-16
LIME                                 0.8955    0.1568            5.711    8.8e-08
Age                                  0.0021    0.0062            0.344    0.7317
Gender                               0.2834    0.1589            1.783    0.0772
Knowledge on the Titanic            -0.2883    0.1967           -1.466    0.1455
Experience with machine learning     0.4098    0.2212            1.853    0.0664

Table 5: the influence of the LIME-explanatory mechanism on trust when controlling for consumer characteristics
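The age-trust check reported above is a simple Pearson correlation. A minimal sketch in R, again assuming the hypothetical data frame d with a numeric trust rating and age in years per participant:

# Pearson correlation between age and trust; with 122 respondents the degrees of
# freedom are n - 2 = 120, which is why the result is reported as r(120).
cor.test(d$age, d$trust, method = "pearson")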


The third hypothesis states that participants with a LIME-explanator see the model as more interpretable (when a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more interpretable). After performing a multiple regression we find that the treatment group (M = 3.63, SD = 0.484) sees the machine learning model as significantly more interpretable (B = 0.3966, p = 0.0003) than the control group (M = 3.25, SD = 0.568). Thus we can conclude that having a LIME-explanatory mechanism significantly increases the interpretability of the machine learning model.

Variable                            Estimate   Standard Error   t-value   p-value
Intercept                            3.1764    0.1563           20.305    < 2e-16
LIME                                 0.3966    0.1062            3.733    0.0003
Age                                  0.0033    0.0042            0.785    0.4343
Gender                              -0.0014    0.1077           -0.013    0.9895
Knowledge on the Titanic            -0.1866    0.1333           -1.400    0.1642
Experience with machine learning     0.1250    0.1499            0.834    0.4059

Table 6: the influence of the LIME-explanatory mechanism on interpretability when controlling for consumer characteristics

Lastly we looked at the influence of the participants' characteristics (the age of the human agent influences the causal link between machine learning support and human agent accuracy negatively; the gender of the human agent influences the causal link between machine learning support and human agent accuracy). We used a multiple linear regression to predict the accuracy of the participant from their age, gender, whether they had any expertise on the Titanic and whether they had ever worked with machine learning methods before. A significant link between these variables was found (F(4, 117) = 2.882, p = 0.028), with an R² of 0.09. After controlling for the number of variables in the model we found an adjusted R² of 0.0568. We find that men are slightly worse at predicting (β = -0.06, p = 0.033) than women. Having read or watched non-fiction about the Titanic decreased accuracy (β = -0.08, p = 0.042). Neither age (β = -0.0008, p = 0.475) nor experience with machine learning (β = 0.04, p = 0.345) influenced accuracy significantly (see table 7).


Variable                            Estimate   Standard Error   t-value   p-value
Intercept                            0.6768    0.0403           16.781    < 2e-16
Age                                 -0.0008    0.0011           -0.717    0.4748
Gender                              -0.0643    0.0298           -2.159    0.0329
Knowledge on the Titanic            -0.0759    0.0369           -2.059    0.0417
Experience with machine learning     0.0378    0.0399            0.949    0.3448

Table 7: Estimation of participant characteristics on the level of accuracy
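The overall fit reported for this model (the F-test, R² and adjusted R²) can be read directly from the model summary. A minimal sketch, still assuming the hypothetical data frame d:

# Participant-characteristics model behind table 7 (without the LIME dummy);
# all column names are hypothetical.
char_model <- lm(accuracy ~ age + gender + titanic_knowledge + ml_experience, data = d)
s <- summary(char_model)
s$fstatistic     # overall F-test, reported above as F(4, 117) = 2.882
s$r.squared      # R-squared (0.09)
s$adj.r.squared  # adjusted R-squared (0.0568)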

With an ANOVA test we examined whether participants rate passengers of their own gender as more likely to survive than participants of the other gender do. We found that female participants rated female passengers as more likely to survive (F(1, 120) = 5.882, p = 0.017) than male participants did, whereas male participants did not rate male passengers as more likely to survive (F(1, 120) = 1.019, p = 0.315) than female participants did. Reasons for this will be explained in the discussion.
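As a minimal sketch of this check, assuming hypothetical per-participant columns holding the share of female and male passengers each participant predicted to survive:

# Same-gender-optimism check; prop_female_yes, prop_male_yes and gender are
# hypothetical column names in the data frame d.
summary(aov(prop_female_yes ~ gender, data = d))  # reported above as F(1, 120) = 5.882
summary(aov(prop_male_yes   ~ gender, data = d))  # reported above as F(1, 120) = 1.019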

4.4: Interpretation

After having done the statistical tests, and having rejected or accepted the hypotheses formulated earlier, we can draw some conclusions about the influence of the LIME-explanatory mechanism on the decision-making process.

First of all we can conclude that having a LIME-explanator has a significant positive effect on the

interpretability of the neural network. Participants who got a LIME-explanation rated the machine

learning model as significantly more interpretable than participants who did not.

Second, with all other variables held constant, we can also conclude that a LIME-explanator has a positive effect on trust in the model. In the literature overview (chapter two, paragraph three) we already gathered some explanations for why this would happen, and after collecting the data and applying statistical methods we can conclude that providing a LIME-explanation mechanism has a significant and positive effect on the trust one places in the machine learning model.

Thirdly, a central question of the research was whether providing a LIME-explanation increases the accuracy of the human agent in making decisions and predictions. After analysis we can conclude that it does. We do not, however, know for certain why there is a positive effect, but we can conclude


with a fairly high degree of confidence that having a LIME-explanator does make participants more accurate than participants without one. It should be noted that the control group, even though it did not receive a LIME-explanator, did receive the same machine learning model with the same level of accuracy.

Lastly we concluded that women are slightly more accurate than men. We also found that women are more optimistic than men in predicting the survival of passengers of their own gender. In the next chapter we will describe why we do not think this is a conclusion that can be generalized to other cases. We will therefore refrain from interpreting these results.

4.5: Conclusion

After having gathered and analyzed the data we have concluded that providing a LIME-explanator positively influences the trust a participant places in the machine learning model. We also concluded that the LIME-explanator increases interpretability significantly and makes participants significantly more accurate. In the next chapter we will look at the extent to which we can generalize these findings.


Chapter 5: Discussion, limitations and recommendations

In the previous chapters hypotheses were formulated and tested. We have accepted some of the

hypotheses and rejected others. In this chapter we will discuss the findings, the limitations of the

research performed and conclusions that we can draw from this research, for managers as well as

academics.

5.1: Reflective discussion on the results

During this research we have investigated the influence of an explanation on trust as well as accuracy. As we saw in the previous chapter, the amount of trust an average participant places in the machine learning model's ability to predict accurately increases significantly when an explanation is provided. Participants also become more accurate in predicting the survival of passengers when they receive an explanation. The exact cause of the increase in accuracy remains uncertain. The increase can be explained in two different ways, both of which circle back to the literature overview of chapter two.

1) The Human Replacement-explanation: a possible explanation is that an increase in trust causes participants to rely more on the machine learning model than on their own intuition. Since algorithms are in general more accurate than human judgement (Dawes, 1979), this could explain the increase in accuracy.

2) The different-viewpoint explanation: participants that received the LIME-explanatory mechanism viewed their model as significantly more interpretable. They understood why the model predicted a certain outcome and, due to this understanding, had an opportunity to overrule the model. If the neural network predicted that Paul Chevre would survive because he traveled first class, the participant may have an extra piece of information. Perhaps he or she has visited the grave of Paul Chevre, or has seen a documentary in which he offered up his seat to a child. Without the LIME-explanation it would be unclear what information had been taken into account whilst making the prediction.

In the literature overview we condensed a few reasons why it might be preferable to have a human make the final decision: he can be held accountable, cannot 'cheat', and understands the difference between a proxy goal and the end goal (Doshi-Velez, Kim, 2017). The Human Replacement-explanation takes a dim view of this perspective. If a human becomes more accurate when he relies more on the model, can he become even more accurate by relying on the model entirely? And if he does that, what is the use of the human decision-maker? If the different-viewpoint explanation is true, companies should do the opposite and let their most experienced employees work with the machine learning models, since they are the reason for the superior performance.

The clash between these two ideas shows the importance of further research in the field of interpretable machine learning. From the gathered data we might get the impression that knowledge about the Titanic has a negative influence on accuracy, since the coefficient was negative (-0.076) and significant. However, it should be kept in mind that having read a book or seen a documentary about the Titanic hardly provides the detailed knowledge needed for this survey. On the other hand, the research showed that human agents do become more accurate when using an interpretable machine learning model. This is confirmatory evidence for the previously mentioned strand of literature stating that humans can benefit from working together with machine learning models (Zuo, 2019; Doshi-Velez, Kim, 2017). More research is needed to provide a definitive answer on where this increased accuracy comes from.

A second finding to discuss is the fact that female participants rate female passengers on the Titanic as more likely to survive than male participants do, whereas male participants do not rate male passengers as more likely to survive than female participants do. There are two psychological biases that can explain this phenomenon: the availability bias and anchoring.

The availability bias is the mental tendency to correlate the availability of a memory with the probability of the event happening (Kidd et al., 1983). A popular example is that we estimate death by a terrorist attack to be more likely than death by a car accident, even though the latter is a thousand times more likely ("What do Americans fear?", 2020). A terrorist attack is very memorable and will therefore be very available. The availability bias can explain why women see other women as more likely to survive. In the popular 1997 movie about the Titanic by James Cameron, the male protagonist dies and the female counterpart survives. Since the movie is more popular among women, it is likely that they are more affected by the availability bias than men (Todd, 2014).

A second explanation is the anchoring bias. When placed in a new situation, the first data points and decisions guide us through the rest of the process (Aronson et al., 2017). In both versions of the survey a female passenger is introduced first. In the non-LIME variant it is Nelle Stevenson with a predicted survival rate of 98%, in the LIME variant it is Carrie Toogood with a survival rate of 97%. Both first observations are female passengers with a very high chance of survival. It is possible that participants used these observations as an anchor that women are likely to survive, and that this effect spilled over into the rest of the decision process. If so, the effect should disappear if a replication study uses a more neutral first question.

5.2: Limitations

During the research process certain decisions were made which limit the findings and the generalizability of the conclusions drawn in the previous chapter. For clarity we have grouped these limitations into two categories: limitations regarding the research design and other limitations.

First of all, participants were asked a closed question with only two options: yes or no. These types of decisions happen regularly in real life (Fischhoff, 1996; Nutt, 1993), though more open decisions also occur and are not represented in the survey. Whether the influence of an explanatory mechanism generalizes to open-ended decisions is uncertain at best.

Second, the research had a high drop-out rate: 23 of the 145 participants dropped out, almost 16%. It is likely that this drop-out is random, since the control group and the treatment group are of the same size. However, we cannot know for sure and the drop-out rate should be kept in mind if the experiment is replicated. The high drop-out rate also resulted in a final sample of only 122 respondents, below the set target of 150 respondents. While the hypotheses could still be tested with a high level of significance, a replication with a larger sample size is preferable.

While the size of the sample is one concern, its lack of diversity is another. As stated in chapter four, the sample is mainly Dutch and young and does not represent the population as a whole very well. Age differences might therefore not be adequately interpreted.

Third is the generalizability of the conclusions to other types of machine learning models or explanation mechanisms. During the research we chose a neural network because it is opaque and complicated. Had a decision tree, another form of machine learning, been used as the model, results could have differed. The neural network used was also simple in setup; as more complicated neural networks capture deeper patterns, the question remains whether LIME-explanators can capture this increased complexity in their recommendations. It is hard to make a prediction on this, hence more research is needed.

Lastly, participants were asked to make predictions about a sample of Titanic passengers for which the most relevant parameters and the distribution of the survival rate per parameter were known. This may not be applicable to all scenarios. In predicting the success of a start-up, not all relevant parameters are known, nor the distribution of success per parameter. Whether an explanatory mechanism helps in such a scenario depends on the causal mechanism described earlier.


5.3: Academic and managerial conclusions

Interpretable machine learning offers opportunities for companies in ways we have yet to learn fully. In general we can conclude from this research that it is advisable for companies that already use machine learning to make their algorithms interpretable for their employees and customers. Interpretable machine learning models would make employees better decision-makers and help prevent mistakes that computers simply miss. Second, making machine learning models more interpretable creates more trust between the customer and the company than an uninterpretable model, which can evoke negative feelings. Lastly, new EU regulation requires companies to provide an explanation to their customers upon request anyway. Complying with this new legislation is without doubt a high priority for most firms.

For academics, more research is needed on the influence of different machine learning methods and explanatory mechanisms on human accuracy. In the research executed we used one machine learning method (a neural network) and only one explanatory mechanism. It could very well be that other machine learning methods in combination with other explanatory mechanisms would deviate from the findings reported here, and that the conclusions drawn here therefore cannot be generalized. Secondly, the effectiveness of interpretable machine learning decision support is relatively untested in more unpredictable environments where not all the facts nor all the important variables are known. During the experiment the number of outcomes was fairly restricted (survived/did not survive) and the relevant variables on which to base a decision were known. Many decisions are not as structured but do carry much importance and relevance for everyday life. While the importance is great, little research has been done, which provides a rare opportunity.

5.4: Conclusion

After the research a fundamental question remains: do humans and algorithms perform better together or in isolation? Even though we have not answered this question definitively, we have provided an answer on how they can work better together. We have also shown why the observed same-gender optimism may stem from a design flaw in the experiment. We have limited our conclusions to fit the experiment performed, and we have concluded with some general recommendations for managers and ideas for future research.


References

Al-Qaheri, H. Hasan, M. (2010). An End-User Decision Support System for Portfolio Selection: A Goal

Programming Approach with an Application to Kuwait Stock Exchange (KSE). International Journal of

Computer Information Systems and Industrial Management Applications.

Alrefaei, M., & Andradóttir, S. (2005). Discrete stochastic optimization using variants of the stochastic

ruler method. Naval Research Logistics (NRL), 52(4), 344-360. https://doi.org/10.1002/nav.20080

Aronson, E., Wilson, T., Fehr, B., & Sommers, S. (2017). Social psychology (9th ed.). Pearson.

B. Kim, J. Shah, F. Doshi-Velez (2015). Mind the gap: A generative approach to interpretable feature

selection and extraction. NIPS

Bank of England. (2019). Machine learning in UK financial services. London: Bank of England.

Retrieved from https://www.bankofengland.co.uk/-/media/boe/files/report/2019/machine-learning-

in-uk-financial-services.pdf

Biran, O., Cotton, C. (2017). Explanation and justification in machine learning: a survey. Workshop on Explainable Artificial Intelligence (XAI), pp. 8-13.

Bode, J. (1998). Decision support with neural networks in the management of research and

development: Concepts and application to cost estimation. Information & Management, 34(1), 33-

40. doi: 10.1016/s0378-7206(98)00043-3

Bostrom, Nick; Yudkowsky, Eliezer (2011). "The Ethics of Artificial Intelligence" (PDF). Cambridge

Handbook of Artificial Intelligence. Cambridge Press.

C. Yang, A. Rangarajan and S. Ranka, Global model interpretation via recursive partitioning. In IEEE

20th International Conference on High Performance Computing and Communications; IEEE 16th

International Conference on Smart City; IEEE 4th International Conference on Data Science and

Systems (HPCC/SmartCity/DSS), pp. 1563-1570, 2018, June.

Cauffman, E., Shulman, E., Steinberg, L., Claus, E., Banich, M., Graham, S., & Woolard, J. (2010). Age

differences in affective decision making as indexed by performance on the Iowa Gambling Task.

Developmental Psychology, 46(1), 193-207. https://doi.org/10.1037/a0016128

Chapman, B., Duberstein, P., Sörensen, S., & Lyness, J. (2007). Gender differences in Five Factor

Model personality traits in an elderly cohort. Personality And Individual Differences, 43(6), 1594-

1603. https://doi.org/10.1016/j.paid.2007.04.028


Chatterjee, T. (2018). Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine

Learning Algorithms. International Journal Of Emerging Research In Management And Technology,

6(6), 1. https://doi.org/10.23956/ijermt.v6i6.236

Cialdini, R. (2014). Influence: science and practise (6th ed.). Harlow, Essex: Pearson.

Cio Summits. (2019). What consumers really think about AI (p. 2). Cio Summits.

COMPLEXITY | meaning in the Cambridge English Dictionary. (2020). Retrieved 19 February 2020,

from https://dictionary.cambridge.org/dictionary/english/complexity

Computer Crushes the Competition on 'Jeopardy!'. (2020). Retrieved 26 February 2020, from

https://www.cbsnews.com/news/computer-crushes-the-competition-on-jeopardy/

D Gkatzia, O Lemon, and V Rieser. (2016) Natural language generation enhances human

decisionmaking with uncertain information. In ACL

DARPA Announces $2 Billion Campaign to Develop Next Wave of AI Technologies. (2020). Retrieved

20 February 2020, from https://www.darpa.mil/news-events/2018-09-07

Dataman, D. (2020). Explain Your Model with LIME. Medium. Retrieved 15 May 2020, from

https://medium.com/analytics-vidhya/explain-your-model-with-lime-5a1a5867b423.

Dawes, R. (1979). The robust beauty of improper linear models in decision making. American

Psychologist, 34(7), 571-582. doi: 10.1037/0003-066x.34.7.571

de Bruin, W., Parker, A., & Fischhoff, B. (2010). Explaining adult age differences in decision-making

competence. Journal Of Behavioral Decision Making, 25(4), 352-360.

https://doi.org/10.1002/bdm.712

Doshi-Velez, F., Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint, 1-13.

Encyclopedia Titanica. 2020. Titanic Survivors. [online] Available at: <https://www.encyclopedia-

titanica.org/titanic-survivors/> [Accessed 5 May 2020].

Ericsson, A. (2006). The Cambridge handbook of expertise and expert performance (2nd ed.). Cambridge University Press.

Ericsson, K. (2004). Deliberate Practice and the Acquisition and Maintenance of Expert Performance

in Medicine and Related Domains. Academic Medicine, 79(Supplement), S70-S81.

https://doi.org/10.1097/00001888-200410001-00022


Ericsson, K., & Lehmann, A. (1996). EXPERT AND EXCEPTIONAL PERFORMANCE: Evidence of Maximal

Adaptation to Task Constraints. Annual Review Of Psychology, 47(1), 273-305.

https://doi.org/10.1146/annurev.psych.47.1.273

Female Business Leaders: Global Statistics. Catalyst. (2020). Retrieved 4 June 2020, from

https://www.catalyst.org/research/women-in-management/.

Fischer, T., & Julsing, M. (2019). Onderzoek doen !. Groningen: Noordhoff Uitgevers.

Fischhoff, B. (1996). The Real World: What Good Is It?. Organizational Behavior And Human Decision

Processes, 65(3), 232-248. https://doi.org/10.1006/obhd.1996.0024

Fong, R., Vedaldi, A. (2017). Interpretable Explanations of Black Boxes by Meaningful Perturbation. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV).

Goodfellow, I., Bengio, Y., & Courville, A. (2017). Deep learning. The MIT Press.

Greenemeier, L. (2020). 20 Years after Deep Blue: How AI Has Advanced Since Conquering Chess.

[online] Scientific American. Available at: https://www.scientificamerican.com/article/20-years-after-

deep-blue-how-ai-has-advanced-since-conquering-chess/ [Accessed 26 Feb. 2020].

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2018). A Survey of

Methods for Explaining Black Box Models. ACM Computing Surveys, 51(5), 1-42. doi:

10.1145/3236009

Halford, G., Baker, R., McCredden, J., & Bain, J. (2005). How Many Variables Can Humans Process?.

Psychological Science, 16(1), 70-76. doi: 10.1111/j.0956-7976.2005.00782.x

Hao, K. (2019). We analyzed 16,625 papers to figure out where AI is headed next. Retrieved 4 March

2020, from https://www.technologyreview.com/s/612768/we-analyzed-16625-papers-to-figure-out-

where-ai-is-headed-next/

Herlocker, Konstan, Riedl. (2000) Explaining collaborative filtering recommendations. Computer

supported Cooperative Work (CSCW)

Hinson, J., Jameson, T., & Whitney, P. (2003). Impulsive decision making and working memory.

Journal Of Experimental Psychology: Learning, Memory, And Cognition, 29(2), 298-306. doi:

10.1037/0278-7393.29.2.298

Hodjat, B. (2015). The AI Resurgence: Why Now?. Retrieved 4 March 2020, from

https://www.wired.com/insights/2015/03/ai-resurgence-now/


Hu, X., Niu, P., Wang, J., & Zhang, X. (2019). A Dynamic Rectified Linear Activation Units. IEEE Access,

7, 180409-180416. https://doi.org/10.1109/access.2019.2959036


Jiang, X., Pang, Y., Li, X., Pan, J., & Xie, Y. (2018). Deep neural networks with Elastic Rectified Linear

Units for object recognition. Neurocomputing, 275, 1132-1139.

https://doi.org/10.1016/j.neucom.2017.09.056

Kidd, J., Kahneman, D., Slovic, P., & Tversky, A. (1983). Judgement under Uncertainty: Heuristics and Biases. The Journal Of The Operational Research Society, 34(3), 254. https://doi.org/10.2307/2581328

Kingma, D.P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv.

https://arxiv.org/abs/1412.6980

Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human Decisions and

Machine Predictions*. The Quarterly Journal Of Economics. doi: 10.1093/qje/qjx032

L. Xu, K. Crammer, D. Schuurmans. (2006) Robust support vector machine training via convex outlier

ablation. AAAI

Langley, P., & Simon, H. (1995). Applications of machine learning and rule induction. Communications

Of The ACM, 38(11), 54-64. doi: 10.1145/219717.219768

Letzter, R. (2020). Amazon just showed us that 'unbiased' algorithms can be inadvertently racist.

Retrieved 26 February 2020, from https://www.businessinsider.com/how-algorithms-can-be-racist-

2016-4?international=true&r=US&IR=T

M. Ancona, C. Öztireli and Gross, “Explaining Deep Neural Networks with a Polynomial Time

Algorithm for Shapley Values Approximation,” In ICML, 2019

M. Bilgic, R.J. Mooney (2005). Explaining recommendations: satisfaction vs. promotion. Workshop on the next stage of recommender systems research.

M.T. Ribeiro, S. Singh, C. Guestrin. (2016). “Why should I trust you?” Explaining the predictions of any

classifier. Proceedings of the ACM SIGKDD international conference on knowledge discovery and data

mining.

Mark or unmark Spam in Gmail - Computer - Gmail Help. (2020). Retrieved 14 February 2020, from

https://support.google.com/mail/answer/1366858?co=GENIE.Platform%3DDesktop&hl=en


Mary T. Dzindolet, Scott A. Peterson, Regina A. Pomranky, Linda G. Pierce, Hall P. Beck (2003) The

role of trust in automation reliance. International Journal of Human-Computer studies.

Matthias, A. (2004). The responsibility gap: Ascribing responsibility for the actions of learning

automata. Ethics And Information Technology, 6(3), 175-183. doi: 10.1007/s10676-004-3422-1

Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial

Intelligence, 267, 1-38. doi: 10.1016/j.artint.2018.07.007

Mueller, J., & Massaron, L. (2016). Machine Learning For Dummies. For Dummies.

Nair, S. (2009). Marketing research. Himalaya Pub. House.

Netzer, O., Lemaire, A., & Herzenstein, M. (2016). When Words Sweat: Identifying Signals for Loan

Default in the Text of Loan Applications. SSRN Electronic Journal. doi: 10.2139/ssrn.2865327

Nickerson, R. (1998). Confirmation Bias: A Ubiquitous Phenomenon in Many Guises. Review Of

General Psychology, 2(2), 175-220. doi: 10.1037/1089-2680.2.2.175

Nutt, P. (1993). The Identification of Solution Ideas During Organizational Decision Making.

Management Science, 39(9), 1071-1085. https://doi.org/10.1287/mnsc.39.9.1071

Official Journal of the European Union. Regulation (EU) 2016/679 of the European Parliament and of

the Council of 27 April 2016 on the protection of natural persons with regard to the processing of

personal data and on the free movement of such data, and repealing Directive 95/46/EC (General

Data Protection Regulation) (2016).

Or Biran and Kathleen McKeown.(2017) Human-centric justification of machine learning predictions.

In IJCAI, Melbourne, Australia.

Papenmeier, Englebienne, Seifert. (2019) How model accuracy and explanation fidelity influence user

trust in AI. arXiv(2019)

Powell, M., & Ansic, D. (1997). Gender differences in risk behaviour in financial decision-making: An

experimental analysis. Journal Of Economic Psychology, 18(6), 605-628.

https://doi.org/10.1016/s0167-4870(97)00026-3

R. Sinha, K. Swearingen (2002). The role of transparency in recommender systems. CHI EA.

R. Zhang, T.J. Brennan, A.W. Lo (2014). The origin of risk aversion. Proceedings of the National

Academy of Sciences of the United States of America.


Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use

interpretable models instead. Nature Machine Intelligence, 1(5), 206-215. doi: 10.1038/s42256-019-

0048-x

S. Carton, J. Helsby, K. Joseph, A. Mahmud, Y. Park, J. Walsh, C. Cody, E. Patterson, L. Haynes, R. Ghani. (2016). Identifying police officers at risk of adverse events. 22nd ACM SIGKDD international conference.

Saaty, T., & Ozdemir, M. (2003). Why the magic number seven plus or minus two. Mathematical And

Computer Modelling, 38(3-4), 233-244. doi: 10.1016/s0895-7177(03)90083-5

Shahid, N., Rappon, T., & Berta, W. (2019). Applications of artificial neural networks in health care

organizational decision-making: A scoping review. PLOS ONE, 14(2), e0212356. doi:

10.1371/journal.pone.0212356

Sundararajan, M., Taly, A., Yan, Q. (2017). Axiomatic Attribution for Deep Networks. Proceedings of the 34th International Conference on Machine Learning, volume 70, 3319-3328.

Symeonidis, P., Manolopoulos, Y. (2012). A generalized taxonomy of explanation styles for traditional and social recommender systems. Data Mining and Knowledge Discovery, 24(3), 555-583.

Taylor, R. (1975). Age and Experience as Determinants of Managerial Information Processing and

Decision Making Performance. Academy Of Management Journal, 18(1), 74-81.

https://doi.org/10.5465/255626

TechCrunch. (2017). Google is acquiring data science community Kaggle. [online] Available at: <https://techcrunch.com/2017/03/07/google-is-acquiring-data-science-community-kaggle/> [Accessed 5 May 2020].

Thiel, P., & Masters, B. (2015). Zero to one. [United States]: Bokish Ltd.

Thimm, G., & Fiesler, E. (1997). High-order and multilayer perceptron initialization. IEEE Transactions

On Neural Networks, 8(2), 349-359. https://doi.org/10.1109/72.557673

Todd, E. (2014). Passionate Love and Popular Cinema. Palgrave Macmillan.

Trejos, N. (2007). Existing-Home Sales Fall Steeply. Retrieved 4 March 2020, from

https://www.washingtonpost.com/wp-dyn/content/article/2007/04/24/AR2007042400627.html


Universal Declaration of Human Rights. (2020). Retrieved 26 February 2020, from

https://www.un.org/en/universal-declaration-human-rights/

Wernicke, S.: 2015, ‘How to use data to make a hit tv show’.

What do Americans fear?. ScienceDaily. (2020). Retrieved 28 May 2020, from

https://www.sciencedaily.com/releases/2016/10/161012160030.htm.

When Computers Decide: European Recommendations on Machine learning automated decision

making. (2020). Retrieved 14 February 2020, from

https://www.acm.org/binaries/content/assets/public-policy/ie-euacm-adm-report-2018.pdf

Women in Parliaments: World and Regional Averages. Archive.ipu.org. (2020). Retrieved 4 June 2020,

from http://archive.ipu.org/wmn-e/world.htm.

X. Zhang, J. Zhao, Y. LeCun (2016). Character-level Convolutional Networks for text classification.

arXiv.

Zimbardo, P., Johnson, R., & McCann, V. (2016). Psychology: Core concepts (8th ed.). Amsterdam:

Pearson Benelux.

Zion market research. (2017). Machine Learning Market by Service (Professional Services, and

Managed Services), for BFSI, Healthcare and Life Science, Retail, Telecommunication, Government

and Defense, Manufacturing, Energy and Utilities, Others: Global Industry Perspective,

Comprehensive Analysis, and Forecast, 2017-2024. Zion market research.

Zuo, Y. (2019). Research and implementation of human-autonomous devices for sports training

management decision making based on wavelet neural network. Journal Of Ambient Intelligence And

Humanized Computing. doi: 10.1007/s12652-019-01511-y


THE INFLUENCE OF INTERPRETABLE MACHINE LEARNING ON HUMAN ACCURACY

By: R.D. Sturm


THEORETICAL AREA OF FOCUS

• Decision making

• Machine learning

• Machine learning decision support


“There are very few examples of people

outperforming algorithms in making

predictive judgments. So when there’s the

possibility of using an algorithm, people

should use it” D. Kahneman


CONCEPTUAL FRAMEWORK

• Complex algorithms are on average more accurate than

human agents (Dawes, 1979)

• Providing an explanation increases acceptance and trust

(Herlocker, Konstan, Riedl, 2000; Dzindolet et al, 2003)

• Low trust will cause low acceptance due to “silly mistakes”

(Guidotti et al, 2018)


CONCEPTUAL MODEL

• The more complex a model is, the harder it is to

understand

• But the easier it is to understand the model, the higher the

accuracy of the human agent

• An explanatory mechanism increases understandability

• Secondly, the explanatory mechanism increases trust, which

in turn increases accuracy


RESEARCH DESIGN

• Development of a neural network with a LIME explanatory

mechanism

• Experimental survey where participants try to estimate

survival-rates of passengers on the Titanic

• Two groups: treatment-group and a control-group

• Analysis of the data


MODEL DEVELOPMENT AND SURVEY

• Dataset: Titanic passengers, 891 observations

• Simple neural network with three hidden layers

• LIME-explanatory mechanism

• Survey collected with Qualtrics, 122 respondents


ANALYSIS

H1: When a machine learning model contains an explanatory

mechanism the human agent increases in accuracy.

Estimate: 0.1037, p = 0.0003

H2: When a machine learning model contains an explanatory

mechanism the human agent sees the machine learning model as more

trustworthy

Estimate = 0.8955, p = 8.8e -08

H3: When a machine learning model contains an explanatory

mechanism the human agent sees the machine learning model as more

interpretable.

Estimate = 0.3966, p = 0.0003

H4: the age of the human agent influences the causal link between

machine learning support and human agent accuracy negatively.

Estimate = -0.0008, p = 0.4748 (REJECTED)

H5: the gender of the human agent influences the causal link between

machine learning support and human agent accuracy.

Estimate = -0.0643, p = 0.0329


DISCUSSION

• Empirical evidence of improved performance

• Exact cause improved performance unclear

• Sample size was small and homogenous

• Classification task not representative of all decisions


IMPLICATION

• Practical: systematic decisions can be aided by a machine

learning model

• Theoretical: the effectiveness of interpretation techniques can be measured by the increase in accuracy


QUESTIONS