
Cognitive Psychology 76 (2015) 1–29


Identifying expectations about the strength of causal relationships

Saiwing Yeung a,*, Thomas L. Griffiths b

a Institute of Education, Beijing Institute of Technology, China
b Department of Psychology, University of California, Berkeley, United States

http://dx.doi.org/10.1016/j.cogpsych.2014.11.001
0010-0285/© 2014 Elsevier Inc. All rights reserved.

* Corresponding author at: Institute of Education, 5 South Zhongguancun Street, Haidian District, Beijing 100081, China. E-mail address: [email protected] (S. Yeung).

Article history: Accepted 3 November 2014

Keywords: Causal reasoning; Bayesian models; Computational modeling; Iterated learning

Abstract

When we try to identify causal relationships, how strong do we expect that relationship to be? Bayesian models of causal induction rely on assumptions regarding people's a priori beliefs about causal systems, with recent research focusing on people's expectations about the strength of causes. These expectations are expressed in terms of prior probability distributions. While proposals about the form of such prior distributions have been made previously, many different distributions are possible, making it difficult to test such proposals exhaustively. In Experiment 1 we used iterated learning—a method in which participants make inferences about data generated based on their own responses in previous trials—to estimate participants' prior beliefs about the strengths of causes. This method produced estimated prior distributions that were quite different from those previously proposed in the literature. Experiment 2 collected a large set of human judgments on the strength of causal relationships to be used as a benchmark for evaluating different models, using stimuli that cover a wider and more systematic set of contingencies than previous research. Using these judgments, we evaluated the predictions of various Bayesian models. The Bayesian model with priors estimated via iterated learning compared favorably against the others. Experiment 3 estimated participants' prior beliefs concerning different causal systems, revealing key similarities in their expectations across diverse scenarios.

© 2014 Elsevier Inc. All rights reserved.


1. Introduction

Inferring the relationship between causes and effects is an important skill that people rely on every day in order to understand the structure of their environment. Psychological models of human causal induction have often focused on the role of associative learning—how people's judgments might be related to the number of instances of an effect occurring in the presence and absence of a cause (e.g., Cheng, 1997; Shanks, 1995; Ward & Jenkins, 1965). Recent work has explored how ideas from Bayesian statistics might help to explain people's intuitions, using causal graphical models to precisely define the problem of causal induction (Griffiths & Tenenbaum, 2005; Lu, Yuille, Liljeholm, Cheng, & Holyoak, 2008) and to formalize the influence of prior knowledge (Griffiths & Tenenbaum, 2009).

A key part of Bayesian models of causal induction is the assumptions they make about the expectations people have about causal relationships. These assumptions are expressed in the form of prior distributions (often shortened to just priors), and reflect the expectations of the learners about causal systems prior to seeing any data. Bayesian models of human causal induction focus especially on the learners' prior beliefs about the strength of causal relationships, and two different specifications have been proposed thus far—using either a non-informative prior that makes no assumptions about the prior beliefs (Griffiths & Tenenbaum, 2005), or a prior based on theoretical assumptions about people's inductive biases (Lu et al., 2008). In this paper we present a new approach to identifying the expectations people have about causal systems, using the technique of iterated learning (Griffiths & Kalish, 2007; Griffiths, Christian, & Kalish, 2008). This technique allows us to estimate the values of unobservable cognitive constructs, such as people's priors, and produces predictions about human judgments that are more accurate than those of previous methods.

The plan for the rest of the paper is as follows. In the next section we summarize relevant previous work on Bayesian models of human causal induction. We then introduce the basic ideas behind iterated learning, and apply it in Experiment 1 to empirically estimate participants' prior beliefs about causal systems. Next we discuss a potential issue in previous work on human causal induction: models of causal induction are often evaluated based on small sets of stimuli that have certain specific characteristics. To address this issue, in Experiment 2 we collected a large benchmark data set against which the performance of different models can be better compared. We followed up with an investigation of prior beliefs about several different causal systems in Experiment 3. Finally we conclude with a discussion of the implications of these results for understanding causal induction as well as some possible future directions.

2. Bayesian models of human causal induction

Historically, there have been two major approaches to psychological theories of causal induction (Newsome, 2003). The mechanism-based approach focuses on understanding how knowledge about causal mechanisms influences reasoning. Here learners attempt to discover a process in which causal power possessed by entities is transmitted to generate the event (Ahn, Kalish, Medin, & Gelman, 1995). In contrast, the covariation-based approach focuses on how people use contingency information about cause and effect to identify causal relationships (Cheng, 1997; Shanks, 1995). In particular, much psychological research on causal induction has focused on the problem of elemental causal induction: given a number of observations of two binary variables with a plausible causal relationship, how do people assess the relationship between them? For example, we can imagine a scientist studying the effect of a certain chemical on clovers. While most clovers have three leaflets, some have four, and the scientist wants to know whether the presence of the chemical has the power to increase the proportion of four-leaf clovers. By planting a number of clover plants and applying the chemical to some of them, the scientist can use the resulting contingency data—the number of clover plants exposed or not exposed to the chemical, and the number of four-leaf clovers in each case—to evaluate this potential causal relationship.

More recently, researchers have proposed various models based on Bayesian statistics. While these models rely on covariation data, they explicitly represent the learners' subjective beliefs about the causal systems, and formally express how they update their beliefs after observing data. These models have been found to be able to make good predictions about human judgments, and could explain effects not predicted by previous models (Griffiths & Tenenbaum, 2005; Lu et al., 2008). This paper will focus on these Bayesian models.1

2.1. Bayesian inference

Bayesian models of causal induction are special cases of the more general approach of modeling human inductive inferences as applications of Bayesian statistics (Tenenbaum, Griffiths, & Kemp, 2006; Tenenbaum, Kemp, Griffiths, & Goodman, 2011). The principles of Bayesian inference indicate how a rational learner should update his or her beliefs in light of evidence. In the context of causal induction, this approach involves formally specifying a learner's a priori beliefs concerning a causal system in terms of a prior probability, and a likelihood function that describes the likelihood of observing certain data given these prior beliefs. The prior probability and the likelihood function combine to give the posterior probability, which represents the learner's updated beliefs concerning the causal system.

Formally, we represent a learner's a priori subjective beliefs concerning a causal system using a set of hypotheses H, in which each hypothesis h ∈ H has a prior probability P(h) that reflects the agent's degree of belief in the hypothesis being true.2 Bayes' rule indicates that the degree of belief in each hypothesis after observing data d is given by

P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in H} P(d \mid h')\, P(h')}

where P(d | h) is the likelihood and P(h | d) is the posterior probability. The posterior probability thus represents the learner's belief after the observations, and can be compared to empirical data as a way to evaluate to what extent the proposed model can approximate people's computations in causal induction.

Because these Bayesian models rely on computations concerning people's beliefs, these beliefs need to be formalized mathematically. However, formally specifying a prior distribution that captures the expectations of humans (or even just a select group of a more homogeneous population) is particularly difficult, and will be the main concern of the present paper.

2.2. Modeling causal induction

Griffiths and Tenenbaum (2005) presented a Bayesian analysis of causal induction, in the spirit of Marr (1982) and Anderson (1991). This analysis used causal graphical models to formulate the problem of causal induction and outlined a rational solution to this problem based on Bayesian inference. Causal graphical models are probabilistic models in which graphs are used to denote the causal relationships between variables (Pearl, 2009; Spirtes, Glymour, & Scheines, 2001). In these graphs, nodes represent variables and edges represent the causal connections between those variables. An elemental causal system is formalized using three variables: the background cause B, the candidate cause C, and the effect or outcome E (Fig. 1). In the scientist example discussed earlier, we have an observed effect E (four-leaf clovers), which might be caused by the candidate cause C (the chemical), or the background cause B (representing any other causes that are not of interest). We assume that both B and C can cause E, and this relationship is expressed by edges going from both B and C to E. We further assume B is always present and is generative, increasing the probability of the outcome, while C can be present or absent, and generative or preventive.

1 A different class of covariation-based computational models of causal reasoning focuses on deriving people's evaluation of causal relationships algebraically from a contingency table. These models include ΔP (Ward & Jenkins, 1965), causal power (Cheng, 1997), the EI rule (Perales & Shanks, 2007), etc. A comprehensive treatment of these models appears in Perales and Shanks (2007).

2 In this paper we use the convention of using p(·) to represent continuous distributions, and P(·) to represent discrete distributions and proportions.


Fig. 1. Causal graphical model. The elemental model involves three variables: B, the background cause; C, the candidate cause; and E, the effect of interest. While B is always present, C and E might be present or absent. B and C are assumed to be independent. B is always generative; C could be generative or preventive. Independently, B and C have probabilities w0 and w1, respectively, of influencing E. Their joint functional form is formalized using noisy-OR in the generative case and noisy-AND-NOT in the preventive case.


In the generative case, either B or C can cause E; in the preventive case, only B can cause E, while C may suppress E. Finally, E cannot occur unless B or C caused it. Each cause is assumed to have the power to cause (or prevent) the effect independently, doing so with a probability that reflects its strength. We denote the strengths of B and C as w0 and w1, respectively.

Although the graph structure specifies the causal relationships among variables, the exact nature of those relationships is not clear without specifying their functional forms. Functional forms are the formal specifications associating the states of causes to the probability of different outcomes, and are a key component in the likelihood function. To define the functional form of a causal relationship, we need to make the distinction between generative and preventive causes—generative causes have the power to produce the effect, whereas preventive causes have the power to prevent the effect from occurring. While many early theories of causal induction did not make an explicit distinction between these kinds of causal relationships, more recent research suggested that people reason about generative and preventive causal relationships differently (Dennis & Ahn, 2001; Williams & Docking, 1995). As a consequence, recent models of causal induction have distinct variants for characterizing judgments about generative and preventive causes (Cheng, 1997; Griffiths & Tenenbaum, 2005).

In previous models of causal induction, noisy-OR (for generative causes) and noisy-AND-NOT (for preventive causes) parameterizations have been used to characterize the functional form of causal relationships (Cheng, 1997; Griffiths & Tenenbaum, 2005). Noisy-OR gives the probability of observing the effect E as

p(e^{+} \mid c;\, w_0, w_1) = 1 - (1 - w_0)(1 - w_1)^{c} \quad (1)

where c is a binary value representing the presence or absence of C, with c = 1 for c+ (presence of the candidate cause) and c = 0 for c− (absence of the candidate cause), whereas noisy-AND-NOT gives

p(e^{+} \mid c;\, w_0, w_1) = w_0 (1 - w_1)^{c} \quad (2)

Again, w0 and w1 represent the strengths of B and C, respectively. These formulations reflect the assumptions we made above concerning elemental causal induction.
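To make these functional forms concrete, here is a minimal Python sketch of Eqs. (1) and (2); the function names and the worked numbers are our own illustrative choices, not part of the original model specification.

def noisy_or(c, w0, w1):
    # Eq. (1): probability of e+ given cause state c (1 = present, 0 = absent),
    # background strength w0 and candidate strength w1, for a generative C.
    return 1.0 - (1.0 - w0) * (1.0 - w1) ** c

def noisy_and_not(c, w0, w1):
    # Eq. (2): probability of e+ under a preventive candidate cause C.
    return w0 * (1.0 - w1) ** c

# Example with a weak background cause (w0 = 0.3) and a strong candidate cause (w1 = 0.8):
print(noisy_or(1, 0.3, 0.8))       # P(e+ | c+) = 1 - 0.7 * 0.2 = 0.86
print(noisy_or(0, 0.3, 0.8))       # P(e+ | c-) = w0 = 0.3
print(noisy_and_not(1, 0.3, 0.8))  # P(e+ | c+) = 0.3 * 0.2 = 0.06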

Recent work by Lu et al. (2008) demonstrated how this framework can be used to make accurate predictions about people's estimates of the strength of causal relationships. Here the hypothesis space is represented by the prior probability over w0 and w1. For any particular values of w0 and w1, the likelihood function for the observed contingency data d is

p(d \mid w_0, w_1) = \prod_{e,c} p(e \mid c;\, w_0, w_1)^{N(e,c)} \quad (3)


where p(e | c; w0, w1) is given by the noisy-OR or noisy-AND-NOT as specified above. N(e, c) represents the frequencies of events associated with different combinations of presence and absence of the effect and cause. Each of e and c can be either + or −, representing presence or absence, respectively.3

We can thus compute a posterior distribution over w0 and w1 given d by applying Bayes’ rule, with

p(w_0, w_1 \mid d) \propto p(d \mid w_0, w_1)\, p(w_0, w_1) \quad (4)

where p(w0, w1) represents the prior probability of a (w0, w1) pair. While the posterior probability represents a learner's updated beliefs over the entire hypothesis space, we can find point estimates of the posterior of w0 and w1 by computing the mean of the posterior distribution. For example, the posterior mean of w1, computed based on

\bar{w}_i = \int_0^1 w_i\, p(w_0, w_1 \mid d)\, dw_i \quad \text{for } i \in \{0, 1\} \quad (5)

has been found to be able to approximate closely people's judgments about the strength of causes (Lu et al., 2008).
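As a concrete illustration of how Eqs. (3)–(5) combine, the following Python sketch approximates the posterior mean of w1 on a uniform grid over (w0, w1); the uniform prior, the grid resolution, and the example contingency are arbitrary choices of ours, not the paper's implementation.

import numpy as np

def posterior_mean_w1(n_e_cplus, n_cplus, n_e_cminus, n_cminus,
                      generative=True, grid=101):
    # Grid approximation of Eq. (5), here under a uniform prior over (w0, w1).
    w0, w1 = np.meshgrid(np.linspace(0, 1, grid), np.linspace(0, 1, grid),
                         indexing="ij")
    if generative:                       # noisy-OR, Eq. (1)
        p_plus, p_minus = 1 - (1 - w0) * (1 - w1), w0
    else:                                # noisy-AND-NOT, Eq. (2)
        p_plus, p_minus = w0 * (1 - w1), w0
    eps = 1e-12                          # avoid log(0) at the grid edges
    loglik = (n_e_cplus * np.log(p_plus + eps)                    # Eq. (3)
              + (n_cplus - n_e_cplus) * np.log(1 - p_plus + eps)
              + n_e_cminus * np.log(p_minus + eps)
              + (n_cminus - n_e_cminus) * np.log(1 - p_minus + eps))
    prior = np.ones_like(w0)             # uniform prior; any tabulated prior works here
    post = np.exp(loglik - loglik.max()) * prior                  # Eq. (4)
    post /= post.sum()
    return (post * w1).sum()             # posterior mean of w1, Eq. (5)

# Example: 7 of 16 effects with the cause present, 3 of 16 without.
print(posterior_mean_w1(7, 16, 3, 16, generative=True))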

2.3. Priors on causal strength

Intuitively, a prior distribution on causal strengths encapsulates the learner's expectations about the strengths of causes, which correspond to the probability with which they bring about (or prevent) the effect. In an elemental causal induction model, there are two relevant causes—the background cause B and the candidate cause C—and therefore their joint prior probability is the key to understanding human causal induction. Griffiths and Tenenbaum (2005) assumed uniform priors—the simplest non-informative prior—on both variables in their structure-learning model. Under this parameterization, equal value is assigned to all possible values of w0 and w1, reflecting indifference about their values (Jaynes & Bretthorst, 2003). In spite of its simplicity and lack of free parameters, the uniform model provided a good fit to human judgments about causal structure.4

Lu et al. (2008) applied this approach to model how people reason about the strengths of causes and argued that human causal induction is better explained using a model that incorporates generic priors—a theoretically driven set of prior distributions that are derived from assumptions about the abstract properties of a system. They argued that people favor necessary and sufficient causes, and therefore people consider causal models that have fewer causes (Chater & Vitányi, 2003) and no complex interactions (Novick & Cheng, 2004) to be more likely. As elemental causal induction involves only one background cause and one candidate cause, this suggests that people prefer a causal system in which only one of the two causes is necessary and sufficient to generate the effect. Based on these arguments, Lu et al. (2008) specified the sparse and strong prior (SS prior) as

p(w_0, w_1) \propto e^{-\alpha(w_0 + 1 - w_1)} + e^{-\alpha(1 - w_0 + w_1)} \quad (6)

in the generative case and

p(w_0, w_1) \propto e^{-\alpha(1 - w_0 + 1 - w_1)} + e^{-\alpha(1 - w_0 + w_1)} \quad (7)

in the preventive case. α in the formulae is a free parameter that represents how strongly one believes that causes are sparse and strong. If α is set to 0 then the SS priors are identical to a uniform distribution.

3 Following Griffiths and Tenenbaum (2005), we use a notation that semantically encodes the presence or absence of the effects and causes—e and c represent the effect (or outcome) and candidate cause, and the superscripts + and − represent their presence and absence, respectively. For example, the number of instances in which the effect is present but the cause is absent would be represented by N(e+, c−). In some cases, it is convenient to use the relative frequencies of the outcomes. These can be obtained from the N(e, c) values. For example, the relative frequency of the effect given the presence of the cause is P(e+|c+) and can be calculated by N(e+, c+)/N(c+), where N(c+) denotes the number of cases in which the cause is present, regardless of whether the effect is present or not. Similarly, N(c−) denotes the number of cases in which the cause is absent, and the relative frequency P(e+|c−) can be calculated by N(e+, c−)/N(c−).

4 For the ease of exposition, in the remainder of this paper, we will refer to Bayesian models that use different priors as different models, such as "the uniform model" instead of "the Bayesian model using a uniform prior".


Lu et al. (2008) fixed α at 5 in their analysis, based on an informal grid search using a prior data set.
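For reference, the SS priors of Eqs. (6) and (7) can be tabulated on the same kind of grid and normalized, as in the sketch below (with α = 5 following Lu et al., 2008); the helper name is ours.

import numpy as np

def ss_prior(generative=True, alpha=5.0, grid=101):
    # Tabulate Eq. (6) (generative) or Eq. (7) (preventive) on a (w0, w1) grid
    # and normalize the values to sum to 1.
    w0, w1 = np.meshgrid(np.linspace(0, 1, grid), np.linspace(0, 1, grid),
                         indexing="ij")
    if generative:
        p = np.exp(-alpha * (w0 + 1 - w1)) + np.exp(-alpha * (1 - w0 + w1))
    else:
        p = np.exp(-alpha * (1 - w0 + 1 - w1)) + np.exp(-alpha * (1 - w0 + w1))
    return p / p.sum()

# This grid can be substituted for the uniform prior in the earlier posterior sketch.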

The SS priors are plotted as two-dimensional histograms in Fig. 2. Analogous to one-dimensional histograms, these figures indicate the probability density at different values of the underlying variables (w0 and w1). From these figures we can observe the main features of the SS priors. In the generative case, the probability is high when one of the causes, either B or C, is very strong and the other is very weak; in the preventive case, the probability is high when B is very strong and C is either very strong or very weak.

The SS model provided a good fit to human causal strength judgments, and produced predictions consistent with qualitative effects that a model with a uniform prior does not produce (Lu et al., 2008). This finding suggests that exploring the space of possible prior distributions might be a valuable way of gaining deeper insight into human causal induction, as well as better predictions of human behavior. However, prior distributions on causal strength could take any form that corresponds to a probability distribution on two variables ranging from 0 to 1, and need not correspond to a distribution from a simple parametric family. This means that the SS priors represent only one possibility out of uncountably many. Evaluating all possible priors is thus infeasible. Ideally, the prior distribution should be estimated without having to make a priori assumptions about its form. In the next section we explore how the technique of iterated learning might be used to resolve this issue.

3. Iterated learning as a method for estimating priors

Iterated learning was originally proposed as a model of cultural transmission of languages (Kirby, 2001). It refers to a process in which a sequence of agents each learns from data generated by the previous agent. In the simplest such model we imagine a chain of agents, where each agent observes data generated by the previous agent (such as a set of utterances), forms a hypothesis about the process that generated those data (such as a language), and then generates new data to pass to the next agent.

Formally, the n-th agent in the chain observes data d(n) and forms a hypothesis h(n) about the process that generated those data, then goes on to generate data d(n+1), which is given to the next agent, and so on. This iterative procedure defines a Markov chain—a sequence of random variables in which each variable's transition depends only on the current state—on hypothesis-data pairs.


Fig. 2. Sparse and strong (SS) priors (Eqs. (6) and (7)) proposed by Lu et al. (2008). Each panel shows the prior distribution for either the generative or the preventive cause. The two horizontal axes correspond to the strength of the background cause (w0) and the candidate cause (w1). The vertical axis indicates the relative probability density at each (w0, w1) pair. Note that the density was normalized before plotting (i.e., the volume under the surface is 1). Therefore the density values do not match those given directly by the formulae.


That is, each step of the Markov chain here consists of the hypothesis about the causal system and the data produced, and the hypothesis-data pair at each step depends only on its immediate previous iteration.

Furthermore, if the agents select a hypothesis by sampling from the posterior distribution p(h|d) ∝ p(d|h)p(h) and then generate data by sampling from the corresponding likelihood function p(d|h), then this Markov chain forms a Gibbs sampler for the joint distribution p(d, h) = p(d|h)p(h) (Griffiths & Kalish, 2007). A Gibbs sampler is a specific kind of Markov chain, commonly used by computer scientists and statisticians to generate samples from a multivariate probability distribution that might otherwise be difficult to sample from (Gilks, Richardson, & Spiegelhalter, 1996).

More formally, a Gibbs sampler provides a way to sample from the joint probability distribution for a set of variables X = (x1, ..., xn). This procedure involves an initial stage and a number of iterative steps. We first start with some initial values for each of the variables, and then at each step, sample the value of one of the variables in the target distribution by drawing a sample from the distribution of this variable, conditioned on the current values of all other variables. That is, at each iteration, the algorithm cycles through the variables and draws a new value for each variable xi from the conditional distribution p(xi | X−i). This process constructs a Markov chain that, under mild assumptions, will ultimately converge to its stationary distribution p(X), at which point samples can be taken as an approximation of the target distribution.
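As a standard illustration of this procedure, the sketch below runs a two-variable Gibbs sampler for a bivariate normal distribution with correlation rho, repeatedly redrawing each coordinate from its conditional distribution given the other; the target distribution is our choice, picked only because its conditionals have a simple closed form.

import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                       # correlation of the target bivariate normal
x1, x2 = 0.0, 0.0               # arbitrary initial values
samples = []
for step in range(10000):
    # x1 | x2 ~ Normal(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    samples.append((x1, x2))
samples = np.array(samples[1000:])       # discard early samples as burn-in
print(np.corrcoef(samples.T)[0, 1])      # close to rho once the chain has converged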

In an iterated learning experiment, the procedure of alternately sampling hypotheses and data mimics the structure of the Gibbs sampler, allowing the same convergence results to be applied (Griffiths & Kalish, 2007). In this case, samples of h after convergence can be taken as samples generated from the prior distribution p(h). In other words, the probability that a participant selects a particular hypothesis will converge to her prior. Carrying out this process of iterated learning in experiments thus enables us to empirically estimate people's priors. This result holds true regardless of what data are provided to the subjects in the first iteration.

The convergence of iterated learning to the prior suggests that it can be used as a method for exploring people's expectations about different hypotheses. Previous experiments using iterated learning, with data being passed between people, support this idea: functions (Kalish, Griffiths, & Lewandowsky, 2007), concepts (Griffiths et al., 2008), and color terms (Xu, Dowman, & Griffiths, 2013) transmitted through iterated learning quickly converge to forms that are consistent with priors established in previous research. Moreover, there is no need for data to be transmitted between people for this to occur—a feedback process that has the appropriate statistical structure can be established for single individuals. In these within-subjects iterated learning experiments, participants form hypotheses about data that are generated based on their own responses on previous trials. This experimental design has previously been used to explore people's inductive biases in concept learning and memory, producing equivalent results to a between-subjects design (Griffiths et al., 2008; Xu & Griffiths, 2010).

In this paper we use the iterated learning approach in experiments to directly estimate the prior distributions of the strength parameters (w0 and w1) in people's subjective beliefs concerning causal systems. This method could inform recent discussion about appropriate prior distributions to use in Bayesian models of causal learning, and result in more accurate predictions of human judgments. In particular, iterated learning allows us to identify people's expectations about the strength of causal relationships without needing to make any assumptions about the form of the corresponding prior distributions. We will first demonstrate how this approach can be used to identify participants' prior beliefs in a simple biological scenario.
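The following sketch simulates a single within-subjects chain of the kind used in Experiment 1, with an idealized Bayesian learner standing in for the participant; the learner's prior is assumed to be uniform purely for illustration, the sample size of 16 matches the experiment, and all names are ours.

import numpy as np

rng = np.random.default_rng(0)
GRID = np.linspace(0, 1, 101)

def sample_hypothesis(n_e_cplus, n_e_cminus, n=16):
    # Idealized learner: sample (w0, w1) from the posterior under a uniform prior,
    # assuming a generative (noisy-OR) causal system.
    w0, w1 = np.meshgrid(GRID, GRID, indexing="ij")
    p_plus, p_minus = 1 - (1 - w0) * (1 - w1), w0
    eps = 1e-12
    loglik = (n_e_cplus * np.log(p_plus + eps)
              + (n - n_e_cplus) * np.log(1 - p_plus + eps)
              + n_e_cminus * np.log(p_minus + eps)
              + (n - n_e_cminus) * np.log(1 - p_minus + eps))
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    idx = rng.choice(post.size, p=post.ravel())
    return w0.ravel()[idx], w1.ravel()[idx]

# One chain: hypothesis -> data -> hypothesis -> ..., as in Experiment 1.
w0, w1 = 0.3, 0.7                                # initial parameterization of the chain
for iteration in range(12):
    n_e_cplus = rng.binomial(16, 1 - (1 - w0) * (1 - w1))   # N(e+, c+)
    n_e_cminus = rng.binomial(16, w0)                       # N(e+, c-)
    w0, w1 = sample_hypothesis(n_e_cplus, n_e_cminus)
# After convergence, (w0, w1) pairs drawn this way approximate samples from the prior.
print(w0, w1)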

4. Experiment 1: Using iterated learning to estimate human priors on causal strength

The objective of Experiment 1 is to estimate people's priors in elemental causal induction using the technique of iterated learning. The priors obtained, representing the experiment participants' expectations about the strengths of the causes, can be used to evaluate previous claims about human prior beliefs.


4.1. Methods

4.1.1. Participants
A total of 432 participants were recruited from the University of California, Berkeley subject pool and Amazon Mechanical Turk (MTurk). Collecting data from MTurk in addition to a more traditional university subject pool allowed us to obtain data that are more representative of the entire population than just the university sample, and to run a larger scale experiment. We recruited only MTurk workers from within the United States. Recent studies have shown MTurk to be a reliable source of experimental data (Buhrmester, Kwang, & Gosling, 2011; Sprouse, 2011) and many classical experiments were successfully replicated using this service (Crump, McDonnell, & Gureckis, 2013). Participants from the university subject pool received course credit, while MTurk participants received a payment of US$1.⁵ Only data from participants who completed at least 95% of the trials were included in the analysis. In each of the generative and preventive conditions, there were 36 participants from the university subject pool and 180 from MTurk.

In the university subject pool, 55% of participants were female and the mean age was 20.1. In the MTurk subject pool, 49% were female. With MTurk participants, we requested age information in brackets. The biggest age group was 23–35, with 50% of our participants, and 78% had at least some college education. The demographics of the MTurk subject pool were thus similar to our university subject pool, other than their slightly higher age, and are also commensurate with other published reports on the demographics of MTurk workers (Ross, Irani, Silberman, Zaldivar, & Tomlinson, 2010).

4.1.2. Stimuli and procedure
The experiment was run in a web browser. We used a biological cover story that most closely resembled one that was used in Lu et al. (2008), but cover stories of similar themes have been used in many previous studies of causal induction (e.g., Buehner, Cheng, & Clifford, 2003; Griffiths & Tenenbaum, 2005; Novick & Cheng, 2004). We chose to use a similar cover story for two reasons. First, as we are using the novel experimental technique of iterated learning, having a well-tested cover story and associated stimuli could spare us from having to evaluate multiple new experimental elements at the same time. Second, reusing the same materials allows better comparison between different models, as most of these models were originally evaluated based on experiments using similar stimuli.

The experiment was presented in the context of a bio-technology company testing the influence of various proteins on gene expression. In the generative condition, participants read:

In this experiment, please imagine that you are a researcher working for a bio-technology company and you are studying the relationship between genes and proteins concerning gene expression. Gene expression is a process that controls the structure and functions of cells or other genes. This process may or may not be modulated by the presence of proteins. You are given a number of gene/protein pairs and your job is to make predictions concerning the effect of these proteins on gene expression.

There are a number of trials in this experiment and each trial involves a different gene and a different protein. In each trial, you will be given information about some past results involving this gene/protein pair and you will be asked to make some predictions based on this information. The past results consist of two samples: 1) a sample of DNA fragments that had not been exposed to the protein, and 2) a sample of DNA fragments that had been exposed to the protein. The number of DNA fragments that resulted in gene expression in each of these samples will be shown to you. Because there are many causes of gene expression, other factors besides the presence or absence of the protein might play a role in whether the gene is expressed or not.

Participants then received instructions familiarizing them with the controls that they would use in the experiment. Only one practice trial was used in order to minimize the effect of practice on participants' expectations, as each practice trial involves a certain set of contingencies that could influence participants' beliefs about the novel causal system. Each trial was presented on a separate screen.

5 The amounts of payment for experiments in this paper were comparable to similar experiments on MTurk (Mason & Suri, 2012).


Fig. 3. Screenshot of the generative condition of the experiment. The participant is assessing the strength of the cause C, w1. In this particular trial, N(e+, c−)/N(c−) = 3/16 and N(e+, c+)/N(c+) = 7/16.


Fig. 3 shows a screenshot of the experiment. In each trial participants saw data in the form of two samples, one that was not exposed to the protein (N(e, c−)) and one that was (N(e, c+)). The data were presented graphically, displaying the total number of DNA fragments in each sample as well as the number that expressed the gene, thus providing complete contingency data.

After observing these contingencies, participants were asked to make two judgments involving hypothetical samples. The instructions for the two judgments were:

Suppose that there is a sample of 100 DNA fragments and these fragments were not exposed to the protein, in how many of them would the gene be turned on?

Suppose that there is a sample of 100 DNA fragments and that the gene is currently off in all those DNA fragments. If these 100 fragments were exposed to the protein, in how many of them would the gene be turned on?

These questions were similar to those used in previous research on causal strength judgments for eliciting judgments of w0 and w1, respectively (e.g., Lu et al., 2008). Participants responded using a slider, the initial value of which was drawn from a uniform distribution between 0 and 100. Live feedback showing the proportion of expressed genes was shown as the slider was moved. Past studies have shown that graphically presented information can better convey probabilistic data (Edwards, Elwyn, & Mulley, 2002). Participants could adjust the slider until they were satisfied with their response, before clicking the submit button to record their response and continue to the next trial. The instructions for the preventive condition were similar, except that for the second judgment, the instructions were:

Suppose that there is a sample of 100 DNA fragments and that the gene is currently on in all those DNA fragments. If these 100 fragments were exposed to the protein, in how many of them would the gene be turned off?

A within-subjects iterated learning design was used. There were four chains, each with 12 iterations, for a total of 48 trials. The data for the first iteration of each chain were generated based on the initial values of (w0, w1) = (0.3, 0.3), (0.3, 0.7), (0.7, 0.3), and (0.7, 0.7). These values were chosen to be distinct so that the responses from different chains could be used to diagnose convergence.


In each trial a set of contingency data consisting of 16 cases in which the cause was present and 16 in which the cause was absent was shown. The number of times that the effect occurred was generated via a binomial draw with parameter p(e+|c; w0, w1), using the noisy-OR functional form (Eq. (1)) in the generative case and noisy-AND-NOT (Eq. (2)) in the preventive case. In all trials other than the first one in each chain, data were generated based on the w0 and w1 values the participant produced in the previous trial in the same chain. These values were taken directly from the estimates that the participants produced in response to the two questions about hypothetical samples. For example, if on iteration n of a particular chain the participant's responses were f0 and f1 (out of 100) for the two questions, then the data presented at iteration n+1 of the same chain would be drawn using w0 = f0/100 and w1 = f1/100. Additionally, the order of the trials was randomized after stratification by iteration. For example, the first four trials for each participant correspond to the first iteration of each of the four chains, but the order among these four trials was randomized for each subject. Similarly, the fifth to the eighth trials correspond to the second iteration for each of the four chains, and so on.
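A sketch of the data-generation step just described, in which the contingency for iteration n+1 of a chain is drawn from the participant's two slider responses at iteration n, is given below; the function and variable names are ours.

import numpy as np

rng = np.random.default_rng()

def next_stimulus(f0, f1, generative=True, sample_size=16):
    # f0, f1: the participant's responses (out of 100) to the two questions
    # on the previous trial of this chain.
    w0, w1 = f0 / 100.0, f1 / 100.0
    p_minus = w0                                                        # P(e+ | c-)
    p_plus = 1 - (1 - w0) * (1 - w1) if generative else w0 * (1 - w1)   # Eq. (1) or (2)
    n_e_cminus = rng.binomial(sample_size, p_minus)   # effects without the protein
    n_e_cplus = rng.binomial(sample_size, p_plus)     # effects with the protein
    return n_e_cminus, n_e_cplus

# e.g., responses of 30 and 70 (out of 100) on the previous trial of this chain:
print(next_stimulus(30, 70, generative=True))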

4.2. Results

We first compared the results between the university and MTurk subject pools. The numbers of contingencies in the data were not fixed because of the stochasticity in the sampling of stimuli. There were 181 and 165 contingencies in the generative and preventive conditions, respectively, for which there were data in both pools. We ran a Mann–Whitney U test on these contingencies and found that there were significant differences (with p < .05) between the two subject pools in 10 and 5 contingencies, respectively. None of these differences remained significant when a Bonferroni correction was applied. Therefore results from these two sources were combined.

4.2.1. Analysis of convergence
Before we computed the prior distributions from the empirical data, we tested whether the chains had indeed reached convergence. Because of the divergent initial values of the four chains, we expected the human judgments for both w0 and w1 to be significantly different across chains with different initial values in the first few iterations. As the iterated learning experiment progresses, the hypothesis-data pairs should eventually converge to the prior distribution and will no longer be significantly different across chains. We therefore compared the human judgments between initial value conditions to diagnose whether the chains had converged. This convergence diagnosis method is similar to those used in Markov chain Monte Carlo methods in statistics (Gilks et al., 1996).
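One way to implement this diagnosis is sketched below, under the assumption that judgments are stored per group of chains and per iteration; the paper reports one-way ANOVAs, and a scipy equivalent is used here, with array names of our own choosing.

from scipy.stats import f_oneway

def first_converged_iteration(judgments_low, judgments_high, alpha=0.05):
    # judgments_low / judgments_high: numpy arrays of shape (n_responses, n_iterations)
    # holding judgments from chains started at low vs. high initial values.
    for t in range(judgments_low.shape[1]):
        f_stat, p_value = f_oneway(judgments_low[:, t], judgments_high[:, t])
        if p_value >= alpha:     # the two groups are no longer distinguishable
            return t
    return None                  # chains still differ at the final iteration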

In the first iteration the judgments for both w0 and w1 were significantly different, indicating that, as we had expected, the chains had not converged. The result was quite different for the final (12th) iteration. In the generative condition, there were no significant differences between chains with different initial w1 values (F(1, 860) = 0.47, MSE = 519.6, p = 0.49). However, the w0 judgments from chains with higher initial w0 values remained higher than those from chains with lower initial w0 values (F(1, 860) = 58.31, MSE = 79,945, p < 0.001). The results were similar in the preventive condition. The w1 judgments from chains with different initial values were not significantly different (F(1, 860) = 0.06, MSE = 65.0, p = 0.81), while the w0 judgments were (F(1, 859) = 61.00, MSE = 83,128, p < 0.001). The means and standard deviations of the chains at the final iteration are shown in Table 1.

The significant effect of initial values for w0 but not w1 in both conditions indicates that the w1 chains had converged while the w0 chains still retained some influence from the initial values. However, we believe that samples from the final (12th) iteration give a fairly good approximation of the prior distributions, for two reasons. First, while the w0 chains did not converge, they did move in the expected directions towards convergence. It is likely that they would converge given an experiment with more iterations.6

6 While it is possible to collect human data for more iterations until w0 converges, a mathematical analysis suggests that this might not be practical. We conducted a simulation using the uniform prior to estimate how long it would take to converge, using the same initial w0 and w1 values as the experiment. Using the same criteria for convergence as in the experiment, we found that on average, it took about 30 iterations for w0 to converge. With four chains (to account for low and high initial values of w0 and w1), it would take a total of 120 trials, which might be so many that boredom and fatigue could become a factor.


Second, as the focus of this experiment is on human judgments of the strength of the candidate cause, which is most strongly influenced by the prior distribution of w1, the incomplete convergence of w0 presents a more minor issue than if w1 had not converged. In fact, the w1 chains in both causal directions converged by the third iteration, suggesting that humans have fairly strong priors about the strength of the candidate cause.

4.2.2. Estimation of the empirical priors and discussion
To construct the empirical priors, we treated judgments of w0 and w1 in the final iteration as samples from participants' prior distributions. Because there were four chains per participant, each participant contributed four data points, each in the form of a (w0, w1) vector. We aggregated the (w0, w1) vectors from all participants and performed a two-dimensional density estimation. The top row of Fig. 4 shows a smoothed estimate of the density of the empirical priors over w0 and w1. The smoothing was performed via kernel density estimation with a bivariate normal kernel (Venables & Ripley, 2002).
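A comparable smoothing step can be sketched in Python with scipy's Gaussian kernel density estimator (the analysis reported here used the bivariate-normal-kernel estimator of Venables & Ripley, 2002; the scipy routine is a stand-in); the toy data below merely stand in for the final-iteration judgments.

import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the final-iteration judgments, rescaled to [0, 1].
rng = np.random.default_rng(1)
w0_samples = rng.uniform(0, 1, 400)
w1_samples = rng.beta(4, 1, 400)      # skewed toward strong candidate causes

kde = gaussian_kde(np.vstack([w0_samples, w1_samples]))
grid = np.linspace(0, 1, 101)
xx, yy = np.meshgrid(grid, grid, indexing="ij")
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
empirical_prior = density / density.sum()    # tabulated prior, usable in Eq. (4)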

The resulting empirical prior distributions, shown in Fig. 4, have complex forms that do not correspond to parametric probability distributions. Moreover, their forms are quite different from previous proposals about people's prior distributions, such as the SS priors. Some major differences between the empirical and the SS priors can be observed by visual inspection. First, the SS theory suggests that people expect either the candidate cause or the background cause to be strong. Contrary to this assumption, in the empirical priors the density is much higher in regions where w1 is high, in both causal directions. In other words, participants have a strong prior expectation of a strong candidate cause, regardless of whether the background cause is strong or not.

Second, the SS priors are relatively straightforward probability distributions. Here probability is highest at the two peaks (which have the same density), and from there the probability decreases monotonically. Moreover, in both the generative and preventive cases, the two non-peak corners have the lowest prior probability of the entire distribution. For example, in the generative case, the peaks are at (w0 = 0, w1 = 1) and (1, 0) while the other two corners, (0, 0) and (1, 1), have the lowest prior probability of the entire space, and similarly in the preventive case. In contrast, the empirical priors have multiple local maxima and non-monotonicities. Additionally, all four corners in the empirical priors have higher probabilities than their neighboring areas, in both the generative and preventive cases. The four corners of the probability space are associated with deterministic causal systems—if w0 and w1 are either 0 or 1, then the outcome is controlled deterministically based on the values of w0, w1, and c. This pattern found in the empirical priors is consistent with findings from previous research on human causal induction that people have a tendency to assume causal relationships to be deterministic (Frosch & Johnson-Laird, 2011; Griffiths & Tenenbaum, 2009; Schulz & Sommerville, 2006). Overall, the results based on the empirical priors seem to be at variance with the assumption that people have a sparse and strong prior.

Having empirically identified the priors on causal strength, we can now use them to predict how people reason about causes. If the empirical priors are indeed better characterizations of a population's expectations about causal systems, then a model using these priors should perform better than other models in capturing their judgments about causal strength. However, we will first turn to a discussion about how models should be evaluated. Most previous comparisons of computational models have focused on specific sets of contingencies that are designed to contrast the predictions made by different models. For example, Lober and Shanks (2000) compared the performance of the ΔP model and the causal power model by selecting sets of contingencies in which one model would predict the same value for all trials in a set while the other model would predict different values.

Table 1
Mean and standard deviation (in parentheses) of the human judgments in the final iteration, separated by initial parameterization.

Chain   Causal direction   Small initial condition   Large initial condition
w0      Gen.               39.95 (37.05)             59.19 (36.94)
w0      Prev.              37.08 (36.71)             56.71 (37.06)
w1      Gen.               80.81 (32.73)             79.26 (33.80)
w1      Prev.              79.56 (33.48)             80.11 (33.03)



Fig. 4. Smoothed empirical estimates of human priors on causal strength produced by iterated learning (top row). The SS priors, shown previously in Fig. 2, are reproduced in the bottom row for easier comparison.


The pattern of human judgments, especially whether they are similar or different across these trials, can then be used to lend support to either of the models under investigation. A similar approach has been used in many other studies (e.g., Allan & Jenkins, 1983; Shanks, Lopez, Darby, & Dickinson, 1996).

While using contingencies that highlight the differences between predictions by different models is a good way to understand the kinds of effects these models can accommodate, this approach creates the side effect of focusing on certain contingencies while overlooking others. This problem not only affects individual experiments and the papers based on these experiments, but also presents a challenge for meta-analyses of causal learning, as they use these papers as raw data. The coverage of contingencies in the meta-analysis conducted by Perales and Shanks (2007) provides a nice illustration of this unbalanced sampling. Their analysis included data from nine published papers, covering 19 experiments and 114 contingencies. However, as different papers had different research objectives and therefore used different contingencies to answer the respective research questions, simply aggregating these studies does not necessarily result in a balanced coverage of all possible contingencies. First, there were more contingencies for generative causes than preventive ones.


If we consider both experimental instructions and the ratio of the contingency as cues to the causal direction, then there were twice as many generative contingencies (60) as preventive ones (30) among the 114 data points (with 20 underspecified and 4 undeterminable). More importantly, contingencies with certain characteristics were more frequently represented. In 45% of the contingencies either P(e+|c+) or P(e+|c−) had the value of 0 or 1 (that is, the effect was always or never observed when the cause was either present or absent), in 33% of them either P(e+|c+) or P(e+|c−) was equal to 0.5 (the effect was observed in half the cases either when the cause was present or absent), and in 28% of them P(e+|c+) = P(e+|c−) (whether the cause was present or not did not change the percentage of effect). While the 114 contingencies overall did cover the entire contingency space reasonably well, the coverage was highly skewed towards contingencies with these specific characteristics. Perales and Shanks (2007) themselves commented on this issue, noting that "researchers tend to design and run new experiments to distinguish between theories (often only between two theories), focusing on the parts of the experimental space where their favored theory is a priori likely to be most successful" (Perales & Shanks, 2007, p. 577).

On the basis of these considerations, it seems that a broad set of stimuli that is agnostic to the researchers' hypotheses is needed in order to more comprehensively evaluate and compare the goodness of the predictions of various models. Therefore in Experiment 2 we carried out such an experiment, collecting the largest and most systematic dataset on human causal induction to date. We then used the results to compare the performance of the Bayesian models using different priors.

5. Experiment 2: A benchmark data set for causal strength judgments

Experiment 2 was designed to collect a benchmark data set for evaluating different models. To avoid the sampling bias discussed above, we conducted a large-scale study in which we collected human judgments of causal strength over the entire space of plausible contingencies, adopting an expanded version of a design previously used by Buehner et al. (2003). We aimed to include a broad representation of different contingencies by covering the entire range of values of P(e+|c+) and P(e+|c−) uniformly.

Another objective of Experiment 2 was to verify whether the empirical priors can be applied to contingencies with different sample sizes from those used in deriving the priors. For the empirical priors to be useful, they need to capture abstract expectations about causal relationships so that they can be applied outside of the specific setting in which they were derived. In Experiment 1, both N(c+) and N(c−) were fixed at 16. Thus in Experiment 2, we used N(c+) and N(c−) values of 8, 16, and 32, and systematically crossed them to produce more varied contingencies.

5.1. Methods

5.1.1. Participants
Similar to Experiment 1, participants were recruited from both the University of California, Berkeley subject pool and from Amazon Mechanical Turk. Participants from the subject pool received course credit and MTurk participants received a payment of US$1. Only data from participants who completed at least 95% of the trials were included in the analysis. Subjects were randomly assigned to either the generative or the preventive condition. In each of the generative and preventive conditions, there were 36 participants from the university subject pool and 180 from MTurk, for a total of 216 in each condition (and a grand total of 432). The results from the two samples were similar to each other, and were thus combined as in Experiment 1.

5.1.2. Stimuli
The stimuli used in Experiment 2 were similar to those in Experiment 1. Participants were given two samples—indicating the values of N(e+, c+) and N(c+) in one, and N(e+, c−) and N(c−) in the other—and were asked to make predictions about another sample. In Experiment 1, the sample sizes N(c+) and N(c−) were fixed to 16, whereas in Experiment 2, sample sizes of 8, 16, and 32 were crossed with each other, forming 3 × 3 = 9 sample size conditions. We will denote these sample size conditions by ⟨N(c+), N(c−)⟩. For example, ⟨8, 16⟩ represents a set of stimuli in which N(c+) = 8 and N(c−) = 16.


In each sample size condition, contingencies were parameterized so that they are uniformly distributed in the contingency space. More specifically, for both the c+ and c− samples, we systematically varied the number of cases in which the effect is present (N(e+, c+) and N(e+, c−)). There were nine levels of N(e+, c+) and N(e+, c−) in each sample size condition, with N(e+) going from 0% to 100% of the sample size in increments of 12.5%. For a sample size of 8, the nine levels included every integer from 0 to 8. For sample sizes of 16 and 32, the stimuli were in increments of 2 and 4, respectively. That is, for a sample size of 16, values of 0, 2, 4, ..., 14, 16 were used; for a sample size of 32, values of 0, 4, 8, ..., 28, 32 were used. Note that the number of cases in which the effect is absent is just the complement of the above, as N(e−, c+) = N(c+) − N(e+, c+) and N(e−, c−) = N(c−) − N(e+, c−).

Some contingencies are a priori unlikely to be observed. For example, in the generative case, a contingency of {N(e+, c+)/N(c+), N(e+, c−)/N(c−)} = {3/16, 10/16} is very unlikely to lead to an inference of C having any causal influence, because P(e+|c+) is much smaller than P(e+|c−). To address this, the generative condition included only contingencies in which N(e+, c+)/N(c+) ≥ N(e+, c−)/N(c−), i.e., P(e+|c+) ≥ P(e+|c−). Similarly, the preventive condition included only contingencies in which N(e+, c−)/N(c−) ≥ N(e+, c+)/N(c+), i.e., P(e+|c−) ≥ P(e+|c+). To give a concrete example, consider the sample size condition ⟨8, 16⟩ in the generative case. N(e+, c+) varied over 0, 1, 2, ..., 8, while N(e+, c−) varied over 0, 2, 4, ..., 16. As only contingencies in which N(e+, c+)/N(c+) ≥ N(e+, c−)/N(c−) were included, the contingencies were {0/8, 0/16}, {1/8, 0/16}, {1/8, 2/16}, {2/8, 0/16}, {2/8, 2/16}, {2/8, 4/16}, {3/8, 0/16}, ..., {8/8, 16/16}. Because there were nine levels in both N(e+, c+) and N(e+, c−), there were 1 + 2 + ... + 8 + 9 = 45 possible contingencies in each sample size condition. Multiplying that by the nine sample size conditions in each causal direction, the total number of contingencies was 45 × 9 = 405 in each of the generative and preventive directions.
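The stimulus construction described above can be reproduced programmatically; the short sketch below (for the generative direction, with names of ours) recovers the 405 contingencies.

from itertools import product

def levels(n):
    # Nine levels of N(e+, .) from 0% to 100% of the sample size in 12.5% steps.
    return [n * k // 8 for k in range(9)]

sample_sizes = [8, 16, 32]
generative_stimuli = []
for n_plus, n_minus in product(sample_sizes, sample_sizes):      # 9 <N(c+), N(c-)> conditions
    for e_plus, e_minus in product(levels(n_plus), levels(n_minus)):
        if e_plus / n_plus >= e_minus / n_minus:                 # P(e+|c+) >= P(e+|c-)
            generative_stimuli.append((n_plus, e_plus, n_minus, e_minus))

print(len(generative_stimuli))   # 405 = 45 contingencies x 9 sample size conditions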

Each participant was assigned to either the generative or preventive condition randomly. To prevent fatigue or boredom on the part of the participants from affecting experimental results, and to ensure that equal numbers of judgments were elicited for all possible contingencies, each set of 405 trials was divided among a set of nine participants, with each participant assigned 45 trials. Within each set of nine participants, the contingencies were stratified so that each participant was given the same number of stimuli from each sample size condition, i.e., five from each of ⟨8, 8⟩, ⟨8, 16⟩, ⟨8, 32⟩, ⟨16, 8⟩, etc. Except for the stratification, stimulus assignment and order were randomized. As mentioned, there were 216 participants in each causal direction, resulting in 216/9 = 24 sets of participants, and therefore 24 data points at each contingency.

5.1.3. Procedure

The procedure was the same as that of Experiment 1 with the following exceptions. To focus on the judgment of the strength of the candidate cause, we removed the question concerning the background cause. This resulted in only one prediction task in each trial. Additionally, the default location of the slider was set to 0 in the generative case and 100 in the preventive case (as opposed to having random initial positions as in Experiment 1).

5.2. Results

This experiment was designed to collect 24 human judgments on each of the 405 different contingencies from a total of 216 participants (36 from the subject pool and 180 from MTurk), in each causal direction. One human judgment was missing in the generative condition due to a computer error. As a result, 9719 and 9720 human judgments were recorded in the generative and the preventive case respectively.

We computed the predictions of the three Bayesian models for each set of contingencies, and then compared them to people's judgments. Model predictions were computed by approximating the posterior distribution over a grid of uniformly spaced values of w0 and w1. We first computed the prior probability at each grid point, and then multiplied these prior values by the likelihood at each grid point. Finally, we summed out w0 and took the posterior mean of w1 (Eq. (5)) as each model's prediction of human judgments (Lu et al., 2008). Except for the priors, the rest of the computations were the same across the three Bayesian models. Details of how these quantities were computed are provided in Appendix A.

Fig. 5. Human judgments in the ⟨8, 8⟩ sample size condition compared against model predictions using the SS and empirical priors. The generative case is plotted in the upper left and the preventive case in the lower right. Human judgments are plotted as a histogram in gray, broken into 10 bins; predictions based on the SS and empirical priors are plotted in dotted and solid lines respectively. Contingencies that did not appear in the experiment (P(e+|c+) < P(e+|c−) if generative and P(e+|c+) > P(e+|c−) if preventive) are not shown.

We use one of the sample size conditions to illustrate a typical pattern in the results. Fig. 5 contrasts subjects' judgments of the strength of C against predictions by the SS and empirical models in both the generative and the preventive case at the ⟨8, 8⟩ sample size condition. Two properties of the Bayesian models are apparent from visual inspection of this figure. First, both the SS model and the empirical model make predictions that match well with the human responses. Second, the predictions made by the two Bayesian models are not radically different, as can be seen by comparing the density curves for the SS and empirical models. This is not surprising as these Bayesian models differ only in their priors. In particular, at contingencies in which N(e+, c+)/N(c+) is close to 0/8 or 8/8, human judgments are strongly guided by their observations and therefore the two models' predictions were quite similar. However, the models do differ in the predictions they make for cases with more mixed outcomes, in which the data did not provide strong evidence about the causal relationship. This can be seen in the middle rows (when N(e+, c+)/N(c+) is close to 4/8), where the model predictions are more different from each other.7

We used four metrics to evaluate the fit between model predictions and human judgments. The metrics were Pearson's correlation coefficient (r), Spearman's rank-order correlation coefficient (ρ), root-mean-square deviation (RMSD), and a normalized version of RMSD (normalized RMSD). The correlations r and ρ compare the mean human judgments with model predictions and indicate how well they correlate with each other. These two metrics focus on the relative values of human judgments and model predictions, and do not account for the absolute difference in values. RMSD takes into account the absolute differences between judgments and predictions and is calculated as

7 We performed a post hoc analysis comparing the performance of the empirical prior for different sample size conditions, and found that there is no significant difference.


Table 2
Comparison of model performance based on Experiment 2.

Causal direction   Metric          Uniform   SS      Empirical
Generative         r (Pearson)     0.88      0.67    0.90
Generative         ρ (Spearman)    0.90      0.67    0.92
Generative         RMSD            15.77     24.91   13.49
Generative         Norm. RMSD      12.06     17.79   10.30
Preventive         r (Pearson)     0.87      0.87    0.91
Preventive         ρ (Spearman)    0.88      0.89    0.92
Preventive         RMSD            17.62     18.34   13.49
Preventive         Norm. RMSD      16.62     17.03   12.40
Combined           RMSD            16.69     21.63   13.49
Combined           Norm. RMSD      14.34     17.41   11.35

Note: Higher values for r and ρ, and lower values for RMSD and normalized RMSD, indicate better performance. The empirical model had the best performance based on all metrics.


RMSD = \sqrt{\frac{\sum_i (x_i - \hat{x}_i)^2}{n}}    (8)

where subscript i indexes different contingencies, x_i represents the means of the human responses, \hat{x}_i the model predictions, and n the number of such trials in the experiment, which is 24 in all contingencies.

While the standard RMSD weighs all trials equally, normalized RMSD adjusts for the variance of the human responses—contingencies with less variability among the responses indicate higher agreement and less uncertainty among the participants about their responses, and as a result, are weighted more heavily. The normalized RMSD is calculated as

normalized RMSD = \sqrt{\frac{\sum_i (x_i - \hat{x}_i)^2 / s.e._i}{\sum_j 1 / s.e._j}}    (9)

where s.e._i is the standard error of the responses at contingency i.8
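To make Eqs. (8) and (9) concrete, the following sketch (Python/NumPy; the array names are ours) computes both metrics from the mean human judgments, the model predictions, and the per-contingency standard errors. It illustrates the formulas and is not the authors' analysis code.

import numpy as np

def rmsd(human_means, model_preds):
    # Root-mean-square deviation between mean judgments and predictions (Eq. 8).
    diff = np.asarray(human_means) - np.asarray(model_preds)
    return np.sqrt(np.mean(diff ** 2))

def normalized_rmsd(human_means, model_preds, std_errors):
    # RMSD with each contingency weighted by the inverse of its standard error (Eq. 9).
    diff = np.asarray(human_means) - np.asarray(model_preds)
    se = np.asarray(std_errors, dtype=float)   # zero s.e. values should be replaced beforehand (cf. footnote 8)
    return np.sqrt(np.sum(diff ** 2 / se) / np.sum(1.0 / se))

# Illustrative numbers only:
print(rmsd([60.0, 35.0, 80.0], [55.0, 40.0, 78.0]))
print(normalized_rmsd([60.0, 35.0, 80.0], [55.0, 40.0, 78.0], [2.0, 5.0, 1.5]))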

We analyzed the performance of the three Bayesian models using data from Experiment 2. The results are shown in Table 2. Among the three Bayesian models, all four metrics mostly agreed with each other. The correlations were the highest for the empirical model, in both the generative and preventive conditions. The results were similar for the RMSD-based metrics, with the empirical model outperforming the other two. This suggests that the empirical model does a better job of capturing the participants' expectations about the strength of causal relationships than either the uniform or SS models, in terms of both the relative and the absolute values of the judgments.9

To assess these results statistically, we performed an analysis using the Mann–Whitney U test to verify whether predictions made by the empirical model are indeed closer to the human responses than those made by the uniform and the SS models. We used the absolute distance between the mean human judgments and the predictions by each model at each contingency as a data point (810 for both causal directions combined). As expected, the mean distance from the human responses for the empirical model (M = 10.10) was significantly lower than those of the uniform model (M = 12.92, U = 279,729, p < 0.001) and the SS model (M = 15.89, U = 263,436, p < 0.001). These results confirmed that the empirical model in fact made better predictions of the human responses than the two other models.
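A minimal sketch of this comparison, using SciPy's Mann–Whitney test on illustrative numbers (the short arrays stand in for the 810 per-contingency absolute deviations and are not the experimental values):

import numpy as np
from scipy.stats import mannwhitneyu

# Absolute deviation of each model's prediction from the mean human judgment,
# one value per contingency (810 per model in the actual analysis).
human_means     = np.array([62.0, 35.0, 80.0, 10.0, 55.0])
empirical_preds = np.array([60.0, 38.0, 78.0, 12.0, 53.0])
uniform_preds   = np.array([50.0, 45.0, 70.0, 20.0, 48.0])

dist_empirical = np.abs(human_means - empirical_preds)
dist_uniform   = np.abs(human_means - uniform_preds)

u_stat, p_value = mannwhitneyu(dist_empirical, dist_uniform, alternative="two-sided")
print(u_stat, p_value)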

8 In three of the contingencies (out of 810), all (24) participants responded with the same prediction (0 in all three cases), and therefore resulted in a zero in one of the denominators in Eq. (9) (s.e._j). In these cases, we replaced the zeroes with the next smallest possible standard error, using the standard error that would have been produced from 23 responses of 0 and one 1.
9 The SS model incorporates a free parameter, α, that was fixed at 5 (Lu et al., 2008). We performed a grid search for the best-performing α, and found that the SS model performed best at α = 0.5, at which value the SS prior is very similar to the uniform prior.


5.3. Discussion

In Experiment 2, we collected a large data set of human judgments in elemental causal induction, using stimuli that evenly cover all plausible contingencies. We then compared the performance of the three Bayesian models. Results showed that the empirical model, based on the priors estimated in Experiment 1 using the iterated learning technique, predicts human behavior better than other existing Bayesian models. This suggests that the empirical model represents the closest approximation of people's computations in causal induction. Furthermore, as Bayesian models rely on priors that accurately represent the reasoners' prior beliefs, the result suggests that the empirical priors best capture participants' expectations about the strength of causal relationships.

So far, both the estimation of the empirical prior and the evaluation of the resulting model used the same simple biological cover story. Having shown that the empirically estimated prior gives good predictions of causal strength judgments, a natural question to ask is how general we should expect this prior to be—whether it generalizes beyond the biological setting. We turn to this question in Experiment 3.

6. Experiment 3: Expectations about different causal systems

The empirical prior from Experiment 1 was estimated using a biological cover story concerning proteins and gene expression. As people's prior beliefs are connected to their prior knowledge about specific causal systems, the empirical prior estimated in Experiment 1 may be specific to this cover story and might not correspond to people's prior beliefs about other causal systems. Therefore, in Experiment 3 we varied the cover stories used in the prior estimation procedure in order to examine in what ways people's prior beliefs concerning different causal systems are similar or different.

6.1. Methods

6.1.1. Participants

Participants were 360 MTurk workers, with 90 participants in each condition. Participants received a payment of US$1.

6.1.2. Procedure

The basic experimental design was based on that of Experiment 1, with the major difference being the cover story. We used four cover stories that represent prior knowledge from four diverse domains: physical sciences, medical reasoning, social behavior, and paranormal phenomena. Instructions in the physical condition were based on the "blicket detector" experiments that have been used to explore causal learning in children (e.g., Gopnik, Sobel, Schulz, & Glymour, 2001; Schulz, Gopnik, & Glymour, 2007; Sobel, Yoachim, Gopnik, Meltzoff, & Blumenthal, 2007). In particular, we used a "super lead detector" cover story that has been used to make this paradigm more natural for adult participants (Griffiths, Sobel, Tenenbaum, & Gopnik, 2011). The instructions read:

In this experiment, please imagine that you are working for a pencil company and you are studying the relationship between a material called "super lead" and machines called "super lead detectors". Pencil lead is made of carbon. Your company recently discovered that a new production process was resulting in a new carbon structure in their pencils—what they call "super lead". Since they are not sure which pencils they previously manufactured contain super lead, they are building a set of machines in order to detect it. These machines are programmed with different parameters to detect different types of carbon structures. You will be testing machines that are set up with different parameters.
There are a number of trials in this experiment. Each trial involves a different type of super lead, and a super lead detector programmed with a different parameter set. You will see some information about how often the machine indicates the presence of super lead with a set of pencils that do not contain super lead, and how often with a set of pencils that do contain a particular type of super lead. You will be then asked to make some predictions based on these pieces of information.


In the medical condition, the cover story concerned the strength of allergy medicines in causing different types of hormonal imbalance. In the social condition, different kinds of music were played to different breeds of dogs, and the inference concerned the degree to which the music caused the dogs to wag their tails. In the paranormal condition, participants read that a number of psychics used their power in trying to cause molecules to emit photons, and therefore the candidate cause and effect were the psychics' power and photon emission, respectively. Detailed instructions from the medical, social, and paranormal conditions are presented in Appendix B.

These four conditions were intended to represent causal domains about which people have different subjective beliefs. Interactions in the physical domain are often deterministic. For example, moving a magnet near objects made of iron will invariably cause an attractive force between them. Everyday social interactions, on the other hand, are more probabilistic, since we often cannot be sure of the responses of others to our actions. The medical condition was designed to represent a causal system with elements from both the physical and the social conditions, whereas the paranormal condition was created to introduce a causal system with which participants had the least experience. We expected people's prior beliefs about these different causal systems to be reflected in the resulting prior distributions, so that we would be able to evaluate the similarities and differences among participants' prior beliefs about various causal systems. While the forms of two-dimensional probability distributions might be difficult to predict, we hypothesized that the participants would hold the most deterministic expectations about the causal systems in the physical condition, followed by the medical, the social, and then the paranormal condition.

The rest of the experimental procedure was largely the same as Experiment 1. Like Experiment 1, each trial had a w0 and a w1 task, with a total of 48 trials—4 chains of 12 within-subjects iterations for each participant. Also similarly, the initial values of (w0, w1) in the four chains were (0.3, 0.3), (0.3, 0.7), (0.7, 0.3), and (0.7, 0.7). The iterated learning process was the same as in Experiment 1.

Part of the instructions in Experiment 3 was changed compared to Experiment 1 in order to address a potential issue. In order to elicit people's prior beliefs about a certain causal system, it is important that participants consider the entity in each trial as a different type under an umbrella category (the same causal system). If the participants regarded each entity as the same type, then it is possible that their inference might rely on data from previous trials and, as a result, the Markov assumption would be violated. Here we used explicit instructions to eliminate this potential issue. The instructions emphasized that the causal relationship considered at each trial was independent of other trials—participants were reminded in the instructions three times that both the candidate cause and effect in each trial represent different types (e.g., in each trial, a different breed of dogs listened to a different kind of music). Thus participants were reminded that while the type is different at each trial, the categories (e.g., dogs in general, music in general) remain consistent. Therefore, at convergence, the hypothesis-data pair should represent the participant's prior beliefs about the causal relationship at the category level.

Beyond the cover stories and the reminders about the distinct types, there were a few differences from Experiment 1. First, we were mainly interested in how the priors associated with each cover story were similar to or different from each other. Consequently the differences between the generative and preventive versions were not the main focus of this experiment, and therefore only the generative version was carried out. Second, in Experiment 3 we added three questions during the instruction phase to test the participants' understanding of the instructions of the experiment, as advocated by Crump et al. (2013). These three questions were all multiple-choice questions and participants had to answer correctly before continuing. If participants chose the wrong answer they were told that their answer was wrong and were allowed to try again until the right answer was selected.

6.2. Results

We estimated the prior distributions in the same manner as in Experiment 1, by pooling all participants' responses in the last iteration and then carrying out a kernel density estimation using a bivariate normal kernel. Similar to Experiment 1, the w0 chains in the four conditions did not converge, whereas the w1 chains in all four conditions converged quickly (by the third iteration in all conditions).
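The smoothing step can be sketched as follows. The original analysis used a bivariate normal kernel (via Venables & Ripley's implementation); the sketch below substitutes SciPy's Gaussian kernel density estimator, which plays the same role, and all names and data are illustrative.

import numpy as np
from scipy.stats import gaussian_kde

def estimate_prior(w0_final, w1_final, grid_size=51):
    # Smooth the final-iteration (w0, w1) responses into a density over a grid on [0, 1] x [0, 1].
    kde = gaussian_kde(np.vstack([w0_final, w1_final]))          # Gaussian (normal) kernel in two dimensions
    grid = np.linspace(0.0, 1.0, grid_size)
    w0_grid, w1_grid = np.meshgrid(grid, grid, indexing="ij")
    density = kde(np.vstack([w0_grid.ravel(), w1_grid.ravel()]))
    return density.reshape(grid_size, grid_size)                 # unnormalized prior density over the grid

# Illustrative call with fabricated responses (360 per condition in Experiment 3):
rng = np.random.default_rng(0)
prior = estimate_prior(rng.uniform(0, 1, 360), rng.beta(5, 1, 360))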

The estimated priors for the four conditions are shown in Fig. 6. Certain similarities and differences can be observed based on visual inspection. Similar to the results from Experiment 1, prior probability in all four conditions is quite high in regions associated with high w1. This pattern is particularly extreme in the physical prior distribution. This suggests that an expectation of strong candidate causes might be a widely held belief, regardless of the specific causal domain.

The medical and social priors are visually similar to each other and to the biological prior estimated in Experiment 1. This might indicate that the participants have somewhat similar prior beliefs about the causal strengths in these conditions. However, as we will see later, statistical tests show that these two prior distributions are distinct. While the paranormal prior is closer to the medical and the social priors than to the physical prior, it has a more pronounced high-density area at the center. This may reflect the fact that the participants were unsure about the causal relationships and responded using the middle of the scale.

Fig. 6. Smoothed empirical estimates of the human priors for the four conditions in Experiment 3 (panels: Physical, Medical, Social, Psychic; axes: w0, w1, and probability).


To formally compare the priors estimated in Experiment 3, we conducted pairwise comparisons using the t-test and the Kolmogorov–Smirnov test. Both tests attempt to identify differences between pairs of distributions, with the t-test focusing on the means and the Kolmogorov–Smirnov test focusing on non-parametric distributional differences. As there were four chains per participant and 90 participants per condition, there were 360 data points in each condition. The results are shown in Table 3, separately for w0 and w1. None of the differences in the tests about w0—six t-tests and six Kolmogorov–Smirnov tests—were statistically significant. This suggests that the participants' expectations about the background cause were not different across conditions. In contrast, for w1, all tests except one (the pairwise t-test comparison between the medical and social condition) were statistically significant, suggesting that participants' expectations about the strength of causes in these different scenarios were different.

We also compared these priors in terms of the degree of determinism, as we hypothesized that people should expect outcomes from these causal systems to be deterministic to different degrees. We expected the physical prior to represent the most deterministic beliefs, and visual inspection supports this expectation—the difference in prior probability density is most extreme in the physical prior. We then tested this formally by operationalizing the degree of determinism as the Euclidean distance between each of the responses in the last iteration and the closest corner. As the four corners—(0, 0), (0, 1), (1, 0), and (1, 1)—represent deterministic relationships, a lower distance represents more deterministic beliefs. The lowest possible value (and highest determinism) is therefore 0, whereas the highest possible value is √(50² × 2) = 70.72.
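This measure can be written in a few lines of code. The sketch below (Python/NumPy; the function name is ours) assumes responses on the 0–100 slider scale, which matches the reported maximum of roughly 70.7.

import numpy as np

CORNERS = np.array([[0, 0], [0, 100], [100, 0], [100, 100]], dtype=float)  # deterministic (w0, w1) settings

def determinism_score(responses):
    # Mean Euclidean distance from each (w0, w1) response to the nearest corner;
    # lower scores indicate more deterministic beliefs.
    responses = np.asarray(responses, dtype=float)                  # shape (n, 2), on the 0-100 scale
    dists = np.linalg.norm(responses[:, None, :] - CORNERS[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

print(determinism_score([[50.0, 50.0]]))   # ~70.71, the least deterministic possible response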

The degree of determinism was indeed in the expected order—the condition with the lowest distance (thus the most deterministic beliefs) was the physical condition (16.98), followed by the medical condition (19.27), then the social condition (22.23), with the paranormal condition being the highest (23.64). The differences were significant between all pairs except the neighbors in this rank ordering (Table 3).

In all four conditions, participants favored a strong candidate cause (w1), and this pattern is particularly strong for the physical condition.

Table 3
Comparison of priors estimated in Experiment 3 based on the t-test, the Kolmogorov–Smirnov test, and degree of determinism.

                               Medical           Social            Paranormal
t-test, w0
  Physical                     1.335 (0.18)      0.013 (0.99)      0.446 (0.66)
  Medical                                        1.359 (0.18)      0.943 (0.35)
  Social                                                           0.445 (0.66)
t-test, w1
  Physical                     5.818 (< 0.01)    5.632 (< 0.01)    8.517 (< 0.01)
  Medical                                        0.546 (0.59)      2.456 (0.01)
  Social                                                           3.117 (< 0.01)
Kolmogorov–Smirnov test, w0
  Physical                     0.075 (0.26)      0.058 (0.57)      0.089 (0.12)
  Medical                                        0.083 (0.16)      0.089 (0.12)
  Social                                                           0.047 (0.82)
Kolmogorov–Smirnov test, w1
  Physical                     0.158 (< 0.01)    0.258 (< 0.01)    0.297 (< 0.01)
  Medical                                        0.111 (0.02)      0.164 (< 0.01)
  Social                                                           0.108 (0.03)
Degree of determinism
  Physical                     59992.0 (0.08)    54926.5 (< 0.01)  52686.5 (< 0.01)
  Medical                                        59616.5 (0.06)    57403.5 (< 0.01)
  Social                                                           62569.5 (0.42)

Note: The figures indicate the respective statistics with p-values in parentheses.


This suggests that participants expected that if the super lead was present, it would almost always activate the super lead detector. At the same time, participants did not expect the background cause to be strong, as can be seen from the less extreme differences in density along the w0 axis. This result can be taken as an example of how people combined a deterministic causal relationship with a probabilistic one.

Despite the differences among them, the shapes of all four priors are at variance with the idea that people have sparse and strong priors. More specifically, none of the prior distributions has peaks at both (0, 1) and (1, 0), as predicted by the SS theory for a generative cause. Furthermore, the SS theory predicts a low-density area around (1, 1)—people do not expect both the background and the candidate causes to be strong—but all four priors estimated in Experiment 3 have high density at (1, 1). Overall, the results do not corroborate the SS theory, similar to what we found in Experiment 1.

We next computed the predictions of the Bayesian models based on these priors and evaluated their performance using the data set obtained in Experiment 2. The results are shown in Table 4. Because the human responses from Experiment 2 were elicited using the same cover story as Experiment 1, we expected the model based on the biological prior estimated in Experiment 1 to have the best performance. Contrary to this expectation, the biological model did not yield the best performance. However, the fact that all empirical models outperformed both non-empirical models suggests that the general shape shared by all empirical priors—belief in a strong candidate cause—might capture an important characteristic of human causal reasoning.

To further investigate the differences in performance among the various models, we performed a bootstrapping analysis using all seven Bayesian models—uniform, SS, biological (Experiment 1), physical, medical, social, and paranormal (Experiment 3). In each bootstrap run, we sampled with replacement the participants in the Experiment 2 data set and calculated the normalized RMSD based on the sample. This procedure was repeated 1000 times for each prior. The results are shown in Fig. 7. The results support what we found previously—models using any of the five priors estimated via iterated learning outperformed Bayesian models with non-empirical priors. This suggests that the general shape of the prior distributions estimated in Experiments 1 and 3 (e.g., the expectation of a strong candidate cause) might capture the most important features of people's prior beliefs, with the variations across cover stories playing a less important role. We then performed pairwise t-tests between these seven models. All results were significant with p < 0.001. This is not surprising as there were 1000 simulation runs for each model, and therefore even small differences were statistically significant.
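The bootstrap can be sketched as follows (Python/NumPy). For simplicity the sketch assumes a complete participants × contingencies matrix of judgments; in the actual design each participant contributed 45 of the 405 contingencies. Names and numbers are illustrative, not the authors' code.

import numpy as np

def bootstrap_norm_rmsd(judgments, model_preds, n_boot=1000, seed=0):
    # Resample participants with replacement and recompute the normalized RMSD each time.
    rng = np.random.default_rng(seed)
    n = judgments.shape[0]
    scores = np.empty(n_boot)
    for b in range(n_boot):
        sample = judgments[rng.integers(0, n, size=n)]           # bootstrap sample of participants
        means = sample.mean(axis=0)
        ses = sample.std(axis=0, ddof=1) / np.sqrt(n)
        ses = np.maximum(ses, 1e-6)                               # guard against zero standard errors
        scores[b] = np.sqrt(np.sum((means - model_preds) ** 2 / ses) / np.sum(1.0 / ses))
    return scores

# Illustrative run with fabricated data: 24 participants, 45 contingencies.
rng = np.random.default_rng(1)
fake_judgments = rng.uniform(0, 100, size=(24, 45))
fake_preds = rng.uniform(0, 100, size=45)
print(bootstrap_norm_rmsd(fake_judgments, fake_preds).mean())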

6.3. Discussion

We estimated participants' prior beliefs about different causal systems using a diverse set of cover stories. We found that while these priors can be separated quantitatively, some major features are shared between them. The most noteworthy of these features can be characterized as an expectation that the candidate cause should be strong. In spite of the diversity of the cover stories and the corresponding causal domains, none of the prior distributions estimated in Experiment 3 demonstrates features predicted by the sparse and strong theory. We also found that participants expected physical causal systems to be the most deterministic, followed by medical, social, and paranormal ones.

We compared these models in terms of how well they predict human judgments in the benchmark data set obtained in Experiment 2. Although there were differences in performance among the empirical models, they all outperformed the non-empirical Bayesian models, suggesting that the general shape of the empirical priors might be the key to understanding human causal induction.

Table 4
Performance of models based on various empirical priors estimated in Experiments 1 and 3 with respect to human data collected in Experiment 2.

Metric          Biological (Expt. 1)   Physical (Expt. 3)   Medical (Expt. 3)   Social (Expt. 3)   Paranormal (Expt. 3)
r (Pearson)     0.90                   0.91                 0.89                0.93               0.93
ρ (Spearman)    0.92                   0.91                 0.91                0.94               0.95
RMSD            13.49                  13.88                15.26               10.58              11.38
Norm. RMSD      10.30                  9.91                 11.63               8.50               9.14


Fig. 7. Density plot of the performance of the various Bayesian models in the bootstrap analysis. Each curve represents the performance, measured in normalized RMSD, of a model based on 1000 runs. The legend is listed in order of performance (better performance listed first): social, psychic, physical, biological (Expt. 1), medical, uniform, SS. Hence the first model listed in the legend (social) corresponds to the left-most density curve, and so on.


The family resemblance among these priors suggests that their common features might reflect a general expectation about the nature of causal relationships—a general causal belief that people fall back on in the absence of strong a priori subjective beliefs concerning the specific causal system being considered. More specifically, it suggests that, in many scenarios, people might expect candidate causes to be strong. Finally, these results suggest that beliefs about causal relationships might be maintained at multiple levels of abstraction—domain-general theories that prescribe the rough shape of the prior distribution and domain-specific hypotheses about particular causal systems that control the finer details.

7. General discussion

In Experiment 1, we presented an experiment using the iterated learning technique to empirically estimate people's expectations about causal relationships in an elemental causal induction task. These expectations were operationalized as prior probability distributions over the strengths of causes. We found the resulting empirical priors to be markedly different from those previously proposed.

In Experiment 2, we obtained a large-scale data set that could be used as a benchmark for evaluating the performance of different models. To address the issue of selective coverage of contingencies seen in previous studies, we used a set of stimuli that uniformly covered the space of all plausible contingencies. We compared various Bayesian models of causal learning against the human responses and found that the best performance was produced by the Bayesian model using the empirical priors found in Experiment 1. This suggests that the empirical priors best capture the participants' expectations about causal relationships.

Experiment 3 used the iterated learning procedure to estimate participants' expectations about causal relationships in a diverse set of domains. The results show that the prior distributions estimated using different cover stories share some major features (e.g., the expectation of a strong candidate cause) while having minor differences—suggesting that the participants have differentiated expectations while at the same time maintaining certain common theories about the nature of causal relationships. Moreover, we found that Bayesian models with empirical priors, regardless of the cover stories through which the priors were obtained, outperformed Bayesian models with non-empirical priors. This suggests that the major features captured in the empirical priors are able to explain a significant share of how people make causal inductions.


In the remainder of the paper we discuss some of the issues raised by the work presented here. We will consider some of the limitations of our method, identify possible directions for future research, and highlight some of the conclusions that can be drawn from our results.

7.1. Implications for models of causal induction

The implications of our results are most direct for Bayesian models of causal induction: our analysis of models based on different prior distributions suggests that future work can benefit from using a prior distribution similar to the empirical distributions we estimated in these experiments. Our results showed that good predictions could be obtained even using a prior distribution estimated from a different domain, so we hope that these priors will have some generality even as researchers explore different cover stories for experiments on causal induction.

This paper focuses on comparing Bayesian models with different prior distributions and has omitted comparison with non-Bayesian models (e.g., ΔP, causal power, the EI rule). These non-Bayesian models do not have analogs of prior distributions, and our methodology was defined on the assumption that it was reasonable to model people's judgments in terms of Bayesian inference. However, the finding that people tend to assume causes are strong may nonetheless be useful for these other models. For example, associative models of causal learning (e.g., Shanks, 1995) need to make assumptions about the default or initial strengths of causes, and people's expectations might be relevant to setting these model parameters. Similarly, the findings here can also contribute to Bayesian models that do not use the noisy-OR/noisy-AND-NOT functional forms (Anderson, 1990, 1991), or models that approximate Bayesian computations (Beam & Miyamoto, 2013).

The forms of the empirical priors were somewhat unexpected, as both the generative and preventive cases were significantly different from those previously proposed (Griffiths & Tenenbaum, 2005; Lu et al., 2008). However, they are consistent with previous work that suggested people have a preference for deterministic causal relationships (Frosch & Johnson-Laird, 2011; Lucas & Griffiths, 2010; Schulz & Sommerville, 2006). This finding provides a new link between formal models of causal induction from contingency data and the broader literature on causal learning.

7.2. Limitations and future directions

Our results provide a clear characterization of our participants' expectations about the strength of causal relationships, and lead to more accurate predictions of human judgments of causal strength. However, there remain a number of important issues that need to be explored in future work.

7.2.1. Origins of priors on causal strength

Having identified the structure of our participants' expectations about causal relationships, a natural question is how those expectations were formed. Following Anderson (1991), we might expect that people's priors reflect an adaptive response to their learning environment. In general, if people's priors capture the statistical regularities in their environment, such prior beliefs will allow them to quickly learn about the relationships (Lu et al., 2008). This adaptive proposal, however, raises many questions—how people evaluate data in their environment, how people reason about such data in their everyday lives, whether there are kinds of causal relationships that are considered more important, etc.—that we have only just begun to explore. The first step in evaluating these questions is to develop a more comprehensive picture of the contexts in which people acquire knowledge about causal relationships and their properties.

While the current paper does not attempt to answer what kinds of mechanisms might have given rise to such priors, researchers have explored this issue both theoretically and empirically. For example, hierarchical Bayesian models have been suggested as a computational-level explanation of how people might acquire causal knowledge (Kemp, Perfors, & Tenenbaum, 2007; Lucas & Griffiths, 2010; Tenenbaum et al., 2006). Combining advanced modeling techniques and broad data sets, such as those collected in the current paper, is likely to be the key to further developments in our understanding of the acquisition of this kind of knowledge.


7.2.2. Individual and cultural differences

An assumption of the iterated learning technique is that people share the same prior beliefs with respect to any particular causal system. While our results suggest that this is a reasonable approximation to the truth, insofar as we are able to obtain good predictions of aggregate behavior using the prior estimated by our method, it is undoubtedly a false assumption. Although labels like "people's priors" are used in this paper, the priors obtained in the current research certainly do not represent the expectations of every individual in the world. We expect that there will be differences between individuals and between cultural groups in their expectations about causal relationships. The methods that we have used here could be adapted to explore these patterns of variation.

While there has not been much cross-cultural research studying causal induction, numerous previous studies have demonstrated significant cultural differences in causal attribution (Choi, Nisbett, & Norenzayan, 1999; Morris & Peng, 1994; Nisbett, Peng, Choi, & Norenzayan, 2001). Because causal attribution is based upon causal knowledge gained previously from a lifetime of experience, it is possible that cultural differences in causal attribution might have originated from cultural differences in causal learning. Having established iterated learning to be a useful tool for investigating subjective beliefs on causal strength, we anticipate that this approach can be used to explore whether and how priors in causal learning vary across cultures.

7.2.3. Domain specificity of people's expectations

Many previous studies on causal induction have used stimuli in the biological domain, because it provides plausible causal relationships between variables, and the functional form of these relationships is simple. Indeed, findings based on particular sets of stimuli have often been generalized broadly. However, as we demonstrated in Experiment 3, people's expectations about different causal systems can be fairly different. It is therefore prudent to reevaluate the conclusions drawn from previous findings and carefully consider the extent to which they can be generalized.

Moreover, a few points can be made with respect to interpreting the current results and integrating them with previous findings. First, the prevalence of studies with similar cover stories and formal structure means that the findings in this paper will be relevant to ongoing research in causal induction. Second, as we have argued, the estimated prior distributions based on the several different cover stories in Experiment 3 shared a number of common features. Therefore our results and analyses provide a case study that we expect to have broad applicability. Finally, since we have developed the tools to estimate people's prior beliefs with respect to any causal system, a similar approach can be applied in other domains to obtain a clear picture of how people make causal inferences in these situations. Indeed, the variation in the nature of the prior knowledge that informs causal induction in different domains is one of the many attractive attributes that make it a rich setting in which to explore the nature of human inductive inference (Griffiths & Tenenbaum, 2009).

7.2.4. Complexity of causal systems

An additional limitation of the current research is that, by focusing on the simplest possible causal structure, our experiments did not address how people reason about causal systems in which more than one candidate cause is present. Some recent studies have examined how people estimate the strength of multiple causes (e.g., Novick & Cheng, 2004). Adapting the iterated learning procedure to these studies might provide a way to investigate people's prior beliefs about complex causal relationships and how people reason about them.

One particularly promising application of this approach is to understand people's expectations about the functional form of causal relationships. Lucas and Griffiths (2010) recently showed that people can learn different schemes for combining the strength of causes through a relatively small amount of experience with a novel causal system. This analysis assumed a particular prior distribution on functional forms—that people favor disjunctive over conjunctive causes—and iterated learning could be used to evaluate this assumption.

7.2.5. Evaluating models of causal induction

Our results also raise questions concerning how the performance of causal induction models should be measured and compared. Most previous work on computational models of causal induction compared model predictions with human responses at specific sets of contingencies. These sets of contingencies were usually chosen by researchers to highlight the different predictions made by the models being compared. We have discussed the issue concerning the distribution of contingencies used in previous studies and the need for an unbiased data set to objectively compare these models. In Experiment 2, we obtained human judgments for a wide range of stimuli that uniformly cover the space of all plausible contingencies. We see our results as providing a new benchmark against which models of causal induction can be compared. However, it could be argued that the best way to evaluate models of causal induction is to compare them based on stimuli that accurately capture the frequencies of instances of causal induction that people face in their everyday life. Developing a way to assess the natural statistics of causal relationships will be an interesting direction for future research.

7.3. Conclusion

Expectations about causal strength play a key role in causal induction: they tell us what magnitude of causal influence people are looking for when they are trying to identify causal relationships. Accurately characterizing these expectations is a key problem for Bayesian models of causal induction. Since these models typically share the same assumptions about the hypotheses that people are evaluating (which are encapsulated in the causal graphical models with particular parameterizations), prior distributions are the only differentiating factor. Using iterated learning to estimate prior distributions over causal strength reveals that both the default assumption of uniform priors and the claim that people expect causes to be "sparse and strong" might be incorrect. Our results provide an unusually clear picture of people's intuitions about causal relationships, and reveal that people expect causes to have a high probability of producing an effect, regardless of the rate at which that effect tends to occur.

In addition to providing this picture of people's expectations, we have collected the largest and most comprehensive dataset of human judgments of causal strength based on contingency data. This dataset provides an important resource for future work on causal induction, providing a means of evaluating models of causal induction without concerns about biased sampling in the space of contingencies. Using this dataset, we have shown that the priors estimated through our experiments result in parameter-free Bayesian models of causal induction that outperform previous models in accounting for people's judgments.

Beyond causal induction, we see these results as illustrating how two of the key challenges of cognitive psychology can be addressed through innovative methodologies. First, the psychological construct of interest in an area of research, such as the prior distribution on causal strengths studied in this paper, might not be a quantity amenable to direct measurement. Traditionally, psychological theories and related computational models are often justified based on how well human behavior can be predicted by model output. We have presented an alternative to this approach: directly estimating the psychological construct of interest using the technique of iterated learning. Second, researchers often tend to focus on thin slices of experimental data as support for different computational models. These data might have been chosen because they demonstrate interesting phenomena, or because they allow for easier comparison between models. However, the coverage of stimuli in these studies is often quite limited. In the present paper we tackled this issue by using the larger number of participants available online to test a broad range of stimuli, in a way that is agnostic to specific hypotheses. We see this combination of new methods for measuring psychological constructs and new approaches to experimental design as opening a new pathway towards resolving some of the fundamental questions of cognitive psychology.

Acknowledgments

A previous version of this work was presented at the 33rd Annual Conference of the Cognitive Science Society. We thank Doug Medin, Mike Oaksford, David Shanks, Tom Beckers, Michael Ranney, and three anonymous reviewers for helpful comments. We also thank Hongjing Lu for providing the code for the sparse and strong model, and data from the paper. This work was supported by grants FA9550-13-1-0170 and FA-9550-10-1-0232 from the Air Force Office of Scientific Research, a grant from the McDonnell Causal Collaborative, Basic Research Funds from Beijing Institute of Technology, a grant from Cher Wang and Wenchi Chen of VIA Technologies, National Natural Science Foundation of China (Program Code: 71373020), Beijing Program of Philosophy and Social Science (Program Code: 13JYB010), and the Learning Science and Educational Development Laboratory at the Institute of Education, Beijing Institute of Technology.

Appendix A. Computing causal strength estimates

We computed the predictions made by the different Bayesian models used in this paper based on the procedures used by Griffiths and Tenenbaum (2005) and Lu et al. (2008). The predicted causal strength is calculated as the mean of the posterior, which is in turn based on the prior distribution p(w0, w1) and the likelihood p(d | w0, w1). We will cover the computation of the likelihood and prior terms in order.

The likelihood term is shared among all Bayesian models and is based on the noisy-OR or noisy-AND-NOT parameterizations (Pearl, 2009). Given a generative cause, and w0 and w1, the probability of observing effect e in one case is given by the noisy-OR version:

p(e^+ \mid b, c; w_0, w_1) = 1 - (1 - w_0)^b (1 - w_1)^c    (10)

in which b and c are Boolean variables representing the presence or absence of the background and the candidate cause respectively. Note that in the current model the background cause is always present. Therefore the probability of observing e with c present is w_0 + w_1 - w_0 w_1, whereas the probability of observing e with c absent is w_0.

For the preventive case, the probability is given by the noisy-AND-NOT version:

p(e^+ \mid b, c; w_0, w_1) = w_0^b (1 - w_1)^c.    (11)

On observing contingency data d, which consists of the values of N(e, c), the full likelihood in the generative case is

p(d \mid w_0, w_1) = \binom{N(c^-)}{N(e^+, c^-)} \binom{N(c^+)}{N(e^+, c^+)} \, w_0^{N(e^+, c^-)} (1 - w_0)^{N(e^-, c^-)} (w_0 + w_1 - w_0 w_1)^{N(e^+, c^+)} \big(1 - (w_0 + w_1 - w_0 w_1)\big)^{N(e^-, c^+)}    (12)

while the full likelihood in the preventive case can be found by replacing the individual likelihood terms with ones from Eq. (11).

The prior term is formulated using the joint probability of w0 and w1, and is the sole difference between the various Bayesian models. In the uniform model, p(w0, w1) = 1, representing the same prior probability no matter what values w0 and w1 take. In the SS model, the unnormalized prior probability is computed using Eqs. (6) and (7). In the empirical model, the prior probability is based on the results of Experiment 1. We took the final iteration of the human judgments and smoothed these values via kernel density estimation with a bivariate normal kernel (Venables & Ripley, 2002). The results are plotted in Fig. 4. An alternative way to compute the model predictions would be to use the human responses directly as samples from the prior, obtaining a Monte Carlo approximation to the posterior via importance sampling (e.g., Neal, 1993).

The posterior is found by multiplying the prior and the likelihood:

p(w_0, w_1 \mid d) \propto p(w_0, w_1)\, p(d \mid w_0, w_1).    (13)

The posterior for either of the causal strength parameters can be found by integrating the joint posterior over the other parameter. To convert this posterior into a point value, we first normalize the posterior and then take the expected value by integrating over the desired w_i. For example, for w1, the model prediction is:

\bar{w}_1 = \int_0^1 w_1\, p(w_1 \mid d)\, dw_1.    (14)


In this paper the mean posteriors for all three Bayesian models were computed numerically. We first set up a grid of 51 × 51 points in the w0 × w1 space and computed the prior density for each grid point. We then multiplied the prior by the likelihood at each grid point and normalized it to obtain the posterior. The values along the w0 axis on this 2-dimensional grid were summed out to find the posterior for w1. Finally the mean of the posterior w1 was taken to be the model prediction.
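This numerical procedure can be sketched as follows for the generative (noisy-OR) case. A uniform prior is used here for concreteness; an empirical prior would simply replace the array passed as prior. The sketch (Python/NumPy, with function and argument names of our choosing) illustrates the computation described above and is not the authors' original implementation.

import numpy as np
from scipy.special import comb

def posterior_mean_w1(n_e_c1, n_c1, n_e_c0, n_c0, prior=None, grid_size=51):
    # Posterior mean of w1 on a grid over (w0, w1), generative (noisy-OR) parameterization.
    w = np.linspace(0.0, 1.0, grid_size)
    w0, w1 = np.meshgrid(w, w, indexing="ij")      # axis 0 indexes w0, axis 1 indexes w1

    p_e_c1 = w0 + w1 - w0 * w1                     # P(e+ | c+), Eq. (10) with b = c = 1
    p_e_c0 = w0                                    # P(e+ | c-): background cause alone

    likelihood = (comb(n_c0, n_e_c0) * p_e_c0 ** n_e_c0 * (1 - p_e_c0) ** (n_c0 - n_e_c0)
                  * comb(n_c1, n_e_c1) * p_e_c1 ** n_e_c1 * (1 - p_e_c1) ** (n_c1 - n_e_c1))  # Eq. (12)

    if prior is None:
        prior = np.ones_like(likelihood)           # uniform prior: p(w0, w1) = 1
    posterior = prior * likelihood
    posterior /= posterior.sum()                   # normalize over the grid (Eq. 13)

    p_w1 = posterior.sum(axis=0)                   # sum out w0
    return float(np.sum(w * p_w1))                 # posterior mean of w1 (Eq. 14)

# Example: the effect occurs in 6 of 8 cases with the cause and 2 of 8 without.
print(posterior_mean_w1(6, 8, 2, 8))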

Appendix B. Experiment 3 instructions

The instructions for the medical, social, and paranormal conditions were as follows:

B.1. Medical

In this experiment, please imagine that you are a researcher working for a medical company and you are studying the relationship between some allergy medicines and hormonal imbalance as a side effect of these medicines.

Your company recently discovered that a new production process was resulting in changes in the molecular structures in the allergy medicines, and these new medicines cause abnormal levels of hormones in people. Since they are not sure which medicines they previously manufactured might cause anomalies in which type of hormone, you are tasked with investigating this.

There are a number of trials in this experiment and each trial involves a different type of medicine and a different hormone. You will see some information about how often people who don't take the medicine have a particular kind of hormonal imbalance, and how often people who take that medicine have the same kind of hormonal imbalance. You will be then asked to make some predictions based on these pieces of information.

B.2. Social

In this experiment, please imagine that you are an animal researcher and you are studying the relationship between music and the tail-wagging behavior of different dog breeds.

You have found that some dogs would wag their tails after listening to some kinds of music. Since you are not sure what kind of music might cause which breed of dog to wag their tails, you have decided to investigate this.

There are a number of trials in this experiment and each trial involves a different kind of music and a different breed of dogs. For each kind of music, you will see some information about how often dogs who were not played the music wagged their tails, and how often dogs who were played the music wagged their tails. You will be then asked to make some predictions based on these pieces of information.

B.3. Paranormal

In this study, please imagine that you are a physics researcher and you are studying the relationship between psychic power and the behavior of molecules.

All molecules that you are currently investigating share a characteristic in that they all emit photons at random intervals, but at different rates. A number of psychics have claimed that they can make these molecules emit photons within a minute of when they use their power. You are tasked with investigating this.

There are a number of trials in this study and each trial involves a different psychic and a different type of molecule. For each psychic, you will see some information about how many molecules have emitted photons when a particular psychic was simply standing next to the molecules, and how many of them have emitted photons following when the psychic used his/her power. You will be then asked to make some predictions based on these pieces of information.


References

Ahn, W.-k., Kalish, C. W., Medin, D. L., & Gelman, S. A. (1995). The role of covariation versus mechanism information in causal attribution. Cognition, 54(3), 299–352.
Allan, L. G., & Jenkins, H. M. (1983). The effect of representations of binary variables on judgment of influence. Learning and Motivation, 14(4), 381–405.
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Erlbaum.
Anderson, J. R. (1991). The adaptive nature of human categorization. Psychological Review, 98(3), 409–429.
Beam, C. S., & Miyamoto, J. M. (2013). A quasi-Bayesian cascaded inference model of causal power. Poster session presented at the annual meeting of the Psychonomic Society, Toronto, ON, Canada.
Buehner, M. J., Cheng, P. W., & Clifford, D. (2003). From covariation to causation: A test of the assumption of causal power. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(6), 1119–1140.
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3–5.
Chater, N., & Vitányi, P. (2003). Simplicity: A unifying principle in cognitive science? Trends in Cognitive Sciences, 7(1), 19–22.
Cheng, P. W. (1997). From covariation to causation: A causal power theory. Psychological Review, 104(2), 367–405.
Choi, I., Nisbett, R. E., & Norenzayan, A. (1999). Causal attribution across cultures: Variation and universality. Psychological Bulletin, 125(1), 47–63.
Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PloS ONE, 8(3), e57410.
Dennis, M. J., & Ahn, W.-K. (2001). Primacy in causal strength judgments: The effect of initial evidence for generative versus inhibitory relationships. Memory & Cognition, 29(1), 152–164.
Edwards, A., Elwyn, G., & Mulley, A. (2002). Explaining risks: Turning numerical data into meaningful pictures. BMJ, 324(7341), 827–830.
Frosch, C. A., & Johnson-Laird, P. N. (2011). Is everyday causation deterministic or probabilistic? Acta Psychologica, 137, 280–291.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. Suffolk, UK: Chapman & Hall/CRC.
Gopnik, A., Sobel, D. M., Schulz, L. E., & Glymour, C. (2001). Causal learning mechanisms in very young children: Two-, three-, and four-year-olds infer causal relations from patterns of variation and covariation. Developmental Psychology, 37(5), 620–629.
Griffiths, T. L., Christian, B. R., & Kalish, M. L. (2008). Using category structures to test iterated learning as a method for identifying inductive biases. Cognitive Science, 32(1), 68–107.
Griffiths, T. L., & Kalish, M. L. (2007). Language evolution by iterated learning with Bayesian agents. Cognitive Science, 31(3), 441–480.
Griffiths, T. L., Sobel, D. M., Tenenbaum, J. B., & Gopnik, A. (2011). Bayes and blickets: Effects of knowledge on causal induction in children and adults. Cognitive Science, 35(8), 1407–1455.
Griffiths, T. L., & Tenenbaum, J. B. (2005). Structure and strength in causal induction. Cognitive Psychology, 51(4), 334–384.
Griffiths, T. L., & Tenenbaum, J. B. (2009). Theory-based causal induction. Psychological Review, 116(4), 661–716.
Jaynes, E. T., & Bretthorst, G. L. (2003). Probability theory: The logic of science. West Nyack, NY: Cambridge University Press.
Kalish, M. L., Griffiths, T. L., & Lewandowsky, S. (2007). Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin & Review, 14(2), 288–294.
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321.
Kirby, S. (2001). Spontaneous evolution of linguistic structure: An iterated learning model of the emergence of regularity and irregularity. IEEE Journal of Evolutionary Computation, 5, 102–110.
Lober, K., & Shanks, D. R. (2000). Is causal induction based on causal power? Critique of Cheng (1997). Psychological Review, 107(1), 195–212.
Lucas, C. G., & Griffiths, T. L. (2010). Learning the form of causal relationships using hierarchical Bayesian models. Cognitive Science, 34(1), 113–147.
Lu, H., Yuille, A. L., Liljeholm, M., Cheng, P. W., & Holyoak, K. J. (2008). Bayesian generic priors for causal learning. Psychological Review, 115(4), 955–984.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco, CA: WH Freeman.
Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1), 1–23.
Morris, M. W., & Peng, K. (1994). Culture and cause: American and Chinese attributions for social and physical events. Journal of Personality and Social Psychology, 67(6), 949–971.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Technical Report CRG-TR-93-1). Toronto, ON: University of Toronto.
Newsome, G. L. (2003). The debate between current versions of covariation and mechanism approaches to causal inference. Philosophical Psychology, 16(1), 87–107.
Nisbett, R. E., Peng, K., Choi, I., & Norenzayan, A. (2001). Culture and systems of thought: Holistic versus analytic cognition. Psychological Review, 108(2), 291–310.
Novick, L. R., & Cheng, P. W. (2004). Assessing interactive causal influence. Psychological Review, 111(2), 455–485.
Pearl, J. (2009). Causality: Models, reasoning and inference (2nd ed.). New York, NY: Cambridge University Press.
Perales, J. C., & Shanks, D. R. (2007). Models of covariation-based causal judgment: A review and synthesis. Psychonomic Bulletin & Review, 14(4), 577–596.
Ross, J., Irani, L., Silberman, M., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of the 28th international conference extended abstracts on human factors in computing systems (pp. 2863–2872). New York, NY: ACM.
Schulz, L. E., Gopnik, A., & Glymour, C. (2007). Preschool children learn about causal structure from conditional interventions. Developmental Science, 10(3), 322–332.
Schulz, L. E., & Sommerville, J. (2006). God does not play dice: Causal determinism and children's inferences about unobserved causes. Child Development, 77(2), 427–442.

Shanks, D. R. (1995). The psychology of associative learning. Cambridge, MA: Cambridge University Press.Shanks, D. R., Lopez, F. J., Darby, R. J., & Dickinson, A. (1996). Distinguishing associative and probabilistic contrast theories of

human contingency judgment. Psychology of Learning and Motivation, 34, 265–311.Sobel, D. M., Yoachim, C. M., Gopnik, A., Meltzo, A. N., & Blumenthal, E. J. (2007). The blicket within: Preschoolers’ inferences

about insides and causes. Journal of Cognition and Development, 8(2), 159–182.Spirtes, P., Glymour, C. N., & Scheines, R. (2001). Causation, prediction, and search. Cambridge, MA: The MIT Press.Sprouse, J. (2011). A validation of Amazon mechanical turk for the collection of acceptability judgments in linguistic theory.

Behavior Research Methods, 43(1), 1–13.Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in

Cognitive Sciences, 10(7), 309–318.Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction.

Science, 331(6022), 1279–1285.Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S. New York, NY: Springer-Verlag.Ward, W. C., & Jenkins, H. M. (1965). The display of information and the judgment of contingency. Canadian Journal of

Psychology, 19(3), 231–241.Williams, D. A., & Docking, G. L. (1995). Associative and normative accounts of negative transfer. The Quarterly Journal of

Experimental Psychology, 48(4), 976–998.Xu, J., Dowman, M., & Griffiths, T. L. (2013). Cultural transmission results in convergence towards colour term universals.

Proceedings of the Royal Society B: Biological Sciences, 280(1758).Xu, J., & Griffiths, T. L. (2010). A rational analysis of the effects of memory biases on serial reproduction. Cognitive Psychology,

60(2), 107–126.