Midterm mean = 38.9 ± 12.9 >50 = A >40 = B>29 = C.


Page 1:

Midterm

[Figure: histogram of midterm scores; x-axis: score (0 to 70), y-axis: count (0 to 7)]

mean = 38.9 ± 12.9

>50 = A, >40 = B, >29 = C

Page 2:

Meta-Analysis

Ecological and evolutionary studies increasingly synthesize the large body of existing data, looking for overall trends across species and ecosystems to identify general principles

- termed “meta-analysis”, such studies do not generate data but rather analyze patterns across hundreds of pre-existing datasets

While such studies can be very powerful, they also may suffer from a crucial underlying problem: treating species as independent data-points

Page 3:

Meta-Analysis

Standard frequentist statistics rely on some basic underlying assumptions

- one critical assumption is that all observations (data points) are independent, and contribute equally to sample size

In turn, sample size influences statistical significance: the odds of seeing an apparent trend when there isn’t really one present in the data

P < 0.05 means there is a less than 5% chance of seeing this trend by chance alone, when no trend is actually present

- i.e., if you had a huge number of samples, the “trend” would disappear

Page 4:

Statistical modelling or analysis of data

Statistical analyses generally produce three things:

a) the model: an equation describing the relationship among variables

- the model has some parameters, which are constants estimated by the analysis; in this example, parameters are the slope and intercept

[Figure labels: model; parameters]

Page 5:

Statistical modelling or analysis of data

- the model also contains an error term (+ ε) for the variation left unexplained by the modeled relationship, but let’s ignore that for now

[Figure labels: parameters; error term (+ ε)]
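Written out, the model on these slides is presumably a simple linear regression. The symbols below (β0 for the intercept, β1 for the slope, x and y for the two variables) are assumed notation, not taken from the slide:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```

Here β0 and β1 are the parameters estimated by the analysis, and ε is the error term.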

Page 6:

Statistical modelling or analysis of data

Statistical analyses generally produce three things:

a) the model: an equation describing the relationship among variables

b) a measure of goodness-of-fit: how well does the model fit the data, or describe the relationship

c) a significance level of the model fit

[Figure labels: model; goodness-of-fit, a measure of how well the model explains a pattern or trend in the data]
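As a minimal sketch of outputs (a) and (b), the snippet below fits a straight line by ordinary least squares and reports r2 as the goodness-of-fit. The function name `fit_line` and the data values are hypothetical, invented purely for illustration; the significance level (c) is not computed here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r2), where r2 is the proportion of
    variance in y explained by the fitted line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give the goodness-of-fit
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2


# Hypothetical data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r2 = fit_line(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r2 = {r2:.3f}")
```

The slope and intercept are the model parameters; r2 close to 1 means the line explains most of the variation in the data.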

Page 7:

Statistical modelling or analysis of data

Statistical analyses are sensitive to sample size: the larger the sample size, the smaller (more significant) the P-value will be, given the same basic goodness-of-fit

The critical underlying assumption here is that each point represents a truly independent measure of the relationship; “duplicate” observations artificially inflate the sample size
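To make the sample-size effect concrete, here is a rough sketch that holds the correlation fixed and varies n. It uses the Fisher z approximation for the P-value of a correlation coefficient, which is a stand-in chosen so that only the standard library is needed, not necessarily the test used in the lecture. The function name `approx_p_value` and the value r = 0.5 are assumptions for illustration.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided P-value for a correlation r from n independent points.

    Uses the Fisher z transformation: atanh(r) * sqrt(n - 3) is treated as
    a standard normal deviate. Requires n > 3.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.5  # same strength of relationship in every case (hypothetical value)
for n in (10, 17, 100):
    print(f"n = {n:3d}  P ~ {approx_p_value(r, n):.2g}")
```

The same r becomes "more significant" as n grows, which is why counting non-independent points as separate observations matters so much.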

Page 8:

Statistical modelling or analysis of data

- suppose you wanted to analyze the relationship between brain size and population size in vertebrates, to test whether smaller populations select for larger brains due to increased competition for resources (or whatever reason)

n = 17

P < 0.0001

Page 9:

Statistical modelling or analysis of data

n = 17, P < 0.0001

[Figure labels: fish; birds; primates]

What if...

1) most fish are just plentiful and dumb;

2) most birds have medium-sized populations and sing to communicate, so have evolved larger brains;

3) primates live in small tribes due to ecological constraints, and have coincidentally evolved advanced communication

What is your sample size here, REALLY?

Page 10:

Statistical modelling or analysis of data

n = 17, P < 0.0001

[Figure labels: fish; birds; primates]

If all primates correspond to one evolutionary “sample”, and within that group population size and brain size covary because that’s the ancestral condition in primates...

... then your actual sample size may be n = 3, not n = 17

n = 3, P > 0.1
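The scenario on this slide can be sketched numerically. In the toy dataset below (17 made-up species in three clades; every number is hypothetical and chosen only to illustrate the point), brain size and population size are strongly correlated across all species, yet completely uncorrelated within each clade, so the whole pattern rests on three group differences.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (log population size, relative brain size) pairs per species
clades = {
    "fish": [(8, 1), (9, 1), (8, 2), (9, 2), (8.5, 1.5), (8.5, 1.5)],
    "birds": [(5, 4), (6, 4), (5, 5), (6, 5), (5.5, 4.5), (5.5, 4.5)],
    "primates": [(2, 7), (3, 7), (2, 8), (3, 8), (2.5, 7.5)],
}

# Correlation when every species is treated as an independent point
all_points = [point for points in clades.values() for point in points]
r_all = pearson_r([p[0] for p in all_points], [p[1] for p in all_points])

# Correlation within each clade separately
r_within = {
    name: pearson_r([p[0] for p in points], [p[1] for p in points])
    for name, points in clades.items()
}

print(f"all {len(all_points)} species: r = {r_all:.2f}")
for name, r in r_within.items():
    print(f"within {name}: r = {r:.2f}")
```

If the three clades are the real evolutionary "samples", the strong overall correlation is supported by only three independent observations.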

Page 11:

Exploring relationships among traits

This statistical issue is critical to a large field of biology that seeks to identify relationships among traits or characteristics shared by species

For instance...

- do organisms lose eyes as a result of living in caves?

- do flying clades contain more species than non-flying clades?

- do promiscuous species have smaller brains?

- do self-pollinating plant species go extinct at higher rates?

To test these hypotheses, we need to test for a significant correlation between traits among species

Page 12:

Species are not independent observations

species are not independent data points, because they share common ancestry with close relatives

- related species may be similar because they all evolved from a common ancestor with a certain combination of traits or features

[Figure labels: brain size; pop. size]

- two totally different possibilities:

1) population size affects the evolution of brain size

2) these traits have no effect on each other’s evolution, but have simply been co-inherited by chance

Page 13:

Species are not independent observations

[Figure labels: brain size; pop. size]

are these 8 independent datapoints, or one evolutionary event that produced an apparent association between brain and population size?

Page 14:

Exploring relationships among traits

This statistical issue is critical to a large field of biology that seeks to identify relationships among traits or characteristics shared by species

This issue, like almost everything, was first raised by Darwin:

“We may often falsely attribute to correlated variation structures which are common to whole groups of species, and which in truth are simply due to inheritance; for an ancient progenitor may have acquired through natural selection some one modification in structure, and, after thousands of generations, some other and independent modification; and these two modifications, having been transmitted to a whole group of descendants with diverse habits, would naturally be thought to be in some necessary manner correlated.” (Darwin, 1872; 6th ed. of Origin of Species)

Page 15:

Comparative methods

Attention was brought to this issue in modern times by Felsenstein (1985) and Harvey & Pagel (1991)

Felsenstein J. 1985. Phylogenies and the comparative method. Am. Nat. 125:1–15.

Harvey P.H., Pagel M.D. 1991. The comparative method in evolutionary biology. Oxford: Oxford University Press.

Nevertheless, researchers continue to ignore this issue to an alarming degree

Ecologists in particular often ignore this issue in meta-analyses because it requires an estimate of phylogeny that may not be available for any given set of species

Page 16:

Similarity among species that is due to shared ancestry is termed “phylogenetic effects”

- all else being equal, closely related species are expected to resemble each other in suites of traits, which may therefore appear to be evolving in a correlated manner

[Figure: two panels showing traits 1 and 2: (a) correlated trait evolution; (b) apparent correlation due to phylogenetic effects]

Page 17:

Characters and character states

Any character (= trait) can have two or more possible states, or ways that trait can be expressed

- the simplest states are present / absent

- some traits are binary, meaning they only have two possible character states (big/small, smart/dumb, slutty/not slutty)

- others are multi-state, meaning they have >2 possible states:

For instance, the character “trophic mode” could have states: 1) herbivore; 2) carnivore; 3) autotrophic (photosynthetic)

The character “habitat” could have states: 1) in trees; 2) under rocks; 3) inside human host

Page 18:

[Figure: phylogeny showing how 2 traits (1 and 2) are distributed today in 8 living species, traced back to the last common ancestor of the 8 species. Legend (symbols lost in extraction): brain size (small, big); population (small, big)]

Page 19:

[Figure: how 2 traits (1 and 2) are distributed today in 8 living species, traced back to the last common ancestor of the 8 species]

If a change in trait 1 is always followed by a change in trait 2, then the two traits are evolving in a correlated manner

Page 20:

Comparative methods test for trends among species

When change in one trait is repeatedly associated with change in another trait, the two are evolving in a correlated manner

- comparative methods can test for, and identify, when this has happened by correcting for phylogenetic effects

model the evolution of traits across a phylogeny

- change in traits is more likely to occur on longer branches (= more time for evolution to happen)

- thus, change is less likely to happen by chance between closely related species (they are separated by short branch lengths)

[Figure labels: change likely, somewhere along here; change unlikely on short branches]
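The branch-length argument can be sketched with the simplest possible trait model: a binary character that switches between its two states at the same rate in both directions (a symmetric two-state Markov model). This specific model, the function name `prob_change`, and the rate of 1.0 are assumptions for illustration; the slides do not say which model was used.

```python
import math


def prob_change(rate, branch_length):
    """Probability that the state at the end of a branch differs from the start.

    Assumes a symmetric two-state Markov model in which each state
    changes to the other at the given rate per unit branch length.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * rate * branch_length))


# Short branches make a change very unlikely; long branches approach 50%
for t in (0.01, 0.1, 1.0, 10.0):
    print(f"branch length {t:5.2f}: P(change) = {prob_change(1.0, t):.3f}")
```

Under this model, two closely related species (short branches) are expected to share a state simply because there has been little time for change.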

Page 21:

Comparative methods 1: Build a phylogeny

[Figure: phylogeny annotated with short DNA sequences (GATC, GGTC, GGTT, AATC), nucleotide substitutions (A/G, C/T, G/A), and numeric rate values]

- this is a model of DNA sequence evolution

- the mutation rates are the parameters of the model (variables you estimate)

- based on DNA sequences of living species, we infer the most likely sequence of mutations that produced the observed data

Page 22:

Comparative methods 2: Model trait evolution

[Figure: phylogeny with trait states shown at tips and nodes; values shown: 0.5, 1.0]

- model of character trait evolution

- parameters of the model are the transition rates between character states (ways the trait can appear)

Legend: state 1, small; state 2, big

States shown for nodes are the most likely ancestral state for that clade, based on our model

Page 23:

Comparative methods 2: Model trait evolution

[Figure: phylogeny with trait states shown at tips and nodes; values shown: 0.5, 1.0]

- if the model is forced to allow a character change on a short branch, it is penalized (lower L score)

- same way that a likelihood phylogenetic analysis will penalize you for assuming a rare transversion happened

States shown for nodes are the most likely ancestral state for that clade, based on our model

Page 24:

Comparative methods 3: Test hypotheses

ancestral character state reconstruction is the process of inferring what the state was for a trait across the nodes on a phylogeny

- the model can infer the relative probability of two or more possible states, plotted as a pie diagram

at some nodes, probability of different ancestral states may be roughly equal, in which case we cannot necessarily infer what the ancestor looked like without a more formal test

we can be pretty confident the ancestor of these two species had state
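A minimal sketch of how such pie-diagram probabilities could arise, for the simplest case of one ancestor with two descendant species. It assumes a symmetric two-state model, equal prior probability for each ancestral state, and hypothetical branch lengths; the names `p_transition` and `ancestor_probs` are illustrative and not from any particular software package.

```python
import math


def p_transition(same, rate, t):
    """Probability of ending a branch of length t in the same (or the other) state."""
    p_diff = 0.5 * (1.0 - math.exp(-2.0 * rate * t))
    return 1.0 - p_diff if same else p_diff


def ancestor_probs(tip_a, tip_b, t_a, t_b, rate=1.0):
    """Relative probability of ancestral states 1 and 2 for two sister species.

    tip_a and tip_b are the observed states (1 or 2); t_a and t_b are the
    branch lengths leading from the ancestor to each species.
    """
    likelihoods = {}
    for anc in (1, 2):
        # Equal prior (0.5) times the probability of each observed tip state
        likelihoods[anc] = (
            0.5
            * p_transition(anc == tip_a, rate, t_a)
            * p_transition(anc == tip_b, rate, t_b)
        )
    total = sum(likelihoods.values())
    return {anc: lk / total for anc, lk in likelihoods.items()}


# Both species share state 1 on short branches: the ancestor was almost surely state 1
print(ancestor_probs(1, 1, 0.1, 0.1))
# Species differ on equal branches: the two ancestral states are equally probable
print(ancestor_probs(1, 2, 0.1, 0.1))
```

The first case corresponds to a node where one state dominates the pie; the second to a node where the states are roughly equal and a formal test is needed.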

Page 25:

To test alternative hypotheses, force the model to make a given node have one state; then estimate the overall likelihood score ($L$, a measure of goodness-of-fit, like $r^2$)

[Figure: alternative scenarios 1 and 2]

log(L1) = -7; log(L2) = -4

L score indicates the likelihood of observing your trait data given (a) the phylogeny, and (b) the model of trait evolution with the estimated parameter values

Page 26:

[Figure: alternative scenarios 1 and 2]

log(L1) = -7 (worse fit); log(L2) = -4 (better fit)

- L scores are reported as log(likelihood), i.e., as negative exponents, because L scores are usually tiny fractions

$L = 0.0000001 = 10^{-7}$, so $\log(L) = -7$; $L = 0.0001 = 10^{-4}$, so $\log(L) = -4$

note: a more negative log(L) is a smaller likelihood, and thus a worse fit to the data

Page 27:

The relative likelihood of two alternative models can be compared using the likelihood ratio:

$$\frac{\text{likelihood of data given Hypothesis 2}}{\text{likelihood of data given Hypothesis 1}} = \frac{L_2}{L_1}$$

[Figure: alternative scenarios 1 and 2]

log(L1) = -7; log(L2) = -4

Page 28:

Whether one model is a significantly better fit to the data can be determined using a chi-squared test, taking 2 × (difference in log(L) scores) as the test statistic

[Figure: alternative scenarios 1 and 2]

log(L1) = -7; log(L2) = -4

$\chi^2 = 2(\log L_2 - \log L_1) = 2[-4 - (-7)] = 6$; $P < 0.025$

[Figure label: favors this scenario]
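The calculation on this slide can be reproduced in a few lines. This sketch assumes the log(L) scores are natural-log likelihoods (which is what the chi-squared approximation is normally applied to) and that the test has 1 degree of freedom, which is consistent with the slide's P < 0.025 for a statistic of 6. The function name `likelihood_ratio_test` is illustrative.

```python
import math


def likelihood_ratio_test(log_l1, log_l2):
    """Return (test statistic, P-value) for comparing two log-likelihoods.

    Assumes natural-log likelihoods, 1 degree of freedom, and that
    log_l2 >= log_l1 (hypothesis 2 fits at least as well as hypothesis 1).
    """
    statistic = 2.0 * (log_l2 - log_l1)
    # Chi-squared upper-tail probability for 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


statistic, p_value = likelihood_ratio_test(-7.0, -4.0)
print(f"chi-squared = {statistic:.1f}, P = {p_value:.4f}")
```

A small P-value means the scenario with the higher likelihood fits the data significantly better than the alternative.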