Sequential Sampling Designs for Small-Scale Protein Interaction Experiments
Denise Scholtens, Ph.D.
Associate Professor, Northwestern University, Chicago IL
Department of Preventive Medicine, Division of Biostatistics
Joint work with Bruce Spencer, Ph.D.
Professor, Northwestern University, Evanston IL
Department of Statistics and Institute for Policy Research
Large Scale Protein Interaction Graphs
Fig. 4, Gavin et al. (2002) Nature
Top panel
  Nodes: protein complex estimates
  Edges: common members
Bottom panel
  Nodes: proteins
  Edges: complex co-membership (often called indirect interaction)
Often steady-state organisms
  E.g. Saccharomyces cerevisiae, various interaction types
  Gavin et al. (2002, 2006) Nature, Ho et al. (2002) Nature, Krogan et al. (2006) Nature, Ito et al. (1998) PNAS, Uetz et al. (2000) Nature, Tong et al. (2006) Science, Pan et al. (2006) Cell
Topology
  Modular organization into complexes/groups
    Bader et al. (2003) BMC Bioinformatics, Scholtens et al. (2005) Bioinformatics, Zhang et al. (2008) Bioinformatics, Qi et al. (2008) Bioinformatics
  Global characterization as small-world, scale-free, hierarchical, etc.
    Watts and Strogatz (1998) Nature, Barabási and Albert (1999) Science, Sales-Pardo et al. (2007) PNAS
Measurement Error
  False positive/negative probabilities
    Chiang et al. (2007) Genome Biology, Chiang and Scholtens (2009) Nature Protocols
Mostly large graphs
  100s-1000s of nodes
  1000s-10,000s of edges
Sampled data
AP-MS data capture bait-prey relationships: a bait finds ‘interacting’ prey with common membership in at least one complex
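[Figure: one AP-MS 'pull-down', showing a bait node linked to its prey nodes]
Three bait:prey pull-downs from Gavin et al. (2002)
  Apl5: Apl6, Apm3, Aps3, Ckb1
  Apl6: Apl5, Apm3, Eno2
  Apm3: Apl6, Apm3
[Figure: resulting graph on Apl5, Apl6, Apm3, Aps3, Ckb1 and Eno2; the legend distinguishes untested edges (marked '?') from tested-and-absent edges]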
Maximal cliques map to protein complexes: when all proteins are used as baits, all nodes have edges to all other nodes in the clique, and the clique is not contained in any other clique
NOTE: Failure to test all edges means we typically cannot identify maximal cliques
Inference using a portion of possible baits
[Figure: two protein complexes, {A, B, D, F} and {A, C, E}, with physical topologies shown by black edges]
If the AP-MS technology works perfectly (i.e. no false positives or false negatives)…
  6 baits (ABCDEF): 15 tested edges, 9 present, 6 absent
  1 bait (A): 5 tested edges, 5 present, 0 absent, 10 untested edges
  2 baits (AB): 9 tested edges, 7 present, 2 absent, 6 untested edges
  3 baits (ABC): 12 tested edges, 8 present, 4 absent, 3 untested edges
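The edge counts on this slide can be reproduced with a short sketch. It assumes the two complexes are {A, B, D, F} and {A, C, E} (read off the affiliation matrix shown later in the talk) and that AP-MS works without error; the function name `edge_counts` is purely illustrative.

```python
from itertools import combinations

# Assumed complex membership, taken from the affiliation matrix later in the talk.
complexes = [{"A", "B", "D", "F"}, {"A", "C", "E"}]
nodes = sorted(set().union(*complexes))

def edge_counts(baits):
    """Count tested/present/absent/untested node pairs for a given bait set."""
    baits = set(baits)
    counts = {"tested": 0, "present": 0, "absent": 0, "untested": 0}
    for u, v in combinations(nodes, 2):
        if u in baits or v in baits:
            # A pair is tested as soon as one of its two nodes is used as a bait.
            counts["tested"] += 1
            co_members = any(u in c and v in c for c in complexes)
            counts["present" if co_members else "absent"] += 1
        else:
            counts["untested"] += 1
    return counts

for baits in ["ABCDEF", "A", "AB", "ABC"]:
    print(baits, edge_counts(baits))
```

With three of the six nodes used as baits, only 3 of the 15 node pairs remain untested.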
Smaller-scale studies
What if we are interested only in a portion of the graph?
  Cataloguing complexes/describing the local neighborhood for a pre-specified set of starting baits
  Comparing local neighborhoods for different sample types
    disease vs. normal
    treated vs. untreated
[Figure legend: starting bait of interest; interesting neighbor; less interesting neighbor; uninteresting neighbor]
Link tracing designs (or snowball sampling)
  Start with a set of nodes as starting baits (S0)
  Identify interacting partners
  Use interacting partners as new set of baits, excluding those already used as baits
  Identify their interacting partners
  Etc.
[Figure: successive sampling waves S0, S1, S2, S3]
Link tracing notation
Adapted from Handcock and Gile (2010) Annals of Applied Statistics
  $n$: number of nodes
  $Y_{ij}$: characteristic of a pair of nodes, here a binary variable; $Y_{ij} = 1$ if the edge between nodes $i$ and $j$ exists and 0 otherwise
  $Y$: $n \times n$ matrix of $Y_{ij}$
  $S_m$: $n \times 1$ binary vector with $i$th entry equal to 1 if node $i$ is in the $m$th wave of sampled nodes (here, bait nodes) and 0 otherwise. Let $S_0$ be the initial sample and, in our case, let it be defined a priori.
  $C_m$: cumulative set of sampled nodes after $m$ waves, i.e. $C_m = \sum_{t=0}^{m} S_t$
Link tracing notation
  $D_m$: $n \times n$ matrix of tested edges in the graph, with $(i,j)$th element equal to 1 if the $i$th and/or $j$th element of $C_m$ equals 1. Note:
    $D_m = C_m 1^T + 1 C_m^T - C_m C_m^T$ for an undirected graph
    $D_m = C_m 1^T$ for a directed graph
  In traditional link-tracing designs, we define $S_m$ as follows:
    $S_m = [Y S_{m-1} \cdot (1 - C_{m-1}) > 0]$
  where $[x]$ is 1 if $x$ is true and 0 otherwise.
  So all nodes with an edge to a node in $S_{m-1}$ are included in $S_m$, unless they have been used in prior waves.
Link tracing notation
If we are satisfied with estimates of cliques with some untested edges, then define the following.
  $E_m$: $n \times 1$ binary vector for the set of nodes eligible to be sampled on wave $m$ ($m = 1, \ldots, k$)
  $\psi_m$: $n \times 1$ vector of selection probabilities for units at wave $m$
For the first wave, we can write
  $E_1 = [Y S_0 \cdot (1 - S_0) > 0]$ and
  $\Pr(S_1 = s_1 \mid E_1, \psi_1) = \prod_{i=1}^{n} (\psi_{1i} E_{1i})^{s_{1i}} \, ((1 - \psi_{1i}) E_{1i} + (1 - E_{1i}))^{(1 - s_{1i})}$
  for $s_1 \in \{0,1\}^n$.
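Link tracing notation
More generally, for wave $m$ ($m = 1, \ldots, k$),
  $E_m = [Y C_{m-1} \cdot (1 - C_{m-1}) > 0]$ and
  $\Pr(S_m = s_m \mid E_m, \psi_m) = \prod_{i=1}^{n} (\psi_{mi} E_{mi})^{s_{mi}} \, ((1 - \psi_{mi}) E_{mi} + (1 - E_{mi}))^{(1 - s_{mi})}$
  for $s_m \in \{0,1\}^n$.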
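The product form above can be evaluated directly. The following is a minimal sketch with a hypothetical eligibility vector and selection probabilities chosen only for illustration; `selection_probability` is not a name from the talk.

```python
from itertools import product

def selection_probability(s, E, psi):
    """Pr(S_m = s | E_m, psi_m): independent selection among eligible nodes."""
    prob = 1.0
    for s_i, e_i, psi_i in zip(s, E, psi):
        if s_i == 1:
            # Selected: only possible for eligible nodes (E_i = 1).
            prob *= psi_i * e_i
        else:
            # Not selected: either eligible and skipped, or not eligible at all.
            prob *= (1 - psi_i) * e_i + (1 - e_i)
    return prob

# Hypothetical wave: nodes 1 and 2 are eligible, each selected with probability 0.25.
E = [0, 1, 1, 0]
psi = [0.25, 0.25, 0.25, 0.25]

p_one = selection_probability([0, 1, 0, 0], E, psi)         # 0.25 * 0.75
p_ineligible = selection_probability([1, 0, 0, 0], E, psi)  # node 0 cannot be sampled
total = sum(selection_probability(s, E, psi) for s in product([0, 1], repeat=4))
print(p_one, p_ineligible, total)
```

Samples that include an ineligible node get probability zero, and the probabilities over all of {0,1}^n sum to one.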
A simple scheme
Let ψm remain constant over all sampling waves, e.g. choose a fixed proportion p of all eligible baits at each wave.
This leads to a simplification in the probability of observing a specific sample. In particular,
$\Pr(S_m = s_m \mid E_m, \psi_m) = \prod_{i=1}^{n} (p E_{mi})^{s_{mi}} \, (1 - p E_{mi})^{(1 - s_{mi})}$
Sampling 1/4 of all eligible baits…
S0 = {n1, n2, n3}
E1 = {n4, n6, n12, n13, n14, n15, n16, n17}
S1 = {n4, n12}
E2 = {n6, n13, n14, n15, n16, n17, n34, n35, n36, n37, n38, n59, n97, n98, n99, n100, n194}
S2 = {n15, n59, n97, n98, n99}
Etc…
Note that we do not cover all portions of the graph that we would with a full snowball sample.
Negative binomial
In this setting, the number of sampling rounds m needed to test (and therefore observe) a path of length l extending from one of the starting baits follows a negative binomial distribution (0 < l ≤ m).
$\Pr(\text{observing a path of length } l \text{ in } m \text{ rounds}) = \binom{m-1}{l-1} p^l (1-p)^{m-l}$, $m = l, l+1, \ldots$
Test all 3 nodes/edges in 3 rounds: p · p · p
Test 3 nodes/edges in 4 rounds: p · p · p · (1-p) for each placement of the one unsuccessful round among the first 3 rounds
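To make the formula concrete, here is a small sketch that evaluates it for the two cases on the slide, using p = 1/4 as in the earlier sampling example. The function names are illustrative only.

```python
from math import comb

def prob_path_tested(l, m, p):
    """Pr(a path of length l is fully tested at exactly round m), for m = l, l+1, ..."""
    if not 0 < l <= m:
        return 0.0
    return comb(m - 1, l - 1) * p**l * (1 - p) ** (m - l)

def prob_path_tested_by(l, m, p):
    """Cumulative probability that the path has been tested within m rounds."""
    return sum(prob_path_tested(l, t, p) for t in range(l, m + 1))

p = 0.25
three_rounds = prob_path_tested(3, 3, p)   # p^3
four_rounds = prob_path_tested(3, 4, p)    # 3 * p^3 * (1 - p)
print(three_rounds, four_rounds, prob_path_tested_by(3, 20, p))
```

The cumulative version shows how slowly longer paths are expected to be covered when p is small.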
Cumulative probabilities
The cumulative probability of observing a path is higher when its nodes are sampled early on than when they enter later.
When nodes are tightly grouped in cliques, this can lead to over-sampling in regions of the graph with high-confidence clique estimates.
I.e., we may be ‘satisfied’ with a clique estimate that has a certain proportion of tested edges, but if the involved nodes are identified early in the process, chances are they will eventually enter the sample… so how can we move on and sample other areas?
There is also great dependency among joint probabilities of testing any pair (or larger collection) of paths, especially among nodes with common paths extending from the starting baits.
Tested fraction of edges
  6 baits (ABCDEF): 15 tested edges (9 present, 6 absent)
  1 bait (A): 5 tested edges, so 1/3 of possible edges are tested
  2 baits (AB): 9 tested edges, 9/15 = 3/5 tested
  3 baits (ABC): 12 tested edges, 12/15 = 4/5 tested
In addition, we are interested in complexes with a certain proportion of tested edges out of those that are possible, not necessarily a proportion of tested baits (although they are related)
Edge imputation
Assume a simple edge imputation scheme in which untested edges are assumed to exist if the involved prey share at least one common bait.
This is consistent with high clustering coefficients observed for these types of graphs as well as existing clique estimation algorithms on partially observed graphs.
A complex (or clique) estimate may be considered ‘high quality’ if more than half of the involved edges are tested and observed.
  High quality: 9/15 = 0.6 edges observed
  Low quality: 13/28 = 0.46 edges observed
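The imputation rule can be sketched as follows, using the three Gavin et al. (2002) pull-downs listed earlier in the talk. The sketch treats edges as undirected, drops self-pairs, and assumes no measurement error; `impute_edges` is a hypothetical helper name.

```python
from itertools import combinations

def impute_edges(pulldowns):
    """Return (observed, imputed) undirected edges.

    pulldowns maps each bait to the set of prey it found.
    An untested prey-prey edge is imputed if both prey share a common bait.
    """
    baits = set(pulldowns)
    observed = {frozenset((b, q))
                for b, prey in pulldowns.items() for q in prey if q != b}
    imputed = set()
    for prey in pulldowns.values():
        # Only pairs of prey-only nodes are untested.
        prey_only = sorted(q for q in prey if q not in baits)
        for u, v in combinations(prey_only, 2):
            imputed.add(frozenset((u, v)))
    return observed, imputed

pulldowns = {
    "Apl5": {"Apl6", "Apm3", "Aps3", "Ckb1"},
    "Apl6": {"Apl5", "Apm3", "Eno2"},
    "Apm3": {"Apl6", "Apm3"},
}
observed, imputed = impute_edges(pulldowns)
print(len(observed), sorted(sorted(e) for e in imputed))
```

In this example only the Aps3 and Ckb1 pair is imputed, because Eno2 shares no bait with either of them.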
Tested fraction of edges
In a collection of nodes involving b baits and q prey-only nodes with no measurement error for edge observations, we have:
  b(b-1)/2 tested edges among baits
  bq tested edges among bait-prey pairs
  (b+q)(b+q-1)/2 possible edges among all nodes
So then the proportion of observed edges is
  $\frac{b(b-1) + 2bq}{(b+q)(b+q-1)}$
A modification: capturing dependency among nodes
[Figure: two protein complexes with physical topologies, and the corresponding AP-MS graph on nodes A to F]
Affiliation matrix A (nodes to cliques):
      c1 c2
  A    1  1
  B    1  0
  C    0  1
  D    1  0
  E    0  1
  F    1  0
Incidence matrix among nodes, Y = AA^T:
      A  B  C  D  E  F
  A   1  1  1  1  1  1
  B   1  1  0  1  0  1
  C   1  0  1  0  1  0
  D   1  1  0  1  0  1
  E   1  0  1  0  1  0
  F   1  1  0  1  0  1
Boolean algebra: 1+1 = 1*1 = 1+0 = 1; 0+0 = 0*0 = 0*1 = 0
Strata: Nodes with identical adjacency
[Figure: AP-MS graph on nodes A to F, with nodes colored by stratum]
Y =
      A  B  C  D  E  F
  A   1  1  1  1  1  1
  B   1  1  0  1  0  1
  C   1  0  1  0  1  0
  D   1  1  0  1  0  1
  E   1  0  1  0  1  0
  F   1  1  0  1  0  1
All nodes with matching colors on the previous slide are connected to each other, and have matching sets of adjacent nodes
In some sense, they contain ‘redundant’ information
And in a measurement error setting, extremely highly correlated information
If we know the strata, and we know the set of adjacent nodes for one member node, then we know the set of adjacent nodes for all other members of that stratum
For sampling purposes, it seems reasonable to represent these subpopulations by design
[Figure: AP-MS graph on nodes A to F, collapsed into strata BDF, A and CE]
A (affiliation matrix, nodes to cliques) and Y = AA^T (incidence matrix among nodes) are as before, with Boolean algebra: 1+1 = 1*1 = 1+0 = 1; 0+0 = 0*0 = 0*1 = 0
Affiliation matrix X (nodes to strata):
      g1 g2 g3
  A    1  0  0
  B    0  1  0
  C    0  0  1
  D    0  1  0
  E    0  0  1
  F    0  1  0
Affiliation matrix Q (strata to cliques):
      c1 c2
  g1   1  1
  g2   1  0
  g3   0  1
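The matrices on these slides can be rebuilt from A alone. The sketch below computes the Boolean product Y = AA^T, groups nodes with identical rows of Y into strata, and derives X and Q. Taking Q from the first member of each stratum is a choice made for this example, and the final check that A equals the Boolean product XQ is an observation about this example rather than a claim from the talk.

```python
def bool_matmul(P, R):
    """Boolean matrix product: entry is 1 if any pairwise product is 1."""
    return [[int(any(p and r for p, r in zip(row, col))) for col in zip(*R)]
            for row in P]

def transpose(M):
    return [list(r) for r in zip(*M)]

nodes = list("ABCDEF")
A = [[1, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]  # nodes to cliques c1, c2
Y = bool_matmul(A, transpose(A))

# Strata: nodes whose rows of Y (their sets of adjacent nodes) are identical.
strata = {}
for name, row in zip(nodes, Y):
    strata.setdefault(tuple(row), []).append(name)
groups = list(strata.values())

# X: nodes to strata. Q: strata to cliques, read from any one member of each stratum.
X = [[int(name in g) for g in groups] for name in nodes]
Q = [A[nodes.index(g[0])] for g in groups]

print(groups)
print(bool_matmul(X, Q) == A)
```

Grouping by identical rows works here because Y has ones on its diagonal, so connected nodes in the same stratum share the same row.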
Stratified sampling
The idea: use estimated strata to inform sampling
Maintain a constant fraction of tested edges within each estimated stratum
This will help identify strata and summarize their connectivity to other strata
It will also help focus our resources in areas that require more observations as opposed to those that have been adequately sampled according to some desired threshold for the fraction of tested edges
Stratified sampling
Testing at least half of the edges within a stratum with 10 member nodes:
  At least 3 baits are required
  Have 1 bait: choose 2 more baits
  Have 2 baits: choose 1 more bait
  Have 4 baits: don’t sample from this stratum (or do so with small probability)
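The bait counts in this example follow from the tested-fraction formula b(b-1) + 2bq over (b+q)(b+q-1) applied within a single stratum. A minimal sketch, assuming no measurement error and a hypothetical helper name `baits_needed`:

```python
from fractions import Fraction

def baits_needed(size, threshold, current_baits=0):
    """Additional baits needed for the tested fraction in a stratum to reach threshold."""
    if size < 2:
        return 0  # a single node has no within-stratum edges to test
    possible = size * (size - 1)
    for b in range(current_baits, size + 1):
        q = size - b  # remaining prey-only members
        if Fraction(b * (b - 1) + 2 * b * q, possible) >= threshold:
            return b - current_baits
    return size - current_baits

half = Fraction(1, 2)
print(baits_needed(10, half))                   # starting from no baits
print(baits_needed(10, half, current_baits=1))
print(baits_needed(10, half, current_baits=2))
print(baits_needed(10, half, current_baits=4))
```

With 2 baits among 10 nodes only 17 of 45 edges are tested, while 3 baits give 24 of 45, which is why 3 is the minimum here.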
Stratified sampling
While the strata and the fraction of tested edges within them determine the number of additional baits to include, the samples also include observations of edges connecting pairs of nodes in different strata
[Figure legend: tested edge within strata; tested edge between strata]
Stratified sampling
Algorithm:
Specify starting baits S0 and form E1
Impute edges among prey-only nodes with at least one common bait
Estimate strata according to matching adjacency in Y1 to form X1
Calculate fraction of tested edges for each stratum determined by X1
Determine number of additional baits required for each stratum and sample accordingly to form S1
Repeat
At each step k, we can also estimate Qk, Yk and/or Ak
A comparison: Threshold sampling
Similar to the simple random sampling scheme introduced earlier
Rather than specifying a set proportion of baits to test, sample the appropriate number to test a certain fraction of all possible edges in the graph given the identified nodes
Simulation: In silico Interactome
We used the ScISI Bioconductor package to create an ‘in silico interactome’ containing protein complex data reported in the Cellular Component Gene Ontology and at MIPS for Saccharomyces cerevisiae.
The largest connected component of the resultant graph contains 1404 nodes and 86609 edges.
197 protein complexes are represented with a range of sizes from 2 to 308 (median 18).
Simulation Study
Compared stratified (str) and threshold (thresh) sampling schemes
Specified tested fractions of 1/10 and 1/20 of all possible edges
Called a complex ‘high quality’ if at least 1/2 of the edges were tested
For each iteration, randomly chose 3 nodes with close proximity as starting baits
250 rounds for each scheme
Discussion
Large-scale protein interaction experiments are very costly and may not be of interest in smaller lab settings or for investigations of particular cellular functions
As long as we are comfortable with some estimation of untested edges, sampling identified prey to create the next bait set may yield considerable savings
Discussion
Using estimated sampling strata seems to provide a greater balance of resource allocation across the graph
Work still in progress suggests that this is due to a reduction in cumulative sampling variability across the graph
As long as the per-bait cost is less than the per-sampling-round cost, stratified sampling appears to be a better approach
Extensions
Measurement error can be easily included in specification of Em, and adaptations of clique identification (e.g. the penalized likelihood method in Bioconductor’s apComplex) can be used instead of straightforward imputation
This would also be a natural starting point for adaptively designing experiments to compare different sample types