
Windowed Granger causal inference strategy improves discovery of gene regulatory networks

Justin D. Finkle (a,1), Jia J. Wu (a,1), and Neda Bagheri (a,b,c,d,2)

(a) Interdisciplinary Biological Sciences, Northwestern University, Evanston, IL 60208; (b) Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208; (c) Center for Synthetic Biology, Northwestern University, Evanston, IL 60208; and (d) Chemistry of Life Processes, Northwestern University, Evanston, IL 60208

Edited by Douglas A. Lauffenburger, Massachusetts Institute of Technology, Cambridge, MA, and accepted by Editorial Board Member James J. Collins January 9, 2018 (received for review June 16, 2017)

Accurate inference of regulatory networks from experimental data facilitates the rapid characterization and understanding of biological systems. High-throughput technologies can provide a wealth of time-series data to better interrogate the complex regulatory dynamics inherent to organisms, but many network inference strategies do not effectively use temporal information. We address this limitation by introducing Sliding Window Inference for Network Generation (SWING), a generalized framework that incorporates multivariate Granger causality to infer network structure from time-series data. SWING moves beyond existing Granger methods by generating windowed models that simultaneously evaluate multiple upstream regulators at several potential time delays. We demonstrate that SWING elucidates network structure with greater accuracy in both in silico and experimentally validated in vitro systems. We estimate the apparent time delays present in each system and demonstrate that SWING infers time-delayed, gene–gene interactions that are distinct from baseline methods. By providing a temporal framework to infer the underlying directed network topology, SWING generates testable hypotheses for gene–gene influences.

gene regulatory networks | network inference | machine learning | Granger causality | time-series analysis

Elucidating gene–gene regulation is a fundamental challenge in molecular biology, and high-throughput technologies continue to provide insight about the underlying organization, or topology, of these interactions. Accurate network models representing genes (nodes) and regulatory interactions (edges) infer information from many observed heterogeneous components while minimizing the effects of noise and hidden nodes. Many methods infer gene regulatory networks (GRNs) from expression profiles (1), but each suffers from limitations—assumptions of linearity, univariate comparisons, or computational complexity—and most ignore temporal information in time-series data. Understanding the temporal dynamics of gene/protein expression is critical to elucidating responses involved in cell cycle, circadian rhythms, DNA damage, and development (2–5).

Existing methods to infer GRNs from time-series expression profiles include dynamical models, statistical approaches, and hybrids of the two (1, 6–8). Dynamical systems models of differential equations can forecast future system behaviors and characterize formal properties such as stability (9), but these models are computationally intractable for large GRNs due to extensive and explicit parameterization requirements (10). Statistical inference methods—such as regression schemes, mutual information, decision trees, and Bayesian probability (11–13)—make no explicit mechanistic assumptions and are often more computationally efficient than dynamical models. However, many implementations of aforementioned algorithms treat time points as independent observations, disregarding time delays associated with transcription, translation, and other processes inherent to gene regulation (14, 15). Hybrid methods—such as SINDy and Jump3—use statistical methods to optimize the search and parameterization of dynamical models, but they remain computationally expensive and rely on accurate specification of basis functions (16, 17).

If the experimental sampling interval is less than or equal to the time delay between a regulator and its downstream target, it is possible to use Granger causality to incorporate intrinsic delays that are often hidden from measurement (18). Current implementations of Granger causal network inference methods are limited: The inference (i) is conducted pairwise, prohibiting simultaneous assessment of multiple upstream regulators; (ii) has a single user-defined delay, which assumes a uniform delay between all regulators and their targets; or (iii) requires each explanatory variable, assessed at multiple delays, to be selected as a group (19–23). Thus, their implementation has limited broad utility in biological systems with heterogeneous time delays.

To allow for multiple time delays to affect downstream target nodes, we introduced an extensible framework to infer GRNs from time-series data, termed Sliding Window Inference for Network Generation (SWING). SWING embeds existing multivariate methods, both linear and nonlinear, into a Granger causal framework that concurrently considers multiple time delays to infer causal regulators for each node. SWING also uses sliding windows to create many sensitive, but noisy, inference models that are aggregated into a more stable and accurate network. We validated the efficacy of SWING on several in silico time-series datasets and existing in vitro datasets with corresponding gold standard networks. We show that SWING reconstructs networks more accurately than baseline methods and demonstrate that this performance boost is partly attributed to accurately inferring edges that involve an identifiable time delay between upstream regulators and targets. In validation studies analyzing networks derived from Escherichia coli and Saccharomyces cerevisiae, SWING inferred networks with distinct topologies and can therefore be combined with other methods to improve consensus models. The SWING framework is available for use and can be found on GitHub (https://github.com/bagherilab/SWING).

Significance

Discovery of gene regulatory networks (GRNs) is crucial for gaining insights into biological processes involved in development or disease. Although time-resolved, high-throughput data are increasingly available, many algorithms do not account for temporal delays underlying regulatory systems—such as protein synthesis and posttranslational modifications—leading to inaccurate network inference. To overcome this challenge, we introduce Sliding Window Inference for Network Generation (SWING), which uniquely accounts for temporal information. We validate SWING in both in silico and in vitro experimental systems, highlighting improved performance in identifying time-delayed edges and illuminating network structure. SWING performance is robust to user-defined parameters, enabling identification of regulatory mechanisms from time-series gene expression data.

Author contributions: J.D.F., J.J.W., and N.B. designed research; J.D.F. and J.J.W. performed research; J.D.F. and J.J.W. analyzed data; and J.D.F., J.J.W., and N.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. D.A.L. is a guest editor invited by the Editorial Board.

Published under the PNAS license.

1 J.D.F. and J.J.W. contributed equally to this work.

2 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1710936115/-/DCSupplemental.

Published online February 12, 2018.

2252–2257 | PNAS | February 27, 2018 | vol. 115 | no. 9 www.pnas.org/cgi/doi/10.1073/pnas.1710936115

Results

SWING integrates multivariate Granger causality and ensemble learning to infer interactions from gene expression data. First, SWING subdivides time-series data into several temporally spaced windows based on user-specified parameters (Fig. 1A). For each window, edges are inferred from the selected window and previous windows, representing interactions with specific delays. This inference results in a ranked list of time-delayed gene–gene interactions for each window (Fig. 1B). The ensemble of models is aggregated based on edge rank into a static GRN (Fig. 1C). In silico and in vitro validation confirmed notable performance improvements.
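The windowing and lag-pairing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the SWING package's implementation: the function names `make_windows` and `lagged_pairs`, the step size of one time point, and the 21-point series length are assumptions made for the example. A lag k is taken here to mean that the explanatory window starts k steps before the response window.

```python
from typing import List, Tuple


def make_windows(n_timepoints: int, width: int, step: int = 1) -> List[Tuple[int, int]]:
    """Return (start, stop) index pairs for every full-width window."""
    if not 1 <= width <= n_timepoints or step < 1:
        raise ValueError("need 1 <= width <= n_timepoints and step >= 1")
    return [(s, s + width) for s in range(0, n_timepoints - width + 1, step)]


def lagged_pairs(n_windows: int, k_min: int, k_max: int) -> List[Tuple[int, int, int]]:
    """Pair each response window with the windows k_min..k_max steps earlier.

    Each tuple is (response_window, explanatory_window, lag).
    """
    pairs = []
    for response in range(n_windows):
        for lag in range(k_min, k_max + 1):
            explanatory = response - lag
            if explanatory >= 0:  # skip lags that reach before the first window
                pairs.append((response, explanatory, lag))
    return pairs


# Hypothetical series of 21 time points with window width w = 10.
windows = make_windows(n_timepoints=21, width=10)
pairs = lagged_pairs(len(windows), k_min=1, k_max=3)
print(len(windows), "windows;", len(pairs), "response/explanatory pairs")
```

Each pair would then be handed to a base regressor (RF, LASSO, or PLSR) to score candidate edges at that lag, and the per-window edge rankings would be aggregated into one network.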

SWING Improves the Inference of in Silico GRNs. We applied SWING to reconstruct in silico GRNs simulated by GeneNetWeaver (GNW) (24). A total of 20 subnetworks with 10 nodes and nonisomorphic topologies were extracted from E. coli and S. cerevisiae networks included in GNW to use as gold standards. Networks were inferred from the generated time-series data by using existing multivariate methods as a basis for comparison. We used RandomForest (RF), least absolute shrinkage and selection operator (LASSO), and partial least-squares regression (PLSR) (11, 12, 25), which represent the areas of nonlinear, sparse, and PLS-based regression, respectively. We implemented the SWING chassis and compared the performance of each SWING front-line method with its base method: SWING-RF vs. RF, SWING-LASSO vs. LASSO, and SWING-PLSR vs. PLSR.

To capture short-term dynamics consistent with simulated perturbations, we set the window size to approximately half the duration of the time series. The minimum and maximum lags were set to kmin = 1 and kmax = 3, which correspond to 50 and 150 min. We compared the group of inferred networks by calculating the mean increase in the area under the precision-recall (AUPR) and area under the receiver operating characteristic (AUROC) curves of 40 in silico networks. Compared with respective baseline methods, SWING showed a statistically significant increase in AUROC and AUPR for many of the 10-node networks (Fig. 2A and SI Appendix, Table S1) and across all of the 100-node networks (SI Appendix, Fig. S1 and Table S1). In particular, RF received the most notable benefit from SWING; SWING-RF outperformed RF in 39 out of 40 in silico networks, and application of SWING-RF resulted in the highest mean AUROC and AUPR for in silico networks among tested methods.
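For readers who want to reproduce this kind of scoring, one common way to compute both metrics from a ranked edge list is sketched below. It assumes no tied scores and uses average precision as the estimator of AUPR; the text does not state which estimator was used, and the gene names and gold standard here are made up for illustration.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def label_ranking(ranked_edges: List[Edge], gold: Set[Edge]) -> List[int]:
    """Mark each ranked edge as 1 (in the gold standard) or 0 (not)."""
    return [1 if edge in gold else 0 for edge in ranked_edges]


def auroc(labels: List[int]) -> float:
    """Fraction of (true, false) edge pairs in which the true edge ranks higher."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false edge")
    correct = 0
    false_below = n_neg  # false edges ranked below the current position
    for is_true in labels:
        if is_true:
            correct += false_below
        else:
            false_below -= 1
    return correct / (n_pos * n_neg)


def average_precision(labels: List[int]) -> float:
    """Mean of the precision values measured at each true edge in the ranking."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one true edge")
    hits, total = 0, 0.0
    for rank, is_true in enumerate(labels, start=1):
        if is_true:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy ranking (most confident edge first) and a toy gold standard.
ranked = [("G2", "G1"), ("G3", "G1"), ("G1", "G3"), ("G3", "G2")]
gold_standard = {("G2", "G1"), ("G1", "G3")}
labels = label_ranking(ranked, gold_standard)
print(auroc(labels), average_precision(labels))
```

A random ranking gives an expected AUROC of 0.5 and an expected AUPR close to the fraction of candidate edges that are true, which is why the AUPR baseline depends on network density.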

SWING Infers Distinct Edges in Networks. No single method performs optimally across all datasets, partially due to biases in predicting different network topologies. For example, E. coli-derived networks predominately feature fan-out motifs, which RF infers with greater sensitivity. In contrast, S. cerevisiae-derived networks contain more cascade motifs, which are inferred with greater sensitivity by linear methods (14).

To determine if SWING methods provide distinct information from RF, LASSO, and PLSR (R/L/P), we ran principal component analysis (PCA) on ranked edge lists predicted by SWING and the corresponding base methods (Fig. 2B). We discarded PC1 because it largely explains the overall performance of each inference method (58% variance explained; SI Appendix, Fig. S2). Clustering of results in PC2 and PC3 seemed to explain biases toward specific network motifs (14). Along PC2, edge rankings appeared to separate based on the internal base method (15% variance explained), while along PC3, SWING edge rankings appeared to separate from those of their base methods (5% variance explained). These results suggest that SWING recovered connectivities that were distinct from those recovered from R/L/P.

Fig. 1. Overview of the SWING framework. (A) Time-series data are divided into windows with a user-specified width, w. (B) For each window, inference is performed by iteratively selecting response and explanatory genes. The subset of available explanatory genes is defined by the minimum and maximum user-allowed time delays. (C) Edges from each window model are aggregated into a single network representation of the biological interactions between measured variables.

Fig. 2. SWING improves inference of 10-node in silico networks. (A) Changes in AUPR and AUROC in GNW networks. Score changes to individual networks are shown in gray. The mean (red) and median (black) of each score distribution is shown. AUPR and AUROC increase when using SWING-RF or -PLSR compared with their respective base method. SWING-LASSO outperforms LASSO in the E. coli-derived networks. The expected score based on random for each metric is shown as a dashed line. n = 20 networks, kmin = 1, kmax = 3, and w = 10 for all networks. P values were calculated by using the Wilcoxon signed-rank test, ***P < 0.001; **P < 0.01; *P < 0.05. (B) SWING and non-SWING methods are grouped according to similarity of ranked predictions for 40 10-node in silico networks via PCA. PC1 largely separates inference methods based on performance (SI Appendix, Fig. S2), while PC2 separates methods based on underlying base method. Networks inferred by various SWING parameter selections cluster together according to inference type, with SWING methods forming clusters distinct from corresponding base methods.

Given that it is difficult to determine a priori which methods perform optimally in different contexts, deriving a community network is a good strategy for robustly improving predictions (14). We evaluated the performance of SWING-Community, which combines SWING-RF, -LASSO, and -PLSR predictions by calculating the mean rank across all methods for each possible edge. We note that SWING-Community outperformed RF, resulting in a 52% and 8% mean increase in AUPR and AUROC, respectively, suggesting that SWING infers distinct and complementary networks (SI Appendix, Fig. S3).
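The mean-rank aggregation behind SWING-Community can be illustrated as follows. The sketch assumes every method ranks the same set of candidate edges; the function name `community_ranking`, the toy edges, and the tie-breaking rule are illustrative assumptions rather than details from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (regulator, target)


def community_ranking(rankings: Dict[str, List[Edge]]) -> List[Tuple[Edge, float]]:
    """Order edges by their mean rank across methods (rank 1 = most confident)."""
    methods = list(rankings)
    if not methods:
        raise ValueError("need at least one ranking")
    edge_set = set(rankings[methods[0]])
    positions = {}
    for name in methods:
        if set(rankings[name]) != edge_set:
            raise ValueError(f"{name} ranks a different set of edges")
        positions[name] = {edge: i + 1 for i, edge in enumerate(rankings[name])}
    mean_rank = {
        edge: sum(positions[name][edge] for name in methods) / len(methods)
        for edge in edge_set
    }
    # Sort by mean rank; ties are broken alphabetically (an arbitrary choice).
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


# Toy rankings from three methods over the same three candidate edges.
a, b, c = ("G1", "G2"), ("G2", "G3"), ("G3", "G1")
toy_rankings = {
    "SWING-RF": [a, b, c],
    "SWING-LASSO": [b, a, c],
    "SWING-PLSR": [a, c, b],
}
community = community_ranking(toy_rankings)
for edge, rank in community:
    print(edge, round(rank, 2))
```

Averaging ranks rather than raw scores sidesteps the fact that RF importances, LASSO coefficients, and PLSR scores are not on a common scale.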

SWING Improves Network Inference by Promoting Time-Delayed Edges. Endogenous reactions, such as protein translation, posttranslational modifications, translocation, or oligomerization, are often not accounted for in GRNs. However, even if underlying network kinetics are linear (or approximately linear), the resulting dynamics can appear delayed when not all nodes are observed (SI Appendix, Fig. S4A). Delayed behavior in gene expression and protein translation has been established in several studies (26, 27).

We estimated the apparent time delay of each interaction in a 10-node GNW network by calculating the pairwise peak cross-correlation between time series of all true regulator and target combinations. The majority of true interactions within GNW networks had a time delay between 0 and 150 min (SI Appendix, Fig. S4B). We observed that SWING was more likely to promote edges with an identifiable delay within the range of user-specified parameters (SI Appendix, Fig. S5A). Across all in silico networks, SWING-RF promoted 65.8% of true edges with a delay vs. 55.4% of true edges without a delay (P = 0.018), and SWING-PLSR promoted 67.0% of true edges with a delay vs. 47.1% of true edges without a delay (P = 6.00e-6) (SI Appendix, Fig. S5B).
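A minimal sketch of lag estimation by peak cross-correlation is shown below. It assumes evenly sampled series, scans only non-negative lags (regulator leading target), and picks the lag with the largest absolute Pearson correlation so that repressive edges are also captured. The helper names and toy series are hypothetical, and the 50-min sampling interval is taken from the GNW setup described earlier.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_lag(regulator: List[float], target: List[float], max_lag: int) -> Tuple[int, float]:
    """Return (lag, r) with the largest |r| between regulator[t] and target[t + lag]."""
    best = (0, pearson(regulator, target))
    for lag in range(1, max_lag + 1):
        # Drop the last `lag` regulator points and the first `lag` target points.
        r = pearson(regulator[:-lag], target[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Toy pulse in the regulator that reappears in the target two samples later.
regulator = [0, 1, 4, 9, 4, 1, 0, 0, 0, 0]
target = [0, 0, 0, 1, 4, 9, 4, 1, 0, 0]
sampling_interval_min = 50  # assumed spacing between time points

lag, r = best_lag(regulator, target, max_lag=3)
print(f"lag = {lag} time points ({lag * sampling_interval_min} min), r = {r:.3f}")
```

Shifting the target back by the estimated lag and recomputing the correlation is the same check used to illustrate individual delayed edges.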

Many of the promoted edges with an identifiable delay were highly ranked by base methods RF and PLSR. In general, delayed true edges ranked in the first quartile by the base method were likely to be promoted, while those ranked lower were no more likely to be promoted than nondelayed true edges (SI Appendix, Fig. S5B). While SWING was more likely to promote true edges with a delay, the magnitude of this promotion was not consistent across the different base methods or networks. SWING-RF promoted true edges with an apparent time delay by an average of 7.50 ranks relative to true edges without an apparent time delay (P = 4.75e-3) for S. cerevisiae-derived networks. In contrast, SWING-PLSR promoted true edges with an apparent delay by an average of 7.78 ranks relative to true edges without an apparent time delay (P = 6.89e-5) for E. coli-derived networks (SI Appendix, Fig. S5B). In one example, S. cerevisiae network 12, SWING-RF improved the AUROC from 0.539 to 0.872, a 61.7% increase relative to the base method. Compared with RF, the edge ranking for SWING-RF promoted many true edges, and all of the true edges with a delay were promoted by SWING (SI Appendix, Fig. S6A).

To demonstrate how SWING promoted delayed edges, we highlighted the true edge between gene 2 (G2) and gene 1 (G1) in S. cerevisiae network 12. G2 is the only node upstream of G1, and the input data included an experiment where only G2 was perturbed; thus, the delay between G2 stimulation and G1 response was unambiguously isolated (SI Appendix, Fig. S7A). We estimated the delay between G2 and G1 as two time points, or 100 min. We shifted the G1 time series by two time points to show that the Pearson correlation of the resulting time series notably increases (SI Appendix, Fig. S6B).

SWING Infers Apparent Time-Delayed Edges with Greater Sensitivity in the E. coli SOS Network. We applied SWING to an in vitro eight-node E. coli GRN that activates with DNA damage (20, 28). The SOS network contains several complex interactions, including multiple cascades and feedback loops generated by a combination of transcriptional activators and repressors. We computed the mean of three replicates for each time point following DNA damage induced by norfloxacin treatment (29).

The sampling strategy for the in vitro SOS data is different from that of the in silico GNW data. Due to fewer time points, we were restricted to assessing interactions with shorter possible time delays. Using w = 0.5·T = 7, kmin = 0, and kmax = 1, SWING-RF inferred the network more accurately than other reported inference algorithms including RF, LASSO, TSNI (29), and BANJO (30). Because RF is a stochastic method, we ran both RF and SWING-RF 50 times on the SOS network. On average, SWING-RF increased the AUPR from 0.286 to 0.356 (24.6%, P = 1.41e-13) and the AUROC from 0.756 to 0.819 (8.3%, P = 5.28e-34). To assess promotion of time-delayed edges, we calculated the mean edge ranks across all 50 runs and compared the resulting lists. Although SWING-RF demoted some true edges, it promoted all three edges that exhibited a time delay (Fig. 3A). We highlighted the edge between lexA and umuDC (SI Appendix, Fig. S7B), which had an estimated lag of 6 min. When the umuDC time series was shifted by this amount, the correlation between lexA and umuDC increased from 0.709 to 0.928 (Fig. 3B). These findings reaffirmed that SWING improves network inference, in part, by promoting edges with identifiable delays.

Fig. 3. SWING promotes edges with apparent time delays and increases correlation between genes. The true network structure is provided in SI Appendix, Fig. S7B. (A) Edge rank comparison for E. coli SOS network when using RF and SWING-RF (blue, promoted edges; red, demoted edges; black, no change; gray, false edges; green, lexA → umuDC analyzed in B). We report the lag for edges with an apparent time delay. (B, Upper) Time series for lexA and umuDC show better alignment when umuDC is shifted by one time period, (B, Lower) which improves correlation between the genes.

SWING Accurately Infers RegulonDB Modules with Time-Delayed Edges. We curated microarray data to infer time-delayed edges from experimentally validated GRNs in E. coli (Fig. 4A) and S. cerevisiae (SI Appendix, Fig. S8). These curated data were aggregated across 18 datasets for E. coli and 8 datasets for S. cerevisiae, where data were unevenly sampled for time intervals that ranged from 5 to 120 min (SI Appendix, Table S2). To assess the landscape of apparent time delays present in these gene expression data, we performed pairwise cross-correlation lag selection between experimentally confirmed edges (31). We reveal that of 2,870 experimentally confirmed edges, only 23.7% exhibited an apparent time delay of 0, and 13.7% exhibited a time delay of at least 10 min. Surprisingly, only 37.4% of confirmed edges exhibited pairwise correlation (R > 0.7, P < 1e-5; Fig. 4A).

Fig. 4. Application of SWING on time-delayed GRN modules in E. coli. (A) Circular diagram depicts experimentally validated interactions and gene ontologies present in each module (RegulonDB). Blue edges depict time-delayed interactions inferred by using pairwise cross-correlation from curated microarray data. (B) SWING-Community, with w = 4, kmin = 1, kmax = 1 applied to RegulonDB subnetworks that are and are not enriched with time-delayed edges (fraction of delayed edges is >10%, n = 12 subnetworks; fraction of delayed edges is <10%, n = 14 subnetworks). (C) SWING-Community and R/L/P ensemble method applied to tdcABC regulon, which is the module found to have the highest enrichment of time-delayed edges (44% edges with a time delay of 10 min or greater).

Fig. 4A module legend (ID, ontology; in the order listed in the figure): 0, unassigned; 1, putrescine catabolic process; 2, iron ion homeostasis; 3, metal ion binding; 4, rhamnose metabolic process; 5, chemotaxis; 6, purine nucleotide biosynthetic process; 7, cellular response to DNA damage stimulus; 8, organic phosphonate catabolic process; 9, glycogen metabolic process; 10, tryptophan catabolic process; 11, cellular amino acid biosynthetic process; 12, unassigned; 13, intracellular ribonucleoprotein complex; 14, DNA replication; 15, arginine biosynthetic process; 16, propionate catabolic process, 2-methylcitrate cycle; 17, drug transmembrane transport; 18, bacterial-type flagellum organization; 19, leucine biosynthetic process; 20, galactose metabolic process; 21, D-galacturonate catabolic process; 22, carbohydrate transport; 23, L-threonine catabolic process to propionate; 24, unassigned; 25, oxidation-reduction process; 26, response to copper ion; 27, sulfate assimilation; 28, nucleoside transport; 29, fatty acid metabolic process; 30, oxidation-reduction process; 31, unassigned; 32, D-gluconate metabolic process; 33, cellular amino acid biosynthetic process; 34, aromatic amino acid family biosynthetic process.

To determine whether lag is associated with modularity and function, we clustered the E. coli and S. cerevisiae networks into smaller modules using MCODE (32) and performed gene ontology enrichment analysis. Several modules, such as those associated with catabolic processes and metal ion binding, were enriched with time-delayed edges of at least 10 min (SI Appendix, Tables S3 and S4). Transcription factors known to regulate genes on a global or combinatorial scale tend to exhibit similar time delays (SI Appendix, Table S5).

To determine if SWING more accurately infers network structure in diverse contexts, we performed cubic spline interpolation to generate evenly sampled time-series gene expression at 10-min intervals and, using this dataset, benchmarked SWING-Community performance against an ensemble of the R/L/P base methods for each clustered module. SWING-Community outperformed R/L/P in subnetworks in which >10% of edges were time-delayed (n = 26 clusters; 9 clusters with <10 genes or <3 transcription factors were removed from analysis; P = 0.031; Fig. 4B). As an example, we identified time-delayed properties of key regulators of the tdcABC E. coli operon that are responsible for the transport of threonine and serine during anaerobic growth (33). In particular, our analysis identified two global transcription factors that bind combinatorially to induce activity in the tdcABC operon. Crp and fnr are global regulators that respond to glucose starvation and anaerobic growth, respectively (34, 35).
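The resampling step described above can be illustrated with a minimal sketch. It assumes a natural cubic spline (zero second derivative at both ends); the boundary condition actually used is not stated in this section, and the function names `natural_cubic_spline` and `resample` as well as the example values are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system for the interior second derivatives m[1..n-1];
    # the natural boundary condition fixes m[0] = m[n] = 0 (an assumption).
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
           for i in range(1, n)]
    # Thomas algorithm, forward elimination.
    for j in range(1, n - 1):
        factor = h[j] / diag[j - 1]
        diag[j] -= factor * h[j]
        rhs[j] -= factor * rhs[j - 1]
    # Back substitution.
    m = [0.0] * (n + 1)
    for j in range(n - 2, -1, -1):
        m[j + 1] = (rhs[j] - h[j + 1] * m[j + 2]) / diag[j]

    def evaluate(t):
        # Locate the segment [xs[i], xs[i + 1]] that contains t.
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        a, b = xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def resample(times, values, step=10.0):
    """Interpolate an unevenly sampled series onto an even grid (same units)."""
    spline = natural_cubic_spline(times, values)
    count = int((times[-1] - times[0]) // step) + 1
    grid = [times[0] + k * step for k in range(count)]
    return grid, [spline(t) for t in grid]


# Hypothetical expression values for one gene, sampled unevenly (minutes).
grid, even_values = resample([0.0, 5.0, 15.0, 30.0, 60.0],
                             [1.0, 1.4, 2.3, 2.0, 1.2], step=10.0)
```

In practice a library routine such as `scipy.interpolate.CubicSpline` would normally replace the hand-written solver; the pure-Python version is shown only so that the sketch has no dependencies.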

Interestingly, lag analysis identified 10- and 20-min time delays between crp and target genes in the E. coli tdcABC operon. While the precise delay identified by our analysis was not consistent with that observed in experiments, studies confirmed that a delay existed between crp induction and the induction of several target genes (36). This delay can possibly be attributed to posttranslational modification of crp (37). Of 32 edges in the gold standard, SWING identified 27 true-positive (TP) edges and 5 false-positive (FP) edges (85% TP), while the ensemble model predicted 24 TP edges and 8 FP edges (75% TP). In this example, SWING-Community inferred both time-delayed and non-time-delayed edges more sensitively than the R/L/P ensemble model. The FP edges inferred by SWING-Community were also within the subset of FP edges inferred by the base community method.

SWING Performance Is Robust Across Parameters. SWING adds user-defined parameters to baseline methods, which are necessary for window creation and time-delay inference. The selection of these parameters was both context- and data-specific. We conducted parametric sensitivity analysis of SWING as a function of window size, combinations of kmin and kmax, and experimental sampling interval in the context of the in silico networks and the E. coli SOS network (SI Appendix, Figs. S9–S14). While SWING outperformed baseline methods over a wide range of window sizes (SI Appendix, Fig. S9), the performance of a single network may differ from other networks, suggesting that the optimal window size is partially dependent on the underlying inference method and network structure. Therefore, user-specified SWING parameters (kmin, kmax, and w) should be chosen based on the data and are discussed in detail in SI Appendix, Sensitivity Analysis. Overall, SWING outperforms baseline methods for a wide range of possible parameters (SI Appendix, Figs. S9–S13).
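To make the roles of w, kmin, and kmax concrete, the sketch below shows one plausible way to pair each response window with lagged explanatory windows. The layout is an assumption made for illustration (the authoritative description is in SI Appendix, SI Materials and Methods, and the SWING repository), and the function name `lagged_windows` is hypothetical.

```python
def lagged_windows(series, w, k_min, k_max):
    """Yield (start, X, Y) for every response window of width w.

    series: list of time points, each a list of per-gene expression values.
    Y is the response window starting at index `start`.
    X stacks, column-wise, the windows shifted back by k_min..k_max steps,
    so each row of X holds n_genes * (k_max - k_min + 1) candidate predictors.
    """
    n_times = len(series)
    # The earliest response window must leave room for the largest lag.
    for start in range(k_max, n_times - w + 1):
        Y = [list(row) for row in series[start:start + w]]
        X = []
        for r in range(w):
            features = []
            for k in range(k_min, k_max + 1):
                features.extend(series[start - k + r])
            X.append(features)
        yield start, X, Y


# Toy data: 6 time points, 2 genes; parameters as in Fig. 4B (w = 4, kmin = kmax = 1).
toy = [[t, 10 + t] for t in range(6)]
for start, X, Y in lagged_windows(toy, w=4, k_min=1, k_max=1):
    print(start, len(X), len(X[0]), len(Y))
```

Each (X, Y) pair would then be handed to the chosen base method (RF, LASSO, or PLSR), which is why larger values of kmax widen the explanatory matrix and raise the cost of every fit.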

Discussion

Tight regulation of gene expression is critical to maintaining robust responses to perturbations and environmental disturbances, and misregulation of intracellular signaling dynamics can lead to a wide variety of diseases. For this reason, uncovering the topology of GRNs is of fundamental interest to the scientific community, since the resulting maps can be used to identify interventions to control cellular phenotypes. Many current methods disregard temporal information and are limited in their ability to accurately infer network topology. Indifference to time delays will be the Achilles heel of many systems biology strategies. We developed a general temporal framework for network inference that accurately uncovers the regulatory structures governing complex biological systems by accounting for these fundamental delays. SWING improves upon existing Granger methods by generating an ensemble of windowed models that simultaneously evaluate multiple upstream regulators at several potential time delays. We validated its utility and performance in several in silico (Fig. 2A) and in vitro (Figs. 3 and 4B) systems.

Consideration of Time Delays Improves SWING Performance and Should Be Integrated in Experimental Design. Our in silico and in vitro results demonstrate that promoted edges were enriched for those with apparent time delays (SI Appendix, Fig. S5B), suggesting that network inference was improved, in part, by accounting for temporal information. We supported this finding by demonstrating that SWING-RF promotes an edge with a distinct and singular delay (SI Appendix, Fig. S6A). We also used SWING to predict directed edges of several E. coli subnetworks using cubic spline interpolated microarray datasets. Through cross-correlation analysis, we estimated time-delayed interactions in in silico, E. coli, and S. cerevisiae networks, and showed that SWING performed better than baseline methods in modules with more frequent time-delayed edges, such as the tdcABC regulon.

Interestingly, the apparent time delay only partially explained improved performance, as SWING also promoted edges without apparent time delays in in silico and in vitro networks. This discrepancy may have arisen from our conservative approach for identifying time delays; a more liberal approach could assign time delays to a greater fraction of the promoted edges. However, it is particularly challenging to estimate time delays for genes with multiple regulators by using cross-correlation. More complex algorithms that incorporate additional information (i.e., nonlinearity and partial correlation) could improve time-delay estimation between regulators and targets (38).
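As a simplified illustration of pairwise lag estimation, the sketch below shifts a target series against a regulator series and keeps the lag with the largest absolute Pearson correlation. It is not the procedure used in this study, which applied additional thresholds and windowing described in SI Appendix; the function names and the toy series are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def estimate_lag(regulator, target, max_lag):
    """Return (lag, r): the number of samples by which target trails regulator.

    Lags 0..max_lag are tried; the first lag with the largest |r| wins.
    Multiply the lag by the sampling interval (e.g., 10 min) to get a delay.
    """
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        r = pearson(regulator[:len(regulator) - lag], target[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


# Toy example: the target copies the regulator two samples later.
regulator = [0, 1, 4, 2, 7, 3, 9, 5, 6, 8, 1, 0, 3, 2]
target = [0, 0] + regulator[:-2]
print(estimate_lag(regulator, target, max_lag=4))
```

Because each regulator is scored in isolation, a target driven by several regulators can show a diluted or misleading peak, which is the difficulty noted above for pairwise cross-correlation.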

An additional consideration involves interactions that occur faster than the sampling interval. These interactions will not exhibit a delay in the time series and will resist inference and estimation of time delay regardless of methodology. This bottleneck can be managed by designing experiments with shorter sampling intervals. The choice of sampling interval is context-specific, and we recommend sampling with sufficient frequency to capture dynamics of interest.

SWING Outperforms Common Network Inference Algorithms Across Scales. SWING outperforms common network inference algorithms (R/L/P) but is limited by computational expense. Since SWING constructs a larger explanatory matrix and executes multivariate comparisons between multiple time delays, it is more expensive than the aforementioned methods. Fortunately, SWING is trivially parallelizable and can be implemented on any multicore processing system. We evaluated similarly derived 100-node in silico networks and found that SWING increased the AUPR and AUROC for all three methods (SI Appendix, Fig. S1), including SWING-LASSO, which showed no significant difference for the 10-node networks (Fig. 2A). Remarkably, every single network was inferred with greater accuracy, indicating that SWING has notable benefits for larger inference tasks (SI Appendix, Fig. S1 and Table S1).
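The parallelism mentioned above follows from the fact that each windowed model can be fit independently of the others. The sketch below illustrates the pattern only: `fit_window` is a hypothetical stand-in (a single least-squares slope) rather than one of the R/L/P models, and a thread pool is used so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_window(window):
    """Stand-in for one windowed model: least-squares slope of y on x."""
    xs, ys = window
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx if sxx else 0.0


def fit_all(windows, workers=4):
    """Fit every window independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_window, windows))


# Two toy windows of (predictor, response) values.
toy_windows = [([0, 1, 2, 3], [0, 2, 4, 6]), ([0, 1, 2, 3], [3, 2, 1, 0])]
print(fit_all(toy_windows))
```

For CPU-bound fits in pure Python, `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface and is the usual choice for using multiple cores; the aggregation of per-window results is unaffected by which executor is used.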

SWING Is an Extensible Framework. Compared with other time-delayed inference algorithms, SWING is a flexible and extensible framework that is not limited to using a single statistical method. The SWING framework was implemented with R/L/P; it can be easily expanded to use other multivariate inference algorithms, including those that use prior information and heterogeneous data types (39). Additional improvements can be made by incorporating complex weighting of methods for consensus analysis that leverage known weaknesses and biases of inference methods. Methods that involve empirical optimization of combination weights, such as those assessed in the DREAM challenge, are expected to substantially improve SWING performance (40).
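One simple form that weighted consensus could take is a weighted mean of per-method edge ranks, sketched below. This is an illustrative assumption and not the aggregation implemented in SWING-Community; the function name, the edge labels, and the weights are hypothetical.

```python
def consensus_ranking(rankings, weights=None):
    """Combine per-method edge rankings by weighted mean rank.

    rankings: dict mapping method name -> list of edges, most confident first.
    weights:  optional dict mapping method name -> non-negative weight.
    Returns all edges sorted by weighted mean rank (lower is better).
    """
    methods = list(rankings)
    if weights is None:
        weights = {m: 1.0 for m in methods}
    total = sum(weights[m] for m in methods)
    edges = set().union(*rankings.values())
    score = {}
    for edge in edges:
        acc = 0.0
        for m in methods:
            order = rankings[m]
            # An edge that a method did not rank is placed after its last rank.
            rank = order.index(edge) + 1 if edge in order else len(order) + 1
            acc += weights[m] * rank
        score[edge] = acc / total
    # Ties are broken by edge label so that the output is deterministic.
    return sorted(edges, key=lambda e: (score[e], e))


toy_rankings = {
    "RF": [("g1", "g2"), ("g2", "g3"), ("g1", "g3")],
    "LASSO": [("g2", "g3"), ("g1", "g2"), ("g1", "g3")],
    "PLSR": [("g1", "g2"), ("g1", "g3"), ("g2", "g3")],
}
print(consensus_ranking(toy_rankings))
print(consensus_ranking(toy_rankings, weights={"RF": 1, "LASSO": 4, "PLSR": 1}))
```

With equal weights this reduces to plain rank averaging; empirically optimized weights, as discussed above, would enter only through the `weights` argument.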

Although we implemented SWING to infer interactions from gene expression data, the same Granger causality principles can be applied to a wide variety of contexts with temporal dynamics. Provided sufficient time-series data, we expect SWING to identify regulatory relationships in related intracellular signaling pathways, as well as broader fields such as ecology, social sciences, and economics. As the sensitivity/specificity of experimental tools increases and the cost of implementation decreases, we expect longer and higher-resolution time-series data to become widely available. We expect this increase in time resolution to further improve the accuracy of SWING-based network inference, especially as the community continues to build on the SWING chassis. The SWING framework, with currently implemented methods, is available on GitHub (https://github.com/bagherilab/SWING).

Materials and Methods

The SWING algorithm is described in detail in SI Appendix, SI Materials and Methods, including parameter selection, management of time-series data and window creation, model aggregation, and graph generation. In silico simulations and in vitro data aggregation are also described in SI Appendix, SI Materials and Methods. The sensitivity of SWING performance as a function of user-defined parameters is described in SI Appendix, SI Sensitivity Analysis.

ACKNOWLEDGMENTS. This research was supported, in part, by Biotechnology Training Program Grant T32 GM008449 (to J.D.F.), NIH National Heart, Lung, and Blood Institute Award F31HL134331-02 (to J.J.W.), NSF CAREER Award CBET-1653315 (to N.B.), the Quest high performance computing facility, and the McCormick School of Engineering at Northwestern University.

1. Ciaccio MF, Finkle JD, Xue AY, Bagheri N (2014) A systems approach to integrative biology: An overview of statistical methods to elucidate association and architecture. Integr Comp Biol 54:296–306.

2. Nagoshi E, et al. (2004) Circadian gene expression in individual fibroblasts: Cell-autonomous and self-sustained oscillators pass time to daughter cells. Cell 119:693–705.

3. Spellman PT, et al. (1998) Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9:3273–3297.

4. Geva-Zatorsky N, et al. (2006) Oscillations and variability in the p53 system. Mol Syst Biol 2:2006.0033.

5. Jiang YJ, et al. (2000) Notch signalling and the synchronization of the somite segmentation clock. Nature 408:475–479.

6. Madar A, Greenfield A, Ostrer H, Vanden-Eijnden E, Bonneau R (2009) The Inferelator 2.0: A scalable framework for reconstruction of dynamic regulatory network models. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society, 2009 (IEEE, Piscataway, NJ), 5448–5451.

7. van Someren EP, et al. (2006) Least absolute regression network analysis of the murine osteoblast differentiation network. Bioinformatics 22:477–484.

8. Aijo T, Granberg K, Lahdesmaki H (2013) Sorad: A systems biology approach to predict and modulate dynamic signaling pathway response from phosphoproteome time-course measurements. Bioinformatics 29:1283–1291.

9. Zak DE, Gonye GE, Schwaber JS, Doyle FJ (2003) Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: Insights from an identifiability analysis of an in silico network. Genome Res 13:2396–2405.

10. Raue A, Becker V, Klingmuller U, Timmer J (2010) Identifiability and observability analysis for experimental design in nonlinear dynamical models. Chaos 20:045105.

11. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 73:267–288.

12. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS One 5:e12776.

13. Lawrence ND, Sanguinetti G, Rattray M (2007) Modelling transcriptional regulation using Gaussian Processes. Advances in Neural Information Processing Systems 19, eds Scholkopf PB, Platt JC, Hoffman T (MIT Press, Cambridge, MA), pp 785–792.

14. Marbach D, et al. (2012) Wisdom of crowds for robust gene network inference. Nat Methods 9:796–804.

15. Haury AC, Mordelet F, Vera-Licona P, Vert JP (2012) Tigress: Trustful inference of gene regulation using stability selection. BMC Syst Biol 6:145.

16. Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc Natl Acad Sci USA 113:3932–3937.

17. Huynh-Thu VA, Sanguinetti G (2015) Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics 31:1614–1622.

18. Granger CW (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37:424–438.

19. Lozano AC, Abe N, Liu Y, Rosset S (2009) Grouped graphical granger modeling for gene expression regulatory networks discovery. Bioinformatics 25:i110–i118.

20. Zoppoli P, Morganella S, Ceccarelli M (2010) Timedelay-aracne: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics 11:154.

21. Shojaie A, Michailidis G (2010) Discovering graphical granger causality using the truncating lasso penalty. Bioinformatics 26:i517–i523.

22. Petralia F, Wang P, Yang J, Tu Z (2015) Integrative random forest for gene regulatory network inference. Bioinformatics 31:i197–205.

23. Quinn CJ, Coleman TP, Kiyavash N, Hatsopoulos NG (2011) Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J Comput Neurosci 30:17–44.

24. Schaffter T, Marbach D, Floreano D (2011) GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics 27:2263–2270.

25. Ciaccio MF, Chen VC, Jones RB, Bagheri N (2015) The dionesus algorithm provides scalable and accurate reconstruction of dynamic phosphoproteomic networks to reveal new drug targets. Integr Biol 7:776–791.

26. Gedeon T, Bokes P (2012) Delayed protein synthesis reduces the correlation between mRNA and protein fluctuations. Biophysical J 103:377–385.

27. McAdams HH, Arkin A (1997) Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA 94:814–819.

28. Ronen M, Rosenberg R, Shraiman BI, Alon U (2002) Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics. Proc Natl Acad Sci USA 99:10555–10560.

29. Bansal M, Della Gatta G, di Bernardo D (2006) Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22:815–822.

30. Morshed N, Chetty M, Xuan Vinh N (2012) Simultaneous learning of instantaneous and time-delayed genetic interactions using novel information theoretic scoring technique. BMC Syst Biol 6:62.

31. Boker SM, Xu M, Rotondo JL, King K (2002) Windowed cross-correlation and peak picking for the analysis of variability in the association between behavioral time series. Psychol Methods 7:338–355.

32. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2.

33. Ganduri YL, Sadda SR, Datta MW, Jambukeswaran RK, Datta P (1993) TdcA, a transcriptional activator of the tdcABC operon of Escherichia coli, is a member of the Lysr family of proteins. Mol Gen Genet 240:395–402.

34. Shimada T, Fujita N, Yamamoto K, Ishihama A (2011) Novel roles of cAMP receptor protein (CRP) in regulation of transport and metabolism of carbon sources. PLoS One 6:e20081.

35. Crack J, Green J, Thomson AJ (2004) Mechanism of oxygen sensing by the bacterial transcription factor fumarate-nitrate reduction (FNR). J Biol Chem 279:9278–9286.

36. Kao KC, Tran LM, Liao JC (2005) A global regulatory role of gluconeogenic genes in Escherichia coli revealed by transcriptome network analysis. J Biol Chem 280:36079–36087.

37. Davis R, et al. (2018) An acetylatable lysine controls CRP function in E. coli. Mol Microbiol 107:116–131.

38. Runge J, et al. (2015) Identifying causal gateways and mediators in complex spatio-temporal systems. Nat Commun 6:8502.

39. Studham ME, Tjarnberg A, Nordling TE, Nelander S, Sonnhammer ELL (2014) Functional association networks as priors for gene regulatory network inference. Bioinformatics 30:i130–i138.

40. Hill SM, et al. (2016) Inferring causal molecular networks: Empirical assessment through a community-based effort. Nat Methods 13:310–318.

Finkle et al. PNAS | February 27, 2018 | vol. 115 | no. 9 | 2257
