arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

17
Deep learning reveals hidden interactions in complex systems Seungwoong Ha and Hawoong Jeong * Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea (Dated: November 13, 2020) Rich phenomena from complex systems have long intrigued researchers, and yet modeling system micro-dynamics and inferring the forms of interaction remain challenging for conventional data- driven approaches, being generally established by human scientists. In this study, we propose AgentNet, a model-free data-driven framework consisting of deep neural networks to reveal and analyze the hidden interactions in complex systems from observed data alone. AgentNet utilizes a graph attention network with novel variable-wise attention to model the interaction between individual agents, and employs various encoders and decoders that can be selectively applied to any desired system. Our model successfully captured a wide variety of simulated complex systems, namely cellular automata (discrete), the Vicsek model (continuous), and active Ornstein–Uhlenbeck particles (non-Markovian) in which, notably, AgentNet’s visualized attention values coincided with the true interaction strength and exhibited collective behavior that was absent in the training data. A demonstration with empirical data from a flock of birds showed that AgentNet could identify hidden interaction ranges exhibited by real birds, which cannot be detected by conventional velocity correlation analysis. We expect our framework to open a novel path to investigating complex systems and to provide insight into general process-driven modeling. I. INTRODUCTION Complex systems are collections of interactive agents that exhibit non-trivial collective behavior. They have gathered a significant amount of research interest in the last several decades in a wide variety of academic fields from spin systems to human societies. In particular, the domain of physics mainly focuses on investigating the micro-level processes that govern emergent behav- ior in complex systems and modeling them mathemati- cally. The Vicsek model[1] is a representative example of such approaches, which attempted to explain collective behaviors of active matter like a bird flock with mini- mal microscopic description. Unfortunately, due to the intrinsic complexity of these systems, extracting hidden micro-dynamics from the observed data of an unknown complex system is virtually infeasible in most cases. Al- though conventional process-driven modeling is intelligi- ble and provides the conceptual framework, its applica- tion to complex systems, to date, still strongly relies on human intuition with various prior assumptions. To overcome these obstacles, data-driven modeling (DDM), a methodology that finds a relationship between state variables or their time evolution from observed data, has emerged as a powerful tool for system analysis alongside the emergence of machine learning and large- scale data. In previous literature [2–10], DDM was em- ployed to discover hidden parameters or dynamics from data in an automated manner. Particularly, active mat- ter modeling greatly relies on DDM by first designing a model with intuition from observed data and then per- forming parameter fitting to match the data [11–15], al- * Also at Center for Complex Systems, Korea Advanced In- stitute of Science and Technology, Daejeon 34141, Korea; [email protected] though many of them suffer from sparse, noisy, or dis- continuous observation data. Among various DDM techniques, deep neural networks (DNNs) have recently shown phenomenal performance in pattern recognition and function approximation. One specialized DNN variant for graph-structured data is the graph neural network (GNN) [16], which models depen- dencies between linked agents on a graph and has en- abled remarkable progress in graph analysis. Similar to [17], one may depict a complex system as a dynamically changing graph in which each vertex is an agent, with links between agents indicating interactions. In this ap- proach, the problem of modeling the micro-dynamics of single agents becomes equivalent to properly inferring the effect from other agents on a graph and estimat- ing the state transition of each agent at the next time step. Several attempts have been separately made to employ GNNs in the prediction and analysis of specific complex systems and physical models [18–26], but these approaches are mostly limited to the verification of a sin- gle system or a small number of agents, and more signif- icantly, it remains difficult to interpret the characteris- tics of the interaction due to the neural network’s noto- rious black-box nature. Recently, graph attention net- works (GATs) [27] and its applications showed a path to interpretable GNN by assigning attention to important neighbors, but this attention value cannot be directly in- terpreted with a physical meaning. For instance, in a multi-dimensional system, the interaction strength can- not be a scalar value since each directional state variable possesses its own interaction range and strength. Inspired by these recent attempts, we introduce Agent- Net, a generalized neural network framework based on GATs with a novel attention scheme to model various complex systems in an physically interpretable manner. AgentNet approximates the transition function of the states of individual agents by training the neural net- arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

Transcript of arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

Page 1: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

Deep learning reveals hidden interactions in complex systems

Seungwoong Ha and Hawoong Jeong∗

Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea(Dated: November 13, 2020)

Rich phenomena from complex systems have long intrigued researchers, and yet modeling systemmicro-dynamics and inferring the forms of interaction remain challenging for conventional data-driven approaches, being generally established by human scientists. In this study, we proposeAgentNet, a model-free data-driven framework consisting of deep neural networks to reveal andanalyze the hidden interactions in complex systems from observed data alone. AgentNet utilizesa graph attention network with novel variable-wise attention to model the interaction betweenindividual agents, and employs various encoders and decoders that can be selectively applied toany desired system. Our model successfully captured a wide variety of simulated complex systems,namely cellular automata (discrete), the Vicsek model (continuous), and active Ornstein–Uhlenbeckparticles (non-Markovian) in which, notably, AgentNet’s visualized attention values coincided withthe true interaction strength and exhibited collective behavior that was absent in the training data.A demonstration with empirical data from a flock of birds showed that AgentNet could identifyhidden interaction ranges exhibited by real birds, which cannot be detected by conventional velocitycorrelation analysis. We expect our framework to open a novel path to investigating complex systemsand to provide insight into general process-driven modeling.

I. INTRODUCTION

Complex systems are collections of interactive agentsthat exhibit non-trivial collective behavior. They havegathered a significant amount of research interest in thelast several decades in a wide variety of academic fieldsfrom spin systems to human societies. In particular,the domain of physics mainly focuses on investigatingthe micro-level processes that govern emergent behav-ior in complex systems and modeling them mathemati-cally. The Vicsek model[1] is a representative example ofsuch approaches, which attempted to explain collectivebehaviors of active matter like a bird flock with mini-mal microscopic description. Unfortunately, due to theintrinsic complexity of these systems, extracting hiddenmicro-dynamics from the observed data of an unknowncomplex system is virtually infeasible in most cases. Al-though conventional process-driven modeling is intelligi-ble and provides the conceptual framework, its applica-tion to complex systems, to date, still strongly relies onhuman intuition with various prior assumptions.

To overcome these obstacles, data-driven modeling(DDM), a methodology that finds a relationship betweenstate variables or their time evolution from observeddata, has emerged as a powerful tool for system analysisalongside the emergence of machine learning and large-scale data. In previous literature [2–10], DDM was em-ployed to discover hidden parameters or dynamics fromdata in an automated manner. Particularly, active mat-ter modeling greatly relies on DDM by first designing amodel with intuition from observed data and then per-forming parameter fitting to match the data [11–15], al-

∗ Also at Center for Complex Systems, Korea Advanced In-stitute of Science and Technology, Daejeon 34141, Korea;[email protected]

though many of them suffer from sparse, noisy, or dis-continuous observation data.

Among various DDM techniques, deep neural networks(DNNs) have recently shown phenomenal performancein pattern recognition and function approximation. Onespecialized DNN variant for graph-structured data is thegraph neural network (GNN) [16], which models depen-dencies between linked agents on a graph and has en-abled remarkable progress in graph analysis. Similar to[17], one may depict a complex system as a dynamicallychanging graph in which each vertex is an agent, withlinks between agents indicating interactions. In this ap-proach, the problem of modeling the micro-dynamics ofsingle agents becomes equivalent to properly inferringthe effect from other agents on a graph and estimat-ing the state transition of each agent at the next timestep. Several attempts have been separately made toemploy GNNs in the prediction and analysis of specificcomplex systems and physical models [18–26], but theseapproaches are mostly limited to the verification of a sin-gle system or a small number of agents, and more signif-icantly, it remains difficult to interpret the characteris-tics of the interaction due to the neural network’s noto-rious black-box nature. Recently, graph attention net-works (GATs) [27] and its applications showed a path tointerpretable GNN by assigning attention to importantneighbors, but this attention value cannot be directly in-terpreted with a physical meaning. For instance, in amulti-dimensional system, the interaction strength can-not be a scalar value since each directional state variablepossesses its own interaction range and strength.

Inspired by these recent attempts, we introduce Agent-Net, a generalized neural network framework based onGATs with a novel attention scheme to model variouscomplex systems in an physically interpretable manner.AgentNet approximates the transition function of thestates of individual agents by training the neural net-

arX

iv:2

001.

0253

9v4

[co

nd-m

at.s

tat-

mec

h] 1

2 N

ov 2

020

Page 2: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

2

𝑠𝑖𝑡 = 𝑓

𝑖≠𝑗

𝑞

𝑤𝑖𝑗𝑞𝑣𝑗 , ℎ𝑠𝑒𝑙𝑓

(𝑥, 𝑦, 𝑣𝑥 , 𝑣𝑦)𝒔𝑡

𝑠𝑖𝑡

𝑠𝑗𝑡

(Δ𝑥, Δ𝑦, Δ𝑣𝑥, Δ𝑣𝑦)𝚫𝒔𝑡

𝑠𝑖𝑡+1

𝑠𝑗𝑡+1

𝑘𝑖𝑡

𝑞𝑖𝑡𝑣𝑖𝑡

Encoder

𝑘𝑗𝑡

𝑞𝑗𝑡

𝑣𝑗𝑡

Variable-wise

Neural Attention

𝛼𝑖𝑗𝑥,𝑡

(Separated)

Variable-wise

Decoder

𝑖≠𝑗

AgentNet

Complex system

(1) Aware of neighbors (4) Decide transition(2) Consider strength

×××

×𝛼𝑖𝑗𝑦,𝑡

𝛼𝑖𝑗𝑣𝑥,𝑡

𝛼𝑖𝑗𝑣𝑦,𝑡

MLP

⊕𝑢𝑡

𝑢𝑡

𝑢𝑡+1

𝒔𝒕+𝟏 = 𝒔𝒕 + 𝚫𝒔𝑡

Key

Query

Value

𝑥𝑦𝑣𝑥𝑣𝑦

(3) Get influenced

, ,

,

FIG. 1. Overview of system formulation and the neural network architecture of the proposed AgentNet. The correspondencebetween the decision rule of agents in a complex system and a forward pass of AgentNet is depicted. In both panels, the statevariable of each agent sti interacts with the state variables of other agents stj in Ri with interaction strength αt

ij . The graphattention core learns Ri with transformer architecture by encoding sti into key kti , query qti , and value vti , and then calculatesthe weighted sum of the values of other agents vtj according to the variable-wise attention weight αq,t

ij as computed by neuralattention. Different from GATs, AgentNet assigns attention value for each state variable, and decode it separately to strictlyimpose the information of variable-wise interaction strength. Other functions, namely hself and f , can be captured by bothencoder and decoder modules.

work to predict the future state variables. Due to therich functional expressibility of DNNs, which is practi-cally unconstrained [28, 29], AgentNet poses minimumprior assumptions about the unknown nature of the tar-get agents. Our model jointly learns the interactionstrength that affects each variable’s transition and over-all transition function from observed data in an end-to-end manner without any human intervention or manualoperation. This is a critical difference from the conven-tional approach with GATs, which only assigns a singleattention value per agent while our model assigns com-pletely independent attention values for every state vari-able and employing separate decoders for each of them.We found that our variable-wise attention achieves bet-ter performances over GATs, and enables more extensivephysical interpretation for the first time that was impossi-ble for conventional GATs such as identifying directionalforces separately. Also, the visualization and inspectionof the inner modules as granted by our framework en-ables a clear interpretation of the trained model, whichalso provides insights for process-driven modeling. As aprediction model, a trained AgentNet can generate anindividual level of state predictions from desired initialconditions, making AgentNet an outstanding simulatorof target systems including even those that exhibit col-lective behavior that was absent in the training data.

First, we show the spontaneous correspondence be-tween the complex system and the structure of AgentNetby providing formulations of both systems. The capabil-

ity of AgentNet is thoroughly demonstrated here via datafrom simulated complex systems: cellular automata [30],the Vicsek model [1], and the active Ornstein–Uhlenbeckparticle (AOUP) [31] model, along with application toreal-world data comprising trajectories in a flock of birds[32] containing more than 1800 agents in a single in-stance, greatly exceeding the previous range of neuralnetwork approaches [17, 33, 34] which treated at mostseveral dozens of agents. For the simulated systems,we show that each component of AgentNet learns pre-dictable and tractable parts of the expected transitionfunction by comparing extracted features with ground-truth functions. For the bird flock where the exact ana-lytical expression of the system is completely unknown,AgentNet successfully provides the interaction range ofa bird, which is physiologically plausible and coincideswith previous behavioral studies about the bird [35, 36].

II. SYSTEM FORMULATION

In this paper, we focus on a general agent-based sys-tem consisting of n agents for which the state of eachagent until time T is (at least partially) identified andobserved. The basic premise of the agent-based systemis that the agent with the same state variable follows thesame decision rule, and the interaction strength betweentwo agents can be fully expressed by their state variable.

Page 3: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

3

TABLE I. System formulation is applied to simulated model systems. Here, h =∑

j hj , hx = {x directional component of h},{hy = y directional component of h}, and |Ri| denote the number of elements of set Ri. r(sti, s

tj) represents the distance

between two agents’ positions, while θ(sti, stj) represents the respective angle of the jth agent to the ith agent. For AOUP, t+ 1

becomes t+ dt since the original model is goverened by continuous differential equations.

System Cellular automata Vicsek model Active Ornstein–Uhlenbeck Particle

sti {xti, yti , cti} {xti, yti , vtxi, vtyi} {xti, yti}

ut None None R

Ri {aj ∈ A|r(sti, stj) ≤√

2} {aj ∈ A|r(sti, stj) < rc, |θ(sti, stj)| < θc} A− {ai}

wq wc =

{1 if aj ∈ Ri

0 if aj /∈ Ri

wx, wy =

{1 if aj ∈ Ri

0 if aj /∈ Ri

wx, wvx ∝− 3(xti − xtj)e−r(sti,s

tj)

3/R3

R3

wy, wvy ∝− 3(yti − ytj)e−r(sti,s

tj)

3/R3

R3

hself {cti} {vtx,i, vty,i} {xti, yti}

f

∆xt+1i = 0

∆yt+1i = 0

∆ct+1i = δcti=0δh=3

− δcti=1(1− δh=2)(1− δh=3)

∆xt+1i = vt+1

x,i

∆yt+1i = vt+1

y,i

∆vt+1x,i = (hx / (|Ri|+ 1)) +N (0, σ)

∆vt+1y,i = (hy / (|Ri|+ 1)) +N (0, σ)

∆xt+dti = vt+1

x,i

∆yt+dti = vt+1

y,i

∆vt+dtx,i = hxdt+

√2γT

γN (0,

√dt) + fx,i

∆vt+dty,i = hydt+

√2γT

γN (0,

√dt) + fy,i

∆f t+dtx,i = −dt/τf t

x,i +√U2

0 τTN (0,√dt)

∆f t+dty,i = −dt/τf t

y,i +√U2

0 τTN (0,√dt)

This implies that any two agents with the same statevariables should be interchangeable without altering theoutcome.

We denote the set of all n agents as A ={a1, a2, . . . , an} and the corresponding observed statevariables of all agents at time t as St = {st1, st2, . . . , stn},where each state consists of k state variables sti ={sti,1, sti,2, . . . , sti,k}. In addition, the system might havej number of time-dependent global external variablesut = {ut1, ut2, . . . , utj} that affect agent interaction, suchas temperature in a thermodynamic system. For simplic-ity, we abbreviate the set of time series vectors from t tot−m, namely [St,St−1, ...St−m], as St,m.

Generally, agent modeling of a complex system aimsto identify the transition function of its constituentsthrough time steps, which can be written as

St+1 = St + ∆St+1

= St + F (St,m,ut,m) (1)

where m is the maximum lag for the system output andF is an overall function that could be deterministic orstochastic. If we focus on the state difference of an in-dividual agent, we can split the overall function F intoindvidual transition function f and get

∆st+1i = f(st,mi ,St,m

i,ut,m) (2)

where

∆St+1 = [f(st,m1 ,St,m1

,ut,m), . . . , f(st,mn ,St,mn ,ut,m)]

(3)

and St,mi

indicates that the ith agent’s state vector

st,mi is omitted from St,m.In this study, we assume that the system is mainly

dominated by pairwise interactions and higher-order in-teractions are negligible. Alleviation of this assumptionwill be discussed in the Conclusion. This means thatEq. (2) becomes

∆st+1i = f(hself(s

t,mi ,ut,m),

∑i 6=j

hpair(st,mi , st,mj ,ut,m))

(4)

where hself(st,mi ,ut,m) denotes self-interaction and

hpair(st,mi , st,mj ,ut,m) captures the pairwise interaction

between the ith and jth agents along with the effect ofut,m. We note that this generalized formulation encom-passes the transition functions of various fundamentalsystems such as the Monte Carlo simulation of the Isingmodel [37], the voter model [38], systems governed byNewtonian dynamics, and phase space dynamics drivenby the Liouville equation

Although Eq. (4) sums up the interaction with ev-ery agent except itself, not every agent is relevant tothe transition function of a single agent in a generalcase. Every agent ai may have its own interaction rangeRi = {aj ∈ A | ai interacts with aj} that can change

Page 4: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

4

TABLE II. Models to test the performance of AgentNet and their respective characteristics. The Xs indicate the oppositecharacteristics: discrete, deterministic, Markovian, and simulated data, respectively.

System Continuity Stochasticity Memory effect Empirical data Interaction Remarks

Cellular automata 7 7 7 7 Discrete -

Vicsek model 3 3 7 7 Discrete Generalization

Active OU particle 3 3 3 7 Continuous Collective phenomena

Chimney swift flock 3 3 3 3 Unknown Missing data

depending on the current state of the agent, and only asubset (or possibly the entire set) of agents belonging toRi should be considered. Furthermore, each state vari-able might be affected by different interaction strengths,e.g. exerting force Fx and Fy can be generally different.Hence, we define the variable-wise interaction strengthfunction between two agents as wq

ij(st,mi st,mj ,ut,m) ≥ 0

that outputs the q-th state variable’s interaction magni-tude of the ith agent, induced by the jth agent. Now, wq

ijcan be separated from the pairwise interaction functionh(st,mi , st,mj ,ut,m) to explicitly indicate the variable-wiseinteraction strength between agents, as follows:

hpair(st,mi , st,mj ,ut,m) =

∑q

wqij(s

t,mi , st,mj ,ut,m)vtj(s

t,mj ).

(5)Note that leftover function vj conveys information

solely from the jth agent without loss of generality. Sofar, we have decomposed an individual transition func-tion into four parts; variable-wise interaction strengthfunction wq, leftover function v, self-interaction functionhself, and overall function f . We note that our formu-lation aptly applies to a physical system governed byforce dynamics by interpreting wq as the magnitude ofa component of an exerting force vector, while the left-over function vector v contains directional information.

In most cases, the exact analytic forms of all thesefunctions (wq, v, hself, f) are completely unknown, andit is infeasible to elicit these functions from observeddata alone. Especially, blindness to variable-wise in-teraction strength function wq significantly complicatesthis inverse problem since we have to test every possiblecombination of neighbor candidates while simultaneouslyguessing the correct nonlinear functional form of v, g,and f . The problem becomes harder if the system hastime-correlation because it expands the range of possiblycorrelated variable pairs further out in the time dimen-sion. To sum up, many of the current methodologies arenot capable of DDM for complex systems without strongprior assumptions regarding the functional form. Theproposed framework, AgentNet, successfully tackles thisconundrum by employing DNNs to jointly learn all of theaforementioned functions by constructing correspondingneural modules for each of the functions and backpropa-gating errors from state predictions.

Our formulation of agent-based complex systems isshown in Fig. 1 with corresponding modules in Agent-

Net: the value vector of the transformer captures the self-interaction hself and leftover information v; variable-wiseattention weight αq captures the interaction magnitudewq; and the weighted sum along with the decoder corre-sponds to the overall function f . This formulation canexpress all of the model systems used in this study, asdescribed in Table I.

III. MODELS AND METHODS

AgentNet is a generalized framework for the data-driven modeling of agent-based complex systems, cover-ing most previous works and reinforced with several mod-ifications. The base module of AgentNet, a graph atten-tion module, is similar to a GAT [27, 39] with transformerarchitecture [40], where each agent decides its next stateby putting information from itself and the attention-weighted sum of other agents together. AgentNet ini-tially operates on a fully-connected graph, implying thatit initially assumes every agent as a possible neighborand gradually learns the true interaction partners andstrength through training. Our model first encodes thestate variables of agent st with an encoder, then passesthe information to the transformer which computes theimpact from the entire system state st, and finally de-codes the outcome with a decoder to obtain the statedifference.

In most cases, complex systems have diverse character-istics that are difficult to incorporate into a single mod-eling framework. As a universal framework, AgentNetresolves this diversity by modifying the encoder and de-coder and setting a proper optimization function to fitparticular system characteristics while maintaining thecore module of the network. In this way, AgentNet ad-dresses a variety of system characteristics such as conti-nuity of state variables, stochasticity of transition func-tion, and memory effects.

First, AgentNet can handle various types of state vari-ables by minimizing cross-entropy for discrete variablesand the mean squared error for continuous variables.Second, when the decision rule of a target system isstochastic, there are several ways to construct a neu-ral network with probabilistic output [41–44]. AgentNetemploys a Gaussian neural network [43] as the decoderof the stochastic AgentNet, which produces means andvariances of multiple univariate Gaussian distributions.Lastly, some of the collective phenomena in complex sys-

Page 5: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

5

AEpoch 1

B

Epoch 1 Epoch 40 Epoch 120 Epoch 200

C

FIG. 2. Result of AgentNet for cellular automata. (A) Attention weight transition of a single target cell throughout thetraining. In the initial stage, the model has no information about the interaction range and assigns near-zero values to all of thecells in the system. Attention gets narrowed down to a smaller region as training advances, and finally concentrates on eightsurrounding cells, which is the theoretical interaction range. (B) Attention weight αc of neighbors and outside cells during 200epochs of training. The attention weight of neighbor cells increases as training proceeds, while the weight of other cells remains0. Data is averaged from 100 test samples. (C) AgentNet with respect to given alive (left) and dead microstates (right). Thetotal number of alive cells in the neighborhood is denoted by s, which is the sole parameter of the CA decision rule.

tems appear in non-Markovian settings where past statesaffect the future state. In this study, we use long short-term memory (LSTM) models for the encoder and de-coder of AgentNet to capture (potential) memory effectsin the system.

The graph attention module in AgentNet explicitly as-signs variable-wise importance αq

ij by first constructing

the attention coefficient αqij from encoded data e(st) and

applying the sigmoid function to normalize the scale (seeAppendix A). We note that the choice of the sigmoidfunction is crucial because unlike most previous litera-ture [23, 27, 39] where a softmax normalization betweenagents was used (αij =

aij∑k exp(aik) ), here we aim to infer

the absolute variable-wise interaction magnitude with-out normalization among the agents. Also, further dif-fering from conventional approaches for attention coeffi-cients such as additive [45] and multiplicative [39] mech-anisms, attention coefficients in AgentNet are calculatedby multi-layer perceptrons (MLPs) (Att), which enablesmuch more flexible representations (See Appendix C andFig. 7 for the advantages of neural attention). By virtueof variable-wise separated decoder, attention weights αq

ijonly affects to q-th variable, thus one can identify in-teraction strengths for each variable by visualizing pre-dicted attention weights. Note that this is different fromwidely-known multi-headed attention since it feeds con-

catenated output into a single decoder while AgentNetdoes not concatenate the output and strictly separateseach decoder in order to impose a variable-wise transi-tion function for each attention value, not a mixed over-all transition function. In short, AgentNet clarifies theattention scheme from an unknown function of interac-tion strengths to physically interpretable variable-wisestrength. Our study is the first in-depth demonstrationof the capability of this form of graph attention scheme,achieved by comparing the attention weights for eachvariable to the ground-truth interaction strengths in var-ious simulated complex systems.

IV. RESULTS

This study utilizes three representative complex sys-tems to demonstrate the capacity of AgentNet, alongwith one empirical dataset for framework evaluation.Table II summarizes the characteristics of the modelsystems with an escalating level of complexity. All ofthe code for model training and system simulation hasbeen deposited in [46] (see Appendix A for models andbaselines implementations, and Appendix B for detaileddataset descriptions and the optimization functions forAgentNet training).

Page 6: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

6

A B C

FIG. 3. Result of AgentNet for the Vicsek model. Initially, agents are randomly distributed in a circular region with radiusR =

√5, without any boundary condition. (A) Attention weight visualization of two sample cells, a1 and a2. Both cases show

a circular sector of attention distribution with clear boundaries that perfectly matches with ground-truth interaction range.(B) Averaged attention weight for variables x and y before and after training. The fully trained AgentNet learned to identifythe neighbor agents and ignore the irrelevant others by assigning near-one and near-zero attention weights, respectively. (C)Position predictions by AgentNet for the two sample cells a1 and a2. Circles indicate the starting positions of the two particles,with the two heatmaps showing the AgentNet prediction along with the means of predicted distributions (Xs). The modelpredicts the expected theoretical distribution (crosses) with great precision, even when the given training samples (green andblue stars) are distant from the means of the theoretical distribution.

A. Cellular Automata

First, we verify AgentNet with an older yet fundamen-tal system with rich phenomena, the cellular automata(CA) model. In the CA model, each cell has its owndiscrete state, either alive or dead. Each cell interactswith its eight adjacent neighbors, and the state of eachcell evolves according to the following two rules. First,a live cell stays alive if two or three neighbor cells arealive. Secondly, a dead cell becomes alive if exactly threeneighbor cells are alive. Thus, the interaction strength ofCA can be expressed as an indicator function 1Ri

whereits value is 1 if aj ∈ Ri and 0 otherwise.

We simulate CA data in the form of a 14 × 14 grid ofcells with initially randomized states, and the state of thegrid after a single time step becomes the target label foreach data. AgentNet for CA receives three state variablesfrom each cell: positions xt and yt, and cell state ct. Theoutput here is a list of expected probabilities that eachcell becomes alive. We use the binary cross-entropy lossfunction between the AgentNet output and the ground-truth label.

Figure 2 summarizes the results of AgentNet for CA,depicting the cell state attention weight αc of the targetcell (in this case, the 102nd cell) across the entire grid.After 120 epochs, AgentNet quickly realized that a vastmajority of the cells are irrelevant to the target cell, andthereafter concentrated its attention to a more compactregion; Fig 2B shows that AgentNet gradually learns tofocus on neighbor cells only. AgentNet was able to figureout the true interaction range after 200 epochs. The re-sult of the prediction test for unseen cases showed perfect100% accuracy, as depicted in Fig. 2C.

B. Vicsek model

Next, we validate the capability of AgentNet for a con-tinuous and stochastic system. The Vicsek model [1](VM) is one of the earliest and most prominent modelsto describe an active matter system, where each agentaverages the velocity of nearby agents (including itself)to replace its previous velocity. At each time step, ev-ery agent updates its position by adding this newly as-signed velocity with stochastic noise. In this study, every300th agent in the simulation interacts with other agentswithin the range rc = 1 m and viewing angle θc = 120◦

of its heading direction. This complex interaction rangemodels the limitations of sight range and angle in realorganisms such as birds.

The model receives four state variables, positions xt

and yt, and velocities vxt and vy

t, and predicts the po-sitions of the next time step xt+1 and yt+1 in the form oftwo one-dimensional (1D) Gaussian distributions by opti-mizing the sum of two negative log-likelihood (NLL) lossfunctions. Note that each training data provides only asingle stochastically sampled value, thus putting Agent-Net for VM in the difficult condition of trying to identifythe general decision rule with only one sample for eachenvironment.

As a result, AgentNet for VM achieved a NLL lossof −1.365 for the test data, while the theoretically com-puted NLL loss was −1.524. We note that other ap-proaches, such as naive MLPs, failed to achieve meaning-ful prediction and resulted in a NLL loss of around +1.0for the VM. Figure 3A visualizes the x-variable attentionweight αx of two sample agents, a1 and a2. AgentNet forVM accurately learned the interaction boundary of the

Page 7: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

7

True trajectories Predictions

A

B E FTrue trajectories Predictions

C D

FIG. 4. Result of AgentNet for AOUP. (A) In both panels, eight steps of the test data (R = 4) of four particles are drawn withblack dots starting from the large black dots. AgentNet predictions of the trajectories in the following 12 steps (right panel)perfectly coincide with the sample trajectories from the true Langevin equation (left panel). 100 samples are drawn in bothpanels, and the final positions are highlighted with white stars. (B) Equilibrium state of a system with R = 5, which is unseenat the training stage. A single realization from the true distribution is drawn with final positions marked by red stars (left),while a single sample from the predicted distribution of AgentNet is drawn with final positions marked by blue stars (right).AgentNet for AOUP captures the generalized effect of interaction length R and predicts the collective behavior of the untrainedsystem. (C) Exerted x-directional force F int,x and x-directional velocity attention αvx shows a strong linear relationship, whilesingle attention value from GAT does not captures any of the force component. Same holds for (D), for the case of y-direction.(E) By plotting relative distance rij versus force and attention, scaled attention shows good coincidence with the force valueup to constant factor c = 0.28. (F) Visualization of αvx for a single target particle (blue). Attention and force values for (C)to (E) are collected from 100 test samples with R = 4.

given VM, which resembles a major sector of the circle.As Fig. 3B shows, the fully trained AgentNet assignsa high value to its x and y-variable attention only forneighbor agents, while the untrained AgentNet has nodistinction between neighbor and outside cells. The pre-dicted position distributions for these two sample agentsare depicted in Fig. 3C. We observe that AgentNet pre-cisely estimated the ground-truth distribution with truemean, even though the given training data is sampledfrom a stochastic distribution and did not match the ex-pected mean value. This shows the capability of Agent-Net to learn the general transition rule governing theentire set, rather than merely memorizing every singletraining datapoint and overfitting them. We also reportthat AgentNet shows the same outcome with unseen testdata.

C. Active Ornstein–Uhlenbeck Particle

Differing from the Vicsek model, some active mat-ter shows a time-correlation of particle positions due

to the force inherent in the particles that allows themto move. These systems are generally referred to asself-propelled particles, which can be described by over-damped Langevin equations for the position xi of eachparticle as

xi = µ(Fexti + Fint

i ) +√

2γTηi + fi, (6)

where µ is the mobility of the particle, γ the drag coef-ficient, and T is temperature. Here, Fext

i is the externalpotential, and Fint

i = −∇iV is the total force exertedon particle i due to the soft-core potential from otherparticles, V = exp(−|rij |3/R3), that depends on relativedistance rij and interaction length R. In this study, weuse AOUPs confined in a harmonic potential as an ex-ample system, describing the intrinsic propulsion force fias an independent Ornstein–Uhlenbeck process as

τ fi = −fi +√

2Dawi, (7)

where τ is correlation time, Da is a diffusion constant,and wi is a standard Gaussian white noise. As an ex-ternal potential, we apply a weak harmonic potential

Page 8: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

8

A

B

C Attention

𝑥𝑦-plane

𝑥𝑧-plane

𝑥

𝑦

𝑧

Velocity correlation

Attention Velocity correlation

FIG. 5. Result of AgentNet for CS. (A) Displacement errors of linear extrapolation, naive LSTM, GAT, GAT3+, and AgentNet.AgentNet shows the lowest displacement error compared to the baselines. Here, the final displacement error (FDE) of step nindicates the averaged error of birds for which their trajectories terminated at step n. All of the results are averaged value fromthree trials. (B) Exemplary snapshot of the visualized attention of a single agent (blue circle) from the test dataset. (C) Two-dimensional heatmap of averaged attention αvx and cosine similarity of the velocity with respect to the relative coordinates.We align every bird’s heading direction in the test dataset to the x-axis (blue arrow) and draw cross-sections in the xy-plane(upper panels) and xz-plane (lower panels). Different from the velocity correlation, attention shows more concentrated andstrongly directional distributions that coincide with previous literature about the bird’s visual frustum and sight direction. Forattention, a contour of the top 0.01% of the attention value is visualized (red, dashed) as well as the direction of the maximumattention value (red arrows).

Fexti = −kxi with spring constant k = 0.1 to confine

the particles, as broadly assumed and experimentally em-ployed [47]. This model is known to exhibit a collectiveclustering phenomenon, with the periodicity of the result-ing hexagonal pattern known to be approximately 1.4Rwith no Fext [48].

AgentNet for AOUP adopts an LSTM model as an en-coder to enable iterative data generation. The model ob-serves 8 steps of trajectories as input data, and the loss iscalculated for the next 12 steps. The model receives fourstate variables, xt,yt,vx

t, and vyt, and global variable R

ranging from 2.0 to 4.0, and predicts the parameters forfour 1D Gaussian distributions, similar to the AgentNetfor VM (see Appendix B for details). Note that the in-ternal variable, fi, which has its own Ornstein–Uhlenbeckdynamics, is not present in the input data and thus theneural network has to infer this hidden variable by eightsteps of past trajectory.

First, we compare the average displacement error(ADE) and final displacement error (FDE) of our modelamong 12 predicted steps as in previous works [21, 22]along with a linear extrapolation and naive LSTM with-

out the graph attention core as baselines. AgentNet forAOUP showed ADE/FDE of 0.041/0.064, while linear ex-trapolation and LSTM showed much lower performancesof 0.210/0.465 and 0.158/0.316, respectively. The per-formance of our model also exceeds the modern archi-tectures like GAT3+ (GAT with 3-headed attention andtransformer architecture), which showed the performanceof 0.065/0.087. Figure 4 summarizes the result of Agent-Net for AOUP. Our model precisely predicted the futuretrajectories subject to the past states, as depicted in Fig.4A where 100 trajectories sampled from the ground-truthLangevin equation and AgentNet for AOUP are drawn.Figure 4B shows that AgentNet is also capable of pre-dicting the untrained region of the global variable R andfurther exhibits a collective behavior that occurs far be-yond the trained time scale. Since our model can iter-atively predict future states indefinitely, we tested ourmodel to predict a total of 42 steps, which is 30 moresteps than the model was originally trained for. Surpris-ingly, our model predicts a precise hexagonal pattern ofperiodicity 7, which coincides with the theoretical valueof periodicity when R = 5. This verifies a generalization

Page 9: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

9

capability since the model had never been trained in theR = 5 condition and yet still properly captured the col-lective phenomenon, which only occurs at a much longertimescale than its training data had.

Moreover, we demonstrate that the attention αq cor-responds to the internal force F int,q, up to a constantfactor, as we claimed in system formulation section. Fig.4C and 4D verifies this by showing the attention for xand y-directional velocity vx, vy and the magnitude of

corresponding internal force F int,x, F inty , which clearlyexhibits a strong linear relationship. This cannot beachieved by a single-valued attention from conventionalGAT, which shows a poor agreement with any of the forcecomponents. We report that the single attention valuefrom GAT tries to convey the sum or average of eachinteraction strength. In Fig. 4E, we draw the scaledattention for vx and the internal force of the x directionF int,x = ∇Vij = (−3r2

ij exp [−r3ij/R

3])/R3 versus the rel-ative distance to the target particle rij . Despite a slightdisagreement at small rij , scaled attention with constantfactor c = 0.28 well matches F int,x and therefore can beconsidered as a good approximation for interaction mag-nitude. (See Appendix D for further investigation onAOUP attention.) AgentNet for AOUP successfully pre-dicted and investigated one of the most complex systemspossessing internal potential, external potential, mem-ory effects, and stochastic noises. We note that variablesother than vx also showed similar linear relationshipswith corresponding forces (results for other variables arereported in [46].)

D. Chimney swift trajectory

Finally, we demonstrate the capability of our frame-work by predicting the empirical trajectories of a freelybehaving flock of chimney swifts (CSs). Bird flocks arerenowned for their rich diversity of flocking dynamics,for which models with various mechanisms such as ve-locity alignment and cohesion have been proposed in thelast several decades [1, 15, 35]. We employed here a por-tion of the data from [49], recorded in Raleigh, NorthCarolina, in 2014. Since half of the trajectories last lessthan 150f = 5s and 80% last less than 300f = 10s dueto occlusion and the limited sight of the camera, obser-vation data takes the form of a spatiotemporal graphwith dynamic nodes where each agent lasts a short pe-riod and then disappears. Thus, discarding non-full tra-jectories as in previous works [21, 22] would significantlyreduce the number of birds to consider at a given timestep. To handle these disjointed yet entangled pieces oftrajectories, we propose a novel inspection method thatexamines the data at every step of the LSTM to man-ually connect the hidden states from the past, excludethe nonexistent birds at a certain time, and start a newchain of hidden states from a separate neural network ifan agent newly enters the scene. While several previousapproaches could handle graphs with dynamic edges [50–

52], AgentNet is, to the best of our knowledge, the firstattempt to deal with dynamic nodes on a spatiotemporalgraph (see Appendix A for a formal explanation of theinspection scheme).

The number of total birds appearing in each set variedfrom 300 to 1800, and each trajectory in the set startedand ended at different times. The model received statevariables that exist at the current time step, producedstatistics of three-dimensional position and velocity, andthen the sampled states were fed back into the model forthe next time prediction. NLL losses were calculated atevery LSTM step for existing birds.

Figure 5 summarizes the results of AgentNet for CS.The predictive power of AgentNet is illustrated in Fig.5A, where linear extrapolation and naive LSTM showmostly similar results while AgentNet shows greatly re-duced errors at predicting longer time steps, achieves bet-ter performances than GAT and GAT3+. Figure 5B,showing the visualized attention of a typical bird, clearlyindicates the near-sighted and forward-oriented natureof the bird’s interaction range. To further verify this in-teraction range, we averaged the attention values fromthe first step of predictions according to the relative co-ordinates of the target bird. The averaged results forαvx from the entire test set are drawn in Fig. 5C alongwith the averaged cosine similarity of the velocity, whichis a commonly used measure to find the range of inter-action. The interaction range projected on the xy-planecoincides with previous literature about biological agents’visual frustum, which depends on forward-oriented sightand the relative distance from each agent [19, 32, 35, 53].Also, the bird’s z-directional attention is relatively con-centrated downwards; this predicted attention is physio-logically plausible since downward-oriented visual fieldsare widely reported in various types of birds due to theirforaging nature and the blind area from the beak [36, 54].

Interestingly, Fig. 5C shows that the velocity correla-tion on the xy-plane and xz-plane shows no particulardirectional tendency as attention does. Although manystudies employ state correlations between agents to fig-ure out the characteristics of interaction [55, 56], correla-tion might be significantly different from the interactionrange itself [57]. Different from correlations, our modelprovides a causal interaction strength since the atten-tion value is strongly connected to the predictability offuture dynamics, which is quite useful for inferring andmodeling the microdynamics of individual birds.

Our model with variable-wise attention can furtherverify important physical insights. For instance, we havefound that although the scale is different, the form of at-tention concentration is surprisingly the same regardlessof the directions (Results for other variables are reportedin [46]). This directional homogeneity strongly impliesthat the bird-bird interaction is more like a near-sightedversion of the Vicsek model, differs from the distance-based force models like AOUP which must exhibit direc-tional heterogeneity. In conclusion, AgentNet employedthe position and velocity (heading direction) of neigh-

Page 10: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

10

boring birds into its prediction, thereby showing betterprediction compared to the non-interactive baseline andqualitatively plausible interaction range.

V. CONCLUSION

This study proposed AgentNet, a generalized frame-work for the data-driven modeling of a complex system.We demonstrated the flexibility, capability, and inter-pretability of our framework with large-scale data fromvarious complex systems. Our framework is universallyapplicable to agent-based systems that are governed bypairwise interactions and for which a sufficient amountof data is available. The proposed framework can inferand visualize variable-wise interaction strength betweenagents, which could assist researchers in gaining clearerinsights into given systems and their dynamics. Further-more, AgentNet is scalable for an arbitrary number ofagents due to the nature of GNNs, thus facilitating free-form simulation of the desired system with any initialcondition. Since attention values from our model canbe directly interpretable as a variable-wise interactionstrength function, we expect that AgentNet will be use-ful in heterogeneous settings where each state variableinteracts with different neighbors.

There are a great number of domains in which Agent-Net is anticipated to exhibit its full potential. As wedemonstrated via AOUP and CS, the analysis of activematter such as bacterial cells [47, 58], animal flocks [11–15], or pedestrian dynamics [21, 22] may greatly benefitfrom our approach. Also, since GNNs were originallyproposed for data with graph structures, AgentNet mayyield data-driven models of both agent and node dynam-ics of a network by incorporating an adjacency matrixinstead of assuming a complete graph. AgentNet canretrieve the underlying graph and interaction strengthfrom data, which encompasses the research fields of epi-demic dynamics [59], network identification, and variousinverse Ising problems [6]. We could further apply differ-ent encoders and decoders to improve the performanceand include available domain knowledge. For instance,a Gaussian mixture model [43] or variational model [22]that could approximate an arbitrary distribution may besuitable to approximate multimodal or highly irregulardistributions.

One limitation of the current work is that AgentNetcannot fully capture three or higher orders of interac-tions. The pairwise assumption is nearly the only induc-tive bias we have imposed on our model, which will re-quire modification if the target system is expected to havestrong higher-order interaction. First, by increasing thenumber of message passing layers, GNNs can employ in-formation from further than one-hop neighbors and pos-sibly capture the higher-order interactions among threeor more agents. Another way to alleviate the pairwise as-sumption is to consider higher-order interactions directlyin network construction. Applying a GNN with a hy-

pergraph structure [60–62], one of the rapidly growingresearch areas in machine learning, to AgentNet wouldbe a direct extension of the current study.

We highlight the virtually unbounded scope of the pro-posed framework in this study, and hope that AgentNetshines a new light on physical modeling and helps re-searchers in diverse domains delve into their systems ina data-driven manner.

ACKNOWLEDGMENTS

This research was supported by the Basic Science Re-search Program through the National Research Founda-tion of Korea NRF-2017R1A2B3006930. We appreciateY.J. Baek for providing insights to model demonstrationand fruitful discussion.

Appendix A: Neural architecture of AgentNet andtraining details

1. AgentNet and baseline implementations

We implemented our AgentNet model with PyTorch[63]. The encoder and decoder layers of AgentNet arecomposed of multi-layer perceptrons (MLPs). The di-mension notation such as [32, 16, 1] means that the modelconsists of three perceptron layers with 32, 16, and 1neurons in each layer. Also, dims. is an abbreviation ofdimensions.

All of the encoding layers of AgentNet are composedof [Input dims, 256, Attention dims]. Here, inputdimensions are chosen as the sum of the number of statevariables and additional variables, such as global vari-ables (as in AgentNet for AOUP) or indicator variables(as in AgentNet for CS). The form of the final dimensionindicates that each output of the encoder (key, query, andvalue) will be processed separately. See [40] for more de-tails about transformer architecture.

With these outputs and (additional) global externalvariables (u), neural attention is applied to calculate at-tention value αij from encoded data e(st). First, thealgorithm constructs aqij , the attention coefficient for theq-th state variable between agents i and j, and feeds theconcatenated vectors into MLP(att) as

aqij = Att(Key(e(si)),Query(e(sj)),u) (A1)

and applies the sigmoid function

αqij =

1

1 + exp(−aqij)(A2)

where Key, Query, and Att indicate the correspondingMLPs used for transformer architecture and has dimen-sions.

Page 11: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

11

After variable-wise attentions are multiplied to theirrespective values and averaged, we concatenate the (orig-inal target agent’s) value and its averaged attention-weighted values (from others) and feed it into thevariable-wise separated decoder. Since two tensors areconcatenated, the last dimension of this tensor has twicethe length of the original dimension of the value tensor.The decoder consists of [2 × value dims., 128, outputdims.].

In the stochastic setting (VM, AOUP, and CS), thedecoded tensor further feeds into other layers to obtainsufficient statistics for the probabilistic distribution. Inthis paper, those statistics are means and variances ofstate variables. MLP layers for these values consist of[output dims., 64, corresponding number of vari-ables]. For instance, AgentNet for CS has 2 × 6 = 12separate layers to calculate means and variances for 6state variables.

In the case of a target system with probable time cor-relations, we adopted long short-term memory (LSTM)as an encoder to capture the correlations [64]. Hiddenstates and cell states have 128 dims. each and are ini-tialized by additional MLPs that are jointly trained withthe main module. As explained in the main manuscript,AgentNet checks at each time step whether an agent isnew and present. When an agent is newly entered, newLSTM hidden states are initialized. Otherwise, hiddenstates succeed from the previous result.

For the baseline, we employed a MLP, LSTM, andGAT model where the variable-wise graph attentionmodule is missing. For MLP and LSTM, We doubledthe number of layers and neurons of the decoder to com-pensate for the missing attention module, which its de-coder consists of [2 × value dims., 256, 256, out-put dims.]. For standard GAT [27], we left everythingthe same as AgentNet and replaced variable-wise atten-tion core to original graph-attention core with linear pro-jection matrices of [Attention dims., 128] and inner-product attention was used with those 128-dimensionalvectors. For GAT3+, we implemented multi-headed at-tention (with 3 heads) with dimensions of 12 (for AOUP)and 32 (for CS), and every other module is the same asAgentNet. This choice is to (Note that even for GATand GAT3+, we used [Input dims, 256, Attentiondims] dimensions of encoding layers for the key, queryand value, instead of linear projection matrices as theoriginal architecture.)

2. Inspection scheme of AgentNet for CS

As the author of the original paper noted [32], CS datacontains a lot of short trajectories due to frequent oc-clusion and limitation of the camera viewing angle. Anexemplary snapshot of typical trajectory data is drawnin Fig. 6, where all of the trajectories are not aligned intime and have different lengths. This form of unstruc-tured data can be expressed as a dynamic graph that

𝑡 = 𝑡0𝑡

Bird # Ⓐ

ℎ𝑡0−1 ℎ𝑡0

Inspection Algorithm

FIG. 6. Trajectory data schema of the chimney swift flock[32]. (left) In each time step, each agent (bird) could take oneof three states: (A) it continues to exist, (B) it disappears,or (C) it newly enters the scene. (right) The hidden states ofthe agents for LSTM change by the inspection algorithm, cor-rectly removing vanished agents and introducing new agentswith initialized hidden states. Red crosses indicate dummyvariables (zeros) for padding.

precludes the naive application of a graph neural net-work (GNN) [65] with LSTM [64] states since the neigh-bor of each node changes dynamically and each node hasa different starting time. Hence, we need an inspectionmethod to manually check the hidden state at each timestep and update each node’s status to the appropriateform.

Figure 6 describes all three possible cases that our in-spection scheme needs to deal with. Case A is an ordi-nary case with a continued state, so we can simply handover the previously updated hidden state to the next iter-ation. Case B means that the agent, which existed at thelast time step (t = t0−1), disappeared and should not beconsidered as a valid neighbor anymore. Our inspectionmethod excludes such data and creates a mask for theattention matrix to ensure that the attention between avalid agent and non-existent agent should be strictly zeroat any time. Finally, case C indicates a newly enteredagent whose hidden states should be initialized before itstarts its chain of hidden states.

In practice, input data is created from raw data in theform of [time step, number of total agents, numberof state variables]. Note that we use the number oftotal agents for every time step, which is not always thesame as the number of agents at the specific time. Forinstance, in Fig. 6, there are 7 birds present in the totalscene, but only 6 birds are present at t = t0. Insteadof changing the data size every time step, we set everydimension to the maximum agent number (in this case,7) and fill the currently non-existing birds’ state variableswith dummy values (we used zeros). With such padding,we can create a batched data set that enables parallelcalculations to speed up the GPU deep learning. But wehave to carefully mask them out properly at each forwardpass to ensure that none of these dummy data affects theresults because we keep calculating and updating these

Page 12: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

12

dummy data as well as the real agents’ data.

Our inspection algorithm creates two masks withlength of total agents, which is called ’Now mask’ mnow

and ’New mask’ mnew. Each mask checks the existenceof agents at the current(t = t0) and next time step(t = t0 + 1), and assigns 1 if the corresponding condi-tion is satisfied and 0 otherwise. ’Current mask’ assigns1 if the agent is present at t0. ’New mask’ assigns 1 if theagent is not present at t0 but appears in t0 + 1. Inspec-tion method employs these three masks to control hiddenstate updates, attention calculations, and state variablesupdates.

At the start of every iteration, the algorithm checksnew agents that need to be initialized. Let us denotethe hidden states of agents from time t as ht and stateinput data as xt. Then, the hidden state update can beexpressed as

ht+1 = ht � (1−mnew) + I(xt)�mnew (A3)

where I(x) indicates the hidden state initializationmodule consisting of neural layers, and �means element-wise multiplication. Then, after the attention calcu-lation, the algorithm casts a two-dimensional attentionmask matt on raw attention matrix αq

raw to exclude allof the attention values between real and dummy agentsas follows:

matt,ij =

{1 if ai and aj are both real agent

0 if otherwise(A4)

and αq = αqraw � matt. We can construct matt from

the ’Now mask’ by repeating row vector mask mnow forthe column dimension to make expanded matrix maskMnow, with matt = Mnow ×MT

now.

3. Training scheme

All training used 2 to 10 NVIDIA TITAN V GPUs,with which the longest training for a single model tookless than two days. Mish activation function [66] with aform of f(x) = xtanh(softplus(x)) and the Adam Opti-mizer [67] were used for the construction of models andtraining. The learning rate was set to 0.0005 and de-creased to 70% of the previous value when the test lossremained steady for 30 epochs. In the case of AgentNetfor CS, we employed weighted NLL loss for different timesteps, in which weights are inversely proportional to thefrequency of the sample with a given trajectory length, toresolve the imbalance of available trajectory length. Ta-ble III shows further details of the model for each system,including the number of attention heads.

Appendix B: Dataset Details

1. Cellular automata

We employed the cellular automata (CA) model basedon the rule of Conway’s life game [30]. As described inthe main manuscript, CA takes place on regular gridswith each cell on the grid altering its cell state at eachtime step according to a specific set of rules. These rulesare often notated as B3/S23, which means a dead cellregenerates with three neighboring live cells, while a livecell stays alive with two or three neighboring live cells.In this study, we used a 14 × 14 grid with uniformlyrandom initial cell states and updated an inner squaregrid of 12× 12 to avoid the periodic boundary problem.We simulated 1,000 sets of samples for demonstration,comprising 800 samples for training and 200 samples fortesting. We report that fewer samples such as 500 or 300also resulted in a perfectly trained model with 100% testaccuracy. Each target cell along with its eight neighborcells yields a total of 29 = 512 possible microstates. Welocated a 3 × 3 microstate template at a random posi-tion on the grid cell, initialized to random cell states,and produced AgentNet output. If the model correctlylearned the transition rule, it would result in theoreticaloutput assigned by the transition rule of CA regardlessof its position and other irrelevant cell states.

From the perspective of AgentNet, CA is a binary clas-sification problem, i.e. whether each cell becomes aliveor dead at the next time step. AgentNet for CA re-ceives three state variables: positions xt and yt, and cellstate ct.The output here is a list of expected probabili-ties that each cell becomes alive. We use the binary crossentropy loss function between the AgentNet output andthe ground-truth label as

LBCE(s) =∑i

yi log ci + (1− yi) log (1− ci), (B1)

where ci is the predicted ith cell state of output s, andyi is the corresponding true label of the ith cell.

2. Vicsek model

The Vicsek model (VM) [1] assumes that flocking oc-curs due to velocity alignment with neighbors. Amongthe many variants, we implemented the simplest modelwith alignment terms and positional Gaussian noise.(Note that this differs from the originally proposed model[1], which used angular Gaussian noise instead.) In VM,the ith agent interacts with the jth agent if the distancebetween the two agents, r(sti, s

tj), is smaller than a certain

range, rc, and the absolute value of the angle between theheading direction of the ith agent and the position of thejth agent, θ(sti, s

tj), is smaller than a certain angle, θc.

This interaction range models the sight range limit ofliving organisms, resulting in a circular sector form. In

Page 13: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

13

TABLE III. Implementation details of the models for sample systems. ∗ : global variable (R), ∗∗ : indicator variable

System Input data dims. Output data dims. Attention head dims. Layer composition1. Cellular Automata 3 1 16 [32, 32, 16, 1]2. Vicsek Model 4 2 16 [32, 64, 32, 1]3. Active Ornstein–Uhlenbeck Particle 4 4 16 [32+1∗, 16, 8, 1]4. Chimney Swift 6 + 1∗∗ 6 96 [192, 16, 8, 1]

VM, the ith agent averages the velocity among its inter-acting neighbors Ri and adds Gaussian noise N (0, σ) tocompute its velocity vt+1

i , described as follows:

vt+1i = (vt

i +∑

aj∈Ri

vtj)/(|Ri|+ 1) +N (0, σ).

This formula is equivalent to (∑

R∗ivi)/|R∗i |+N (0, σ)

where R∗i = Ri ∪ ai.AgentNet for VM aims to predict the position of the

next time step, which is a sum of the current position andcalculated velocity. We simulated 2,000 sets of samplesfor demonstration, comprising 1,600 samples for trainingand 400 samples for testing. In our simulation, rc = 1,θc = 120◦, and standard deviation of noise σ = 0.2.

The model infers the parameters of two 1D Gaussiandistributions, which is means (µx, µy) and σx, σy. Wecalcualte the sum of the NLL loss function for Gaussiandistribution to train the AgentNet for VM as

LNLL(s) =∑q

∑i

−1

2log(2σq

i ) +(yq

i − µqi )2

2(σqi )2

, (B2)

where µqi and σq

i are the predicted statistics of variableq of the ith agent, and yq

i is the corresponding label ofthe ith cell. In the VM case, q = {x, y}.

3. Active Ornstein–Uhlenbeck particle

For the AOUP dataset, we simulated 8,000 sets oftraining data and label pairs of 100 particles. All par-ticles were uniformly spread on a circle of radius 5, andinitial speeds were sampled from uniform distributionU(0, 0.05). We implemented the Euler–Maruyama [68]method with timestep dt = 0.01 to numerically simulatethe AOUP trajectories. The data and label points werefurther subsampled from the simulated trajectory with afrequency of 10 Hz, which means every 1 out of 10 subse-quent data points were chosen. In terms of real-time, ourmodel receives 0.8 s of data observation time and thenpredicts the following 1.2 s of trajectory. The followingconstants were adopted for simulation: γ = 1, τ = 0.5,k = 0.1, Da = 0.02, and T = 0.2. We have found thatour model is robust against change in system constants,showing similar performance with different constant val-ues.

We attached a constant vector R to the attention vec-tor to open the possibility that interaction function hpairdepends on the global variable R. (In case of AOUPs,this is the case since R affects interaction potential Fint.)Differing from the VM case, AgentNet for AOUP yields atotal of 8 parameters for each agent, namely means andvariances for positions xt+1, yt+1 and velocities vx

t+1,vy

t+1. This is necessary because input and output statevariables should be the same in order to iteratively sam-ple from the predicted distribution and feed it into theprediction at the next step.

We employed teacher forcing [69] to train the LSTM-based AgentNet, which is a technique that feeds ground-truth labels into subsequent LSTM cells instead of sam-pled output in the early stages of training. This is use-ful to stabilize the training of trajectory prediction sincethe prediction depends on the last output which typi-cally explodes to meaningless values in early, untrainedstages. We set an initial epoch of 50 as the teaching pe-riod such that the possibility of using the ground-truthlabel is 1 − epoch/50. This gradually decreasing possi-bility becomes 0 at epoch 50, after which ground-truth isnot used.

For evaluation, linear extrapolation along with naiveLSTM are selected as baselines. In the case of extrap-olation, x and y coordinates of the previous 8 steps areextrapolated through time and the next 12 consecutivesteps are recorded. The naive LSTM used the same statevariables (x, y, vx, vy) and the same output structure, butonly contained the LSTM encoder and MLP decoder (i.e.was missing an attention core). This implies that no ef-fects of interactions with others are considered by thenaive LSTM. Averaged displacement error (ADE) is cal-culated by taking the average of Euclidean distances fromground-truth to predicted coordinates for all 12 steps. Fi-nal displacement error (FDE) only takes the averages ofthe final (12th step) error.

4. Chimney swift flock trajectory

The original paper [32] aimed to focus on collectivebehavior during the landing sequence of a chimney swift(CS) flock. The flock data contains 30 min of observedCS trajectories at 30 frames (=f) per second, with ap-proximately 100,000 unique trajectories and a maximumof 1,848 birds at one instance. The reason that the num-ber of unique trajectories greatly exceeds the size of theflock is because many of the trajectories from the same

Page 14: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

14

birds are treated separately if (1) they escape from thesight of the cameras and later re-enter, or (2) the birdsare occluded by other birds thus introducing ambiguity.The whole dataset can be divided into three parts: (1) aninitial stage where the birds are starting to gather, (2) anintermediate stage where the flock forms with birds show-ing collective spinning, and (3) the final stage of landingon the chimney. In this study, we used the first portion(file A) of the data since we are interested in general birdflocking rather than a specific landing sequence (file C),and the second part (file B) contained more than 3000unique birds in a 5 s instance, generally exceeding thememory capacity of a GPU (12 GB) even for the case ofa single data per batch. Among 30 min (= 54000 f) oftrajectory data, we employed around 10 min (= 18000 f)of frames and constructed a dataset with 30f ∗ 10 = 300feach, 20f apart from each other. Although a single data-point spans an overlapped time range with other data, westrictly split the training and test dataset to remove anypossibility of data contamination. Exact details aboutthe dataset and statistics can be found in [32].

In the CS case, there is a problem with unbalancedlabels since not every sample has full 10-step (10 s) tra-jectories, as mentioned in the main manuscript. Thus,we checked the number of agents present at certain stepsand calculated a weighted loss to strengthen the effect ofcases with fewer birds (typically, the case of a higher timestep has a smaller number of constituents since many tra-jectories end early).

AgentNet for CS was trained with 530 sets of bird tra-jectories that spanned 300 frames each, split into 10 timesteps. Since every trajectory starts and ends at differenttime steps, we take an average of bird displacement errorwhere its starting point is regarded as time step 0. Forinstance, if the trajectory of bird 1 spans from time step0 to 3 and the one of bird 2 spans from 4 to 6, the errorcalculated at bird 1, step 2 and bird 2, step 6 will betreated the same as the displacement error at the timestep 2 since both steps are 2nd steps to their startingpoint. Also, since there were trajectories only lasting 2time steps (= 1.0s), we cannot apply linear regression forthose trajectories. Instead, we linearly extrapolated thetrajectories by employing the given velocity at the firsttime step. Also, we intentionally scaled the input data bymultiplying 0.1 for every variable to stabilize the trainingwhile preserving the relative difference of variable mag-nitude.

Appendix C: Advantages of Neural Attention

Neural attention requires a greater number of param-eters to optimize, but showed greater performance inpredicting complex agents compared to existing atten-tion mechanisms, as shown in the main manuscript. Onepossible reason for the outperformance of neural atten-tion is that the interaction range of a complex systemis far more nonlinear and complicated than conventional

BA

FIG. 7. (A) Functions of a smoothed version of interactionrange from sigmoid models with various smoothing param-eter b. (B) Performance comparison of VAINS, GAT, andAgentNet compared to NLL loss from groundtruth.

C

A B

FIG. 8. PCA result of value vectors v from AgentNet forAOUP. (A) x-directional force F int,x abd attention for vx.Note that attention learns magnitude not the sign, due otits non-negative constraint. (B) Explained variance of eachprincipal component sorted by rank, which shows four clearlydistinguished components. (B) Each principal component andcorresponding x or y coordinate data.

Page 15: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

15

attention mechanisms can handle, which are relativelyunder-parametrized for such interaction ranges. This as-sumption is further supported if the performance gapdecreases as the interaction boundary becomes more lin-ear. We experimented with a smoothed version VM (Fig.7A), where interaction strength is defined as a sigmoidfunction s(x) = 1/(1 + exp[b(x − a)]). Figure 7B showsthe performances of VAIN (exponential-based attentionfrom [33]), GAT (conventional transformer), and Agent-Net (neural attention) with different b when a = 1. Notethat the original interaction boundary coincides with theb→∞ case. Results illustrate the difference between thescales of the two mechanisms for b, which underpins theaforementioned nonlinearity hypothesis.

Appendix D: AOUP attention analysis

The variable-wise attention in AgentNet is wrappedby the sigmoid function, where its output is equal to orgreater than 0, as seen in Fig. 7A. Hence, relative posi-tional information is vital to convert the magnitude intoactual force with a sign, which can be directly appliedto change the velocity. For instance, force should be +xdirection if the target agent i is in (1, 0) and the neigh-bor agent j is in (0, 0), but it changes to −x direction ifthe neighbor agent moves to (2, 0), although the magni-

tude (and thus the attention value) would be the same.It is possible to regain this directional information if thevalue vector vj , corresponding to a leftover function, con-tains the positional data of the jth agent; the positioninformation of the ith agent can be delivered by hself.

Figure 7 verifies that value vector v preserves the po-sitional information of the input data. We gathered datafrom 100 test samples of AOUP, and performed princi-pal component analysis (PCA) to 16-dimensional vectorv from the encoder. Figure 7A shows that the top fourprincipal components (PCs) stand out in terms of ex-plained variance, compared to the PCs with lower ranks.We found that these four PCs have an interesting nonlin-ear relationship with two coordinates x and y from inputdata. Figure 7B shows that these four PCs have strong si-nusoidal relationships with x and y, giving clear evidenceof information preservation. We note that this specificnon-linear transformation is not special, and there couldbe various other possible means of preserving positionalinformation. (For instance, the simplest way to conveypositional information would be to directly preserve thecoordinate data from the input without any nonlineartransformation.) In this particular case, we can assumethat the overall function f could successfully retrieve theoriginal coordinate from these sinusoidal functions andultimately obtain the sign of the relative position by com-paring it with the target ith agent’s position.

[1] T. Vicsek, A. Czirok, E. Ben-Jacob, I. Cohen, andO. Shochet, Novel type of phase transition in a systemof self-driven particles, Physical Review Letters 75, 1226(1995).

[2] J. Bongard and H. Lipson, Automated reverse engineer-ing of nonlinear dynamical systems, Proceedings of theNational Academy of Sciences 104, 9943 (2007).

[3] M. Schmidt and H. Lipson, Distilling free-form naturallaws from experimental data, science 324, 81 (2009).

[4] P. Y. Lu, S. Kim, and M. Soljacic, Extracting inter-pretable physical parameters from spatiotemporal sys-tems using unsupervised learning, Phys. Rev. X 10,031056 (2020).

[5] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton,Data-driven discovery of coordinates and governing equa-tions, Proceedings of the National Academy of Sciences116, 22445 (2019).

[6] H. C. Nguyen, R. Zecchina, and J. Berg, Inverse statis-tical problems: from the inverse ising problem to datascience, Advances in Physics 66, 197 (2017).

[7] V. I. Gorbachenko, T. V. Lazovskaya, D. A. Tarkhov,A. N. Vasilyev, and M. V. Zhukov, Neural networktechnique in some inverse problems of mathematicalphysics, in International Symposium on Neural Networks(Springer, 2016) pp. 310–316.

[8] T. Wu and M. Tegmark, Toward an artificial intelligencephysicist for unsupervised learning, Physical Review E100, 033311 (2019).

[9] H. Li, X.-q. Shi, M. Huang, X. Chen, M. Xiao, C. Liu,H. Chate, and H. Zhang, Data-driven quantitative mod-

eling of bacterial active nematics, Proceedings of the Na-tional Academy of Sciences 116, 777 (2019).

[10] D. P. Solomatine and A. Ostfeld, Data-driven modelling:some past experiences and new approaches, Journal ofhydroinformatics 10, 3 (2008).

[11] J. Buhl, D. J. Sumpter, I. D. Couzin, J. J. Hale, E. Des-pland, E. R. Miller, and S. J. Simpson, From disorder toorder in marching locusts, Science 312, 1402 (2006).

[12] J. G. Puckett, D. H. Kelley, and N. T. Ouellette, Search-ing for effective forces in laboratory insect swarms, Sci-entific reports 4, 4766 (2014).

[13] Y. Katz, K. Tunstrøm, C. C. Ioannou, C. Huepe, andI. D. Couzin, Inferring the structure and dynamics ofinteractions in schooling fish, Proceedings of the NationalAcademy of Sciences 108, 18720 (2011).

[14] M. Ballerini, N. Cabibbo, R. Candelier, A. Cavagna,E. Cisbani, I. Giardina, V. Lecomte, A. Orlandi,G. Parisi, A. Procaccini, et al., Interaction ruling an-imal collective behavior depends on topological ratherthan metric distance: Evidence from a field study, Pro-ceedings of the national academy of sciences 105, 1232(2008).

[15] W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Sil-vestri, M. Viale, and A. M. Walczak, Statistical mechan-ics for natural flocks of birds, Proceedings of the NationalAcademy of Sciences 109, 4786 (2012).

[16] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti,D. Raposo, A. Santoro, R. Faulkner, et al., Relational in-ductive biases, deep learning, and graph networks, arXiv

Page 16: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

16

preprint arXiv:1806.01261 (2018).[17] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al.,

Interaction networks for learning about objects, relationsand physics, in Advances in neural information processingsystems (2016) pp. 4502–4510.

[18] W. Gilpin, Cellular automata as convolutional neuralnetworks, Physical Review E 100, 032402 (2019).

[19] F. J. Heras, F. Romero-Ferrero, R. C. Hinz, and G. G.de Polavieja, Deep attention networks reveal the rules ofcollective motion in zebrafish, PLoS computational biol-ogy 15, e1007354 (2019).

[20] K. T. Schutt, A. Tkatchenko, and K.-R. Muller, Learningrepresentations of molecules and materials with atom-istic neural networks, arXiv preprint arXiv:1812.04690(2018).

[21] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, Social lstm: Human trajectory pre-diction in crowded spaces, in Proceedings of the IEEEconference on computer vision and pattern recognition(2016) pp. 961–971.

[22] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, andA. Alahi, Social gan: Socially acceptable trajectorieswith generative adversarial networks, in Proceedings ofthe IEEE Conference on Computer Vision and PatternRecognition (2018) pp. 2255–2264.

[23] A. Vemula, K. Muelling, and J. Oh, Social attention:Modeling attention in human crowds, in 2018 IEEEInternational Conference on Robotics and Automation(ICRA) (IEEE, 2018) pp. 1–7.

[24] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, andR. Zemel, Neural relational inference for interacting sys-tems, arXiv preprint arXiv:1802.04687 (2018).

[25] D. Mrowca, C. Zhuang, E. Wang, N. Haber, L. F. Fei-Fei,J. Tenenbaum, and D. L. Yamins, Flexible neural repre-sentation for physics prediction, in Advances in NeuralInformation Processing Systems (2018) pp. 8799–8810.

[26] S. Seo and Y. Liu, Differentiable physics-informed graphnetworks, arXiv preprint arXiv:1902.02950 (2019).

[27] P. Velickovic, G. Cucurull, A. Casanova, A. Romero,P. Lio, and Y. Bengio, Graph attention networks, arXivpreprint arXiv:1710.10903 (2017).

[28] G. Cybenko, Approximation by superpositions of a sig-moidal function, Mathematics of control, signals and sys-tems 2, 303 (1989).

[29] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S.Dickstein, On the expressive power of deep neural net-works, in Proceedings of the 34th International Confer-ence on Machine Learning-Volume 70 (JMLR. org, 2017)pp. 2847–2854.

[30] M. Gardner, Mathematical games, Scientific American222, 132 (1970).

[31] L. Caprini, E. Hernandez-Garcıa, C. Lopez, and U. M. B.Marconi, A comparative study between two models ofactive cluster-crystals, arXiv preprint arXiv:1906.03016(2019).

[32] D. J. Evangelista, D. D. Ray, S. K. Raja, and T. L.Hedrick, Three-dimensional trajectories and networkanalyses of group behaviour within chimney swift flocksduring approaches to the roost, Proceedings of the RoyalSociety B: Biological Sciences 284, 20162602 (2017).

[33] Y. Hoshen, Vain: Attentional multi-agent predictivemodeling, in Advances in Neural Information ProcessingSystems (2017) pp. 2701–2711.

[34] S. Sukhbaatar, R. Fergus, et al., Learning multiagent

communication with backpropagation, in Advances inNeural Information Processing Systems (2016) pp. 2244–2252.

[35] C. K. Hemelrijk and H. Hildenbrandt, Some causes ofthe variable shape of flocks of birds, PloS one 6, e22479(2011).

[36] G. Martin and J. Shaw, Bird collisions with power lines:failing to see the way ahead?, Biological Conservation143, 2695 (2010).

[37] D. P. Landau and K. Binder, Phase diagrams and criti-cal behavior of ising square lattices with nearest-, next-nearest-, and third-nearest-neighbor couplings, PhysicalReview B 31, 5946 (1985).

[38] R. A. Holley and T. M. Liggett, Ergodic theorems forweakly interacting infinite systems and the voter model,The annals of probability , 643 (1975).

[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,Attention is all you need, in Advances in neural informa-tion processing systems (2017) pp. 5998–6008.

[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,Attention is all you need, in Advances in neural informa-tion processing systems (2017) pp. 5998–6008.

[41] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben-gio, Generative adversarial nets, in Advances in neuralinformation processing systems (2014) pp. 2672–2680.

[42] D. P. Kingma and M. Welling, Auto-encoding variationalbayes, arXiv preprint arXiv:1312.6114 (2013).

[43] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville,and Y. Bengio, A recurrent latent variable model for se-quential data, in Advances in neural information process-ing systems (2015) pp. 2980–2988.

[44] A. Graves, Generating sequences with recurrent neuralnetworks, arXiv preprint arXiv:1308.0850 (2013).

[45] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, andY. Bengio, Attention-based models for speech recogni-tion, in Advances in neural information processing sys-tems (2015) pp. 577–585.

[46] Code for models, system simulations, and supplemen-tary figures for attention visualizations are deposited athttps://github.com/nokpil/agentnet (2020).

[47] C. Maggi, M. Paoluzzi, N. Pellicciotta, A. Lepore, L. An-gelani, and R. Di Leonardo, Generalized energy equipar-tition in harmonic oscillators driven by active baths,Physical Review Letters 113, 238303 (2014).

[48] J.-B. Delfau, H. Ollivier, C. Lopez, B. Blasius, andE. Hernandez-Garcıa, Pattern formation with repulsivesoft-core interactions: Discrete particle dynamics anddean-kawasaki equation, Physical Review E 94, 042120(2016).

[49] D. J. Evangelista, D. D. Ray, S. K. Raja, and T. L.Hedrick, Data from: Three-dimensional trajectories andnetwork analyses of group behaviour within chimneyswift flocks during approaches to the roost (2018).

[50] Y. Ma, Z. Guo, Z. Ren, E. Zhao, J. Tang, andD. Yin, Dynamic graph neural networks, arXiv preprintarXiv:1810.10627 (2018).

[51] F. Manessi, A. Rozza, and M. Manzo, Dynamic graphconvolutional networks, Pattern Recognition 97, 107000(2020).

[52] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura,H. Kanezashi, T. Kaler, and C. E. Leisersen, Evolvegcn:

Page 17: arXiv:2001.02539v4 [cond-mat.stat-mech] 12 Nov 2020

17

Evolving graph convolutional networks for dynamicgraphs, arXiv preprint arXiv:1902.10191 (2019).

[53] K. Smith, S. O. Ba, J.-M. Odobez, and D. Gatica-Perez,Tracking the visual focus of attention for a varying num-ber of wandering people, IEEE transactions on patternanalysis and machine intelligence 30, 1212 (2008).

[54] G. Martin, Bird collisions: a visual or a perceptualproblem, BOU Proceedings–Climate Change and Birds.http://www. bou. org. uk/bouproc-net/ccb/martin. pdf(2010).

[55] R. Ni and N. Ouellette, Velocity correlations in labora-tory insect swarms, The European Physical Journal Spe-cial Topics 224, 3271 (2015).

[56] M. Nagy, Z. Akos, D. Biro, and T. Vicsek, Hierarchicalgroup dynamics in pigeon flocks, Nature 464, 890 (2010).

[57] A. Cavagna, A. Cimarelli, I. Giardina, G. Parisi, R. San-tagati, F. Stefanini, and M. Viale, Scale-free correlationsin starling flocks, Proceedings of the National Academyof Sciences 107, 11865 (2010).

[58] A. Sokolov and I. S. Aranson, Physical properties of col-lective motion in suspensions of bacteria, Physical Re-view Letters 109, 248109 (2012).

[59] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, andA. Vespignani, Epidemic processes in complex networks,Reviews of Modern Physics 87, 925 (2015).

[60] Y. Feng, H. You, Z. Zhang, R. Ji, and Y. Gao, Hyper-graph neural networks, in Proceedings of the AAAI Con-ference on Artificial Intelligence, Vol. 33 (2019) pp. 3558–

3565.[61] S. Bai, F. Zhang, and P. H. Torr, Hypergraph con-

volution and hypergraph attention, arXiv preprintarXiv:1901.08150 (2019).

[62] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E.Lenssen, G. Rattan, and M. Grohe, Weisfeiler and le-man go neural: Higher-order graph neural networks, inProceedings of the AAAI Conference on Artificial Intel-ligence, Vol. 33 (2019) pp. 4602–4609.

[63] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang,Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer,Automatic differentiation in PyTorch, in NIPS AutodiffWorkshop (2017).

[64] S. Hochreiter and J. Schmidhuber, Long short-term mem-ory, Neural computation 9, 1735 (1997).

[65] T. N. Kipf and M. Welling, Semi-supervised classifica-tion with graph convolutional networks, arXiv preprintarXiv:1609.02907 (2016).

[66] D. Misra, Mish: A self regularized non-monotonic neu-ral activation function, arXiv preprint arXiv:1908.08681(2019).

[67] D. P. Kingma and J. Ba, Adam: A method for stochasticoptimization, arXiv preprint arXiv:1412.6980 (2014).

[68] R. Toral and P. Colet, Stochastic numerical methods: anintroduction for students and scientists (John Wiley &Sons, 2014).

[69] R. J. Williams and D. Zipser, A learning algorithm forcontinually running fully recurrent neural networks, Neu-ral computation 1, 270 (1989).