Temporal Neighbourhood Aggregation: Predicting Future Links in Temporal Graphs via Recurrent Variational Graph Convolutions

Graphs have become a crucial way to represent large, complex and often temporal datasets across a wide range of scientific disciplines. However, when graphs are used as input to machine learning models, this rich temporal information is frequently disregarded during the learning process, resulting in suboptimal performance on certain temporal inference tasks. To combat this, we introduce Temporal Neighbourhood Aggregation (TNA), a novel vertex representation model architecture designed to capture both topological and temporal information to directly predict future graph states. Our model exploits hierarchical recurrence at different depths within the graph to enable exploration of changes in temporal neighbourhoods, whilst requiring no additional features or labels to be present. The final vertex representations are created using variational sampling and are optimised to directly predict the next graph in the sequence. Our claims are supported by experimental evaluation on both real and synthetic benchmark datasets, where our approach demonstrates superior performance compared to competing methods, outperforming them at predicting new temporal edges by as much as 23% on real-world datasets, whilst also requiring fewer overall model parameters.

Index Terms-representation learning, dynamic link prediction

I. INTRODUCTION
Using graphs to represent relationships in large, complex and high-dimensional datasets has become a universal phenomenon across many scientific fields. This encompasses not only computer scientists, interested in social and citation networks [1], but also biologists, studying protein interaction graphs for associations with diseases [2], chemists, who model molecule properties by treating them as graphs [3], and physicists, who use graphs to model a physical environment [4].
Using graph-based approaches enables complex data analysis, with one of the most universal tasks being the identification of missing links within the graph, which can provide invaluable insight in many real-world scenarios. Examples include the recommendation of acquaintances on social networks, of new research papers to read, or even of new links between molecules. However, to date, almost all of the prediction work performed on graphs has focused on analysis solely in the topological domain, ignoring the rich temporal information inherent in so much of the data represented by graphs (as seen in Figure 1).

Fig. 1: The temporal link prediction task is to predict the new edges (red) in the final graph snapshot G_T (green plane) given the previous graphs G_1 and G_2.
We formally define a graph G = (V, E) as a finite set of vertices V, with a corresponding set of edges E. Elements of E are unordered tuples {i, j}, where i, j ∈ V. Elements in V and E may have labels or certain associated features, although these are not required for this work. In order to perform analysis on graphs, we need a mechanism which converts the formal graph representation into a format amenable to machine learning: graph representation learning.
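To make this formalism concrete, the adjacency-matrix view of a small undirected graph can be sketched in a few lines of Python (a toy illustration; names here are not from the paper):

```python
# A toy undirected graph G = (V, E) with unordered edge tuples,
# and its symmetric |V| x |V| adjacency matrix A.
V = [0, 1, 2, 3]
E = [{0, 1}, {1, 2}, {2, 3}]

def adjacency_matrix(V, E):
    """Build the symmetric adjacency matrix A, where A[i][j] = 1
    iff the unordered edge {i, j} is present."""
    n = len(V)
    A = [[0] * n for _ in range(n)]
    for edge in E:
        i, j = tuple(edge)
        A[i][j] = 1
        A[j][i] = 1  # edges are unordered, so A is symmetric
    return A

A = adjacency_matrix(V, E)
```

No labels or features are attached to V or E, matching the minimal setting assumed in this work.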
The field of graph representation learning has received significant attention as a means of analysing large, complex graphs via the use of machine learning. Graph representation learning comprises a set of techniques that learn latent representations of a graph, which can then be used as the input to machine learning models for downstream prediction tasks [5]. The majority of graph representation learning techniques have focused upon learning vertex embeddings [6] and reconstructing missing edges [5]. As such, the goal of graph representation learning is to learn some function f : V → R^d which maps from the set of vertices V to a set of embeddings of the vertices, where d is the required dimensionality. This results in f being a mapping from G to a representation matrix of dimensions |V| × d, i.e. an embedding of size d for each vertex in the graph. However, the majority of graph representation learning approaches to date ignore the temporal aspect of dynamic graphs, resulting in models which perform poorly at predicting future change in a graph.
This paper introduces a new model, entitled Temporal Neighbourhood Aggregation (TNA), designed to learn vertex representations which capture both topological and temporal change by exploiting the rich information found in large dynamic graphs. To achieve this, we propose a novel model architecture combining graph convolutions with recurrent connections on the resulting vertex-level representations to allow for powerful, hierarchical learning at multiple hops of a vertex's neighbourhood. This approach means the model can explore at which neighbourhood depth the most useful temporal information can be learned. Further, we aggregate the temporal neighbourhood using tools from variational inference, resulting in a more robust and stable final representation for each vertex. Our TNA model is trained end to end on temporal graphs represented as time snapshots, where the objective is to directly and accurately predict the next graph in the sequence using the embeddings alone. This results in a model which, unlike many competing approaches, requires no explicitly parameterized decoder model. In summary, our primary contributions are as follows:

• Temporal Neighbourhood Aggregation - Our proposed model is capable of independently learning the temporal evolutionary patterns within the neighbourhood of a vertex at different depths, resulting in superior performance at predicting future links. Moreover, our approach requires no additional vertex features, labels or random walk procedures as part of its process.

• Variational Sampling - More robust temporal representations, and consequently more accurate prediction of the next graph in the evolving sequence, are made possible by sampling vertex embeddings using the principles of variational inference.

• Model Efficacy and Scalability - Our model contains significantly fewer parameters than competing approaches, as it does not require a parameterized decoder portion. This leads to our model being scalable to larger graphs as a result of its memory efficiency.

Our work is supported by extensive experimentation on public benchmark datasets. Further, to aid reproducibility, we open-source all of our PyTorch [7] based source-code and experimentation scripts.¹

¹ https://github.com/sbonner0/temporal-neighbourhood-aggregation

II. RELATED WORKS

We highlight prior work in the areas of graph representation learning and temporal embeddings.

A. Graph Representation Learning
Historically, low-dimensional graph representations were created via matrix factorization techniques. Examples of such approaches include Laplacian eigenmaps [8] and Graph Factorization [9]. More recent models, originally used for Natural Language Processing (NLP) tasks, have been adapted to learn graph embeddings. These approaches exploit random walks to create 'sentences' which can be used as input to language-inspired models such as DeepWalk [10] and Node2Vec [5].
Graph-specific neural network based models have been created, inspired by Convolutional Neural Networks (CNN). Such approaches attempt to create a differentiable model for learning directly from graph structures. Many Graph CNN approaches operate in the spectral domain of the graph, using eigenvectors derived from the Laplacian matrix of a graph [1]. The Graph Convolutional Network (GCN) approach has proven to be particularly effective [1]. GCN uses a layer-wise propagation rule to aggregate information from the 1-hop neighbourhood of a vertex to create its representation. This layer-wise rule can be stacked k times to aggregate information from k hops away.
The approaches discussed thus far have been supervised, mandating the use of labels. However, graph embedding approaches exist which are based on auto-encoders, a type of neural network trained to reconstruct the input data after initially being projected into a lower dimension [11]. For example, GCNs have been used as the basis of a convolutional auto-encoder model [12], demonstrating state-of-the-art results for static link prediction.

B. Temporal Graph Embeddings
We argue that the existing approaches for temporal graph embeddings can be split into two categories: Temporal Walk and Adjacency Matrix Factorisation.
1) Temporal Walk Approaches: In an approach entitled STWalk [13], the authors aim to learn node trajectories via the use of random walks which learn representations considering all the previous time steps of a temporal graph. In the best performing approach presented, the authors simultaneously learn two representations for a given vertex, which are concatenated to create the final temporal embedding. However, the approach is not end-to-end and requires the user to manually choose how many time steps to consider.
Yu et al. [14], propose NetWalk, which enables anomaly detection in streaming graphs via a vertex-level dynamic graph embedding model. In the approach, a collection of short random walks captured from the graph is passed into an autoencoder based model to create the vertex representations.
Nguyen et al. [15], propose a model to incorporate temporal information when creating graph embeddings via random walks by capturing individual temporal changes within a graph. They propose a temporal random walk to create the input data, with the approach producing more complex and rich temporal walks via a biasing process.
2) Adjacency Matrix Factorisation Approaches: Goyal et al. [16], propose a model for creating dynamic graph embeddings, entitled DynGEM. In this approach, they extend the auto-encoder graph embedding model of Structural Deep Network Embedding (SDNE) [17] to consider dynamic graphs, by using a method similar to Net2net [18], which is designed to transfer knowledge from one neural network to a second.
In a family of approaches entitled Dyngraph2vec*, comprising DynAE, DynRNN and DynAERNN, Goyal et al. [19] further extend an SDNE-type approach to incorporate temporal information in a variety of ways. The best performing of these approaches, DynAERNN, uses a combination of SDNE-like dense auto-encoders with stacked recurrent layers to learn temporal information when creating vertex embeddings. However, they do not make use of graph convolutions and require a complex decoder model to predict the next graph.
There have been attempts to incorporate temporal aspects into GCNs. However, some [20], [21] focus upon supervised learning, do not explicitly use the models to predict the future graph state, and only have a single layer of recurrent connections. More recent approaches, such as GCN-GAN [22] and GC-LSTM [23], require large and complex decoder models, meaning they cannot scale to graphs of one thousand vertices or more on current hardware, whilst also lacking the variational sampling of our approach. In comparison, EvolveGCN [24] uses recurrent layers to directly evolve the parameters of standard GCN layers, which means it does not explicitly track vertex neighbourhood evolution.
One of the application areas most frequently learning temporal models on graphs is that of traffic modelling, where approaches such as [25] and [26] combine graph learning with temporal models to predict traffic movement. However, unlike these approaches, we focus on creating vertex-level embeddings directly optimised to predict future edges and to learn change at different hops of a vertex's neighbourhood.

III. METHODOLOGY
We briefly outline the proposed approach, relevant background, network architecture and the training procedure. Throughout, we make use of the notation in Table I.

A. Motivation
Many of the phenomena that are commonly represented via graph structures are known to evolve over time: links between entities form and break in a constantly evolving stream of changes. We thus view graphs as a series of snapshots, with each graph snapshot containing the connections present at that particular moment in time. More formally, we can redefine a graph G to be a temporal graph 𝒢 = {G_1, G_2, ..., G_T}, where each graph snapshot G_t ∀t ∈ [1, T] contains a corresponding vertex set V_t and edge set E_t. A common and vital task within the field of graph mining is that of future link prediction, where the goal is to accurately predict which vertices within a graph will form a connection in the future [16].

TABLE I: Notation.

Symbol | Definition
G      | A graph with an associated set of vertices V and corresponding set of edges E.
A      | The adjacency matrix of graph G, a symmetric matrix of size |V| × |V|, where (a_{i,j}) is 1 if an edge is present and 0 otherwise.
Â      | A normalised by its degree matrix D and its identity matrix.
T      | The number of snapshots in 𝒢.
G_t    | A graph from 𝒢.
σ_s    | The sigmoid activation function.
σ_r    | The rectified linear activation function (ReLU).
σ_lr   | The leaky ReLU activation function.
l      | A certain layer in the model.
W      | A weight matrix.

Figure 1 highlights this future link prediction task, where the goal is to predict the new edges, coloured in red, formed in G_T, given the previous graphs in the temporal history G_1 and G_2. Any model designed to accomplish this task must learn the evolution patterns present in edge formation, even though the number of edges changing at each time point is often a small fraction of the total number. We propose to tackle this by creating temporally-aware graph embeddings, which are explicitly trained to recreate a future time step of the graph. We entitle our approach Temporal Neighbourhood Aggregation (TNA), since, to create a better and more meaningful representation for a certain vertex, the model is able to aggregate information about how its neighbourhood has changed in the past to more accurately predict how it will change in the future. More concretely, a temporal graph 𝒢 is input to our TNA model Θ(𝒢), which learns a representation for each vertex in G_t ∈ 𝒢 such that its output can accurately predict the graph G_{t+1}. Ideally, we want to create a model Θ() which can perform this temporal learning using just the sequence of graphs until G_t, such that G_{t+1} = Θ(G_1, ..., G_t). TNA is able to accomplish this, requiring no pre-processing steps which could affect the model's performance (e.g. random walk procedures), no precomputed vertex features and no additional labels.
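Under this snapshot view, the prediction target at each step is the set of newly formed edges, E_t \ E_{t−1}; a minimal Python sketch (illustrative names, not from our implementation):

```python
# A temporal graph as an ordered list of snapshots, each a set of
# unordered edges; the prediction target at step t is E_t \ E_{t-1}.
G_snapshots = [
    {frozenset({0, 1})},                                       # G_1
    {frozenset({0, 1}), frozenset({1, 2})},                    # G_2
    {frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2})}  # G_T
]

def new_edges(snapshots, t):
    """Edges present in snapshot t but absent from snapshot t-1."""
    return snapshots[t] - snapshots[t - 1]

target = new_edges(G_snapshots, 2)  # the red edges of Figure 1
```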

B. Background Technologies
We first review the background technologies upon which our approach builds, namely Graph Convolutions [1] and Recurrent Neural Networks [27], [28].

1) Graph Convolutions:
To perform the graph encoding required to create the initial vertex representations, we utilise the spectral Graph Convolutional Network (GCN) [1]. One can consider a GCN to be a differentiable function for aggregating information from the immediate neighbourhood of vertices [29], [30]. A GCN takes the normalised adjacency matrix Â representing a graph G, and a matrix of initial vertex-level features X, and computes a new matrix of vertex-level features H = GCN(Â, X). X can be initialised with pre-computed vertex features, but it is sufficient to initialise with one-hot feature vectors (in which case X is the identity matrix I). A GCN can contain many layers which aggregate the data, where the operation performed at each layer [1] is:

H^{(l)} = \sigma_r\left(\hat{A} H^{(l-1)} W_g^{(l)}\right), (1)

where l is the number of the current layer, W_g^{(l)} denotes the weight matrix of that layer, and H^{(l-1)} refers to the features computed at the previous layer, or is equal to X at l = 0.
One can consider the GCN function to be aggregating a weighted average of the neighbourhood features for each vertex in the graph. Stacking multiple GCN layers has the effect of increasing the number of hops from which a vertex-level representation can aggregate information: a three-layer GCN will aggregate information from three hops within the graph to create each representation.
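As an illustration of the propagation rule, a minimal pure-Python sketch of one GCN layer, assuming the common symmetric normalisation Â = D^{-1/2}(A + I)D^{-1/2} (all names and this exact normalisation are illustrative assumptions, not our implementation):

```python
import math

def matmul(X, Y):
    """Naive dense matrix product for small examples."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def normalise(A):
    """Compute A_hat = D^{-1/2} (A + I) D^{-1/2}, the usual GCN
    normalisation with self-loops added."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    d = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(A_norm, H, W):
    """One GCN layer: H' = ReLU(A_hat H W)."""
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, v) for v in row] for row in Z]
```

Stacking k calls to `gcn_layer` corresponds to aggregating information from k hops, as described above.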
The original method requires GCN based models to be trained in a supervised learning framework, where the final vertex representation is tuned via labels provided for a specific task -classification being common [1], [30]. Extensions to the GCN framework have been made which allow for convolutional auto-encoders for graph datasets [12].
2) Recurrent Neural Networks (RNN): RNNs are neural networks with circular dependencies between neurons. Activations of a recurrent layer are dependent on their own activations from a previous forward pass, and therefore form a type of internal state that can store information across time steps. They are frequently used in sequence processing tasks where the response at one time step should depend in some way on previous observations. Long Short-Term Memory (LSTM) [27] and Gated Recurrent Units (GRU) [28] are RNNs with learned gating mechanisms, which mitigate the vanishing gradient problem when back-propagating errors over a sequence of inputs, allowing the model to learn longer-term dependencies. For this work, we employ the GRU cell, as it empirically offers similar performance to an LSTM, but with fewer overall parameters. The GRU computes the output h_t for the input vector x_t at time t in the following manner [28]:

u_t = \sigma_s(W_u x_t + U_u h_{t-1}),
r_t = \sigma_s(W_r x_t + U_r h_{t-1}),
\tilde{h}_t = \tanh(W_h x_t + U_h(r_t \odot h_{t-1})),
h_t = u_t \odot h_{t-1} + (1 - u_t) \odot \tilde{h}_t,

where r and u are the reset and update gates, ⊙ denotes element-wise multiplication, and σ_s and tanh are the sigmoid and hyperbolic tangent activation functions.
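The gating mechanism can be illustrated for a single scalar time step; this toy sketch uses hypothetical weight names and is not the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step for scalar input/state, following the standard
    update (u) and reset (r) gate formulation.  `p` holds the
    (illustrative) weights W_u, U_u, W_r, U_r, W_h, U_h."""
    u = sigmoid(p["W_u"] * x_t + p["U_u"] * h_prev)        # update gate
    r = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev)        # reset gate
    h_tilde = math.tanh(p["W_h"] * x_t + p["U_h"] * (r * h_prev))
    # Interpolate between the previous state and the candidate state.
    return u * h_prev + (1.0 - u) * h_tilde
```

With all weights at zero, the gates sit at 0.5 and the new state is simply half the old state, which makes the interpolation easy to verify by hand.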

C. Model Overview
We first detail the Temporal Neighbourhood Aggregation blocks which form the primary learning component, before describing the overall model topology and objective function.
1) TNA Block: One of the primary components of our model is the TNA block for topological and temporal learning from graphs. The overall structure of the block is illustrated in Figure 2. It is important to note that all the parameters in the block are shared through time. This allows complex temporal patterns to be learned, as well as allowing for a large reduction in the total number of parameters required by the model. Assuming that the TNA block is the first layer in the model, the flow for vertex v ∈ V_t can be described as follows:

• The input is passed through the GCN layer, as detailed in Equation 1, which will learn to aggregate information for v from its one-hop neighbourhood to create its representation at this point in the block: h_t^{GCN}. This is then normalised using Layer Norm [31], which ensures that the representation for each vertex is of a similar scale; this has been shown to improve the training stability and convergence rate of deep models [31].

• The normalised representation is passed through the GRU cell, whose parameters are shared through time, allowing the block to learn how the one-hop neighbourhood of v has evolved over previous time steps: h_t^{GRU}.

• Finally, a learned linear combination of the topological and temporal representations is taken. Inspired by residual connections often used in computer vision networks [32], this enables the model to learn the optimum mix of topological and temporal information.
The layer-wise propagation rule of the TNA block at depth l can thus be summarised as follows for the entire graph G_t ∈ 𝒢 with normalised adjacency matrix Â_t:

H_t^{GCN(l)} = \mathrm{LayerNorm}\left(\sigma_r\left(\hat{A}_t H_t^{(l-1)} W_g^{(l)}\right)\right),
H_t^{GRU(l)} = \mathrm{GRU}\left(H_t^{GCN(l)}, H_{t-1}^{GRU(l)}\right),
H_t^{(l)} = \sigma_{lr}\left(\left[H_t^{GCN(l)} \| H_t^{GRU(l)}\right] W_s^{(l)}\right),

where W_s^{(l)} represents the weight matrix used to mix the topological and temporal representations, and σ_lr is the leaky ReLU activation function with a negative slope of 0.01.
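One possible reading of the final mixing step can be sketched as follows; the concatenate-then-project rule, the shapes and every name here are assumptions for illustration only, not the authors' code:

```python
def tna_mix(h_gcn, h_gru, w_s):
    """Combine a topological (GCN) vector and a temporal (GRU) vector
    via a learned mixing matrix w_s, then apply a leaky ReLU with
    negative slope 0.01.  The concatenate-then-project rule is an
    assumption about how W_s mixes the two representations."""
    concat = h_gcn + h_gru  # list concatenation, i.e. [h_gcn || h_gru]
    mixed = [sum(w * v for w, v in zip(w_row, concat)) for w_row in w_s]
    # Leaky ReLU, negative slope 0.01.
    return [v if v > 0 else 0.01 * v for v in mixed]
```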
2) Overall Model Architecture: As with normal GCN layers, TNA blocks can be stacked to aggregate information from greater depth within a graph, with each additional block adding one extra hop from which information can be aggregated for a certain vertex. However, as our TNA blocks are recurrent, information can also be aggregated from how connectivity within these hops has evolved over time, instead of just their present state. After extensive ablation studies (detailed in Section V-A), we use the final configuration of the model detailed in Figure 3. Our model contains two stacked TNA blocks, to learn information from two hops within the temporal neighbourhood. This is then passed to two independent GCN layers which perform a final aggregation of this temporal representation. From these two layers, the final representation matrix Z t is sampled using techniques from variational inference, specifically the reparametrisation trick [33].
Variational Sampling - To create the final representation matrix Z_t ∈ R^{|V_t|×d}, the outputs from the two GCN layers GCN_μ and GCN_σ are used to parametrise a Gaussian distribution N, from which Z_t is then sampled, rather than being explicitly computed. This is the same concept used in Variational Auto-Encoders [33], and has previously been demonstrated to work well for creating more robust and meaningful vertex-level representations [12], [34]. Our inference model used to create the vertex representations of graph G_t, with adjacency matrix A_t and identity matrix of A_t, X_t, can thus be described as:

q(Z_t \mid X_t, A_t) = \prod_{i=1}^{|V_t|} q(z_{t,i} \mid X_t, A_t), \quad q(z_{t,i} \mid X_t, A_t) = \mathcal{N}\left(z_{t,i} \mid \mu_{t,i}, \mathrm{diag}(\sigma_{t,i}^2)\right),

where μ_t = GCN_μ(X_t, A_t), log σ_t = GCN_σ(X_t, A_t), and q is our approximation of the true and intractable distribution we are interested in capturing: p(A_{t+1} | Z_t). Here, both GCN_μ and GCN_σ take input from the two stacked TNA layers, as detailed in Figure 3.

Generative Model - To decode the information contained within Z_t, a generative model is created to explicitly predict the new edges appearing in the next graph in the sequence. Here, the inner product between the latent representations is used to directly predict A_{t+1}:

p(A_{t+1} \mid Z_t) = \prod_{i=1}^{|V_t|} \prod_{j=1}^{|V_t|} p(A_{t+1,i,j} \mid z_i, z_j), \quad p(A_{t+1,i,j} = 1 \mid z_i, z_j) = \sigma_s(z_i^{\top} z_j),

where A_{t+1,i,j} represents elements of A_{t+1} and z refers to the rows of each vertex taken from Z_t. This generative model is one of the key advantages of our approach, as it means that we have zero learnable parameters in the decoder portion of the model. This is in contrast to many competing approaches, which often require as many parameters as in the encoder to create a decoder with the desired functionality [19]. This results in our approach being able to scale to significantly larger graphs, with longer histories, than some of the competing approaches, whilst also being less prone to over-fitting to non-changing edges.
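Both the reparametrised sampling and the parameter-free inner-product decoding can be sketched in a few lines of Python (an illustrative sketch, not the authors' PyTorch implementation):

```python
import math
import random

def reparameterise(mu, log_sigma, rng):
    """z = mu + sigma * eps, with eps ~ N(0, 1): the reparametrisation
    trick, which keeps the sampling step differentiable w.r.t. mu and
    log_sigma in an autodiff framework."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]

def decode_edge(z_i, z_j):
    """Parameter-free decoder: p(edge i-j) = sigmoid(z_i . z_j)."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    return 1.0 / (1.0 + math.exp(-dot))
```

Because `decode_edge` has no weights of its own, the decoder adds nothing to the model's parameter count, which is the scaling advantage discussed above.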

D. Objective Function
To train the TNA model, and as is common for variational methods [12], [33], we directly optimise the variational lower bound L with regard to the model parameters:

\mathcal{L} = \mathbb{E}_{q(Z_t \mid X_t, A_t)}\left[\log p(A_{t+1} \mid Z_t)\right] - \mathrm{KL}\left(q(Z_t \mid X_t, A_t) \,\|\, p(Z_t)\right),

where KL(q‖p) is the Kullback-Leibler divergence between q and p. We use a Gaussian prior for p(Z_t).
In addition, we apply L_2 regularisation to our model parameters to help prevent over-fitting, which is defined as:

\mathcal{L}_{reg} = \lambda \sum_{W \in \Theta} \|W\|_2^2,

where λ is a scaling factor, set to 10^{-5}. Consequently, the final objective function for our model is:

\mathcal{L}_{TNA} = \mathcal{L} - \mathcal{L}_{reg},

which is maximised with respect to the model parameters.
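For a diagonal Gaussian posterior against a unit Gaussian prior, the KL term has a well-known closed form; a sketch of it and the L_2 penalty follows (function names are illustrative):

```python
import math

def kl_to_unit_gaussian(mu, log_sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal
    Gaussian, summed over latent dimensions:
    0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )."""
    return sum(0.5 * (math.exp(2 * ls) + m * m - 1.0 - 2 * ls)
               for m, ls in zip(mu, log_sigma))

def l2_penalty(weights, lam=1e-5):
    """L2 regularisation term over a flat list of parameters."""
    return lam * sum(w * w for w in weights)
```

When the posterior exactly matches the prior (mu = 0, sigma = 1), the KL term vanishes, which is a useful sanity check for any implementation.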

E. Model Parameters and Training Procedure
After initial grid searches, we empirically found two layers of Temporal Neighbourhood Aggregation, followed by variational sampling, to yield the optimal performance, with the first layer comprising 32 filters and the second 16 filters. For training the model, we empirically found that using full-batch gradient descent with the RMSProp algorithm, a learning rate of 0.001 and 200 epochs gave the best results. Our model has been implemented in PyTorch [7].

IV. EXPERIMENTAL SETUP
We detail the setup of our experimental evaluation, as well as the baseline approaches and the datasets we use.

A. Evaluation Overview and Methodology
As the primary goal is to create vertex representations which are better at encoding temporal change, we use the task of future link prediction as our primary objective. More formally, we are trying to maximise the probability P(G_t | G_1, ..., G_{t-1}). In the context of machine learning, this can be defined as training a model on a temporal graph 𝒢 using G_1, ..., G_{t-1} such that it can predict the new edges in G_t, i.e. E_t \ E_{t-1}. The full training and evaluation process is detailed in Algorithm 1. Many recent methods attempt to solve this problem via vertex embedding similarity, i.e. vertices with more similar embeddings, according to some metric, are more likely to be connected via an edge [5], [10], [12].
Graph edges are predicted as follows: given the learned vertex embeddings, the future adjacency matrix is reconstructed via the dot product of the embedding matrix, A_{t+1} = σ_s(Z_t Z_t^T). This reconstructed adjacency matrix is compared with the true graph to assess how well the embedding is able to reconstruct the future graph.

Algorithm 1: Training and evaluation procedure.
1: Load and pre-process the graphs G_1, G_2, ..., G_T
2: for each graph G_t in the evaluation range do
3:   Create new model Θ_i (as shown in Figure 3)
4:   Train Θ_i on the sequence G_1, G_2, ..., G_{t-1}, where each graph is the input and used to predict the following one
5:   Predict new edges in G_t using Θ_i(G_{t-1})
6:   Store AUC and AP values
7: end
8: return Mean AUC and AP values over 𝒢

B. Performance Metrics
As one can consider the task of link prediction to be a binary classification problem (an edge can only be present or not), we make use of two standard binary classification metrics: the Area Under the Receiver Operating Characteristic Curve (AUC) and the Average Precision (AP) score.
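Both metrics can be computed directly from ranked edge scores; minimal, unoptimised reference implementations in Python (illustrative, not the evaluation code used here):

```python
def auc_score(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen positive outranks a randomly chosen
    negative, with ties counting half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AP: precision averaged at the rank of each true positive."""
    ranked = [l for l, _ in sorted(zip(labels, scores),
                                   key=lambda t: -t[1])]
    hits, total = 0, 0.0
    for k, label in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / hits
```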

C. Datasets
When performing our experimental evaluation, we employ the empirical datasets presented in Table II. The graphs represent a range of domains, sizes and temporal complexities.
Bitcoin-Alpha (Bitcoina) - Representing a trust network within a platform entitled Bitcoin Alpha, where edges are formed as users interact and rate each other's reputation. The graph covers edges formed between 8th October 2010 and 22nd January 2016, which we partition into 62 monthly snapshots. The task of new edge prediction is thus analogous to predicting whether two users are going to interact within the next month.
Wiki-Vote (Wiki) - Representing votes to escalate user privileges between users and administrators on the Wikipedia website. The graph covers edges formed between 28th March 2004 and 6th January 2008, which we partition into 34 monthly snapshots. The task of new edge prediction within this data is analogous to predicting whether two users are going to vote for each other within the next month.
UCI-Messages (UCI) - Representing private messages sent between users on the University of California Irvine social network platform. The graph covers edges formed between 15th April 2004 and 25th October 2004, which we partition into 27 weekly snapshots. The task of new edge prediction represents the likelihood that two users will exchange messages with each other over the next week.

1) Synthetic Datasets: In addition, we use two synthetic datasets: a Stochastic Block Model (SBM) graph and a randomly perturbed version of the Cora dataset (R-Cora).
SBM -A random graph of 3,000 vertices, which evolves over 30 time points using the SBM algorithm [37]. The graph contains 3 communities and at each time point, 20 vertices will evolve by switching from one community to another.
R-Cora -To create this synthetic dataset, we take the original Cora dataset representing a citation network, and perturb the graph using the random rewire method [38], [39]. The rewiring process alters a given source graph's degree distribution by randomly altering the source and target of a set number of edges. During this rewiring process, it is not guaranteed that the source or target of the edge will be altered, which indeed is not always possible due to the topology of the graph. Also, the rewiring process does not change the total number of edges or vertices within the graph. We employ Erdős rewiring, i.e. the resulting topology of the graph begins to resemble a Erdős-Rényi graph, where the edges are uniformly distributed between vertices.

D. Baseline Approaches
We compare our approach against a variety of state-of-the-art graph representation learning techniques, both static and dynamic. We choose the baselines which compare most directly with our proposed approach, meaning we opt for comparators which take advantage of deep neural networks to create vertex embeddings.
• GAE [12]: A non-probabilistic Graph Convolutional Auto-encoder (GAE), where the model is trained on G_{t-1} and then directly predicts new edges in G_t.

• GVAE [12]: A Graph Variational Convolutional Auto-encoder (GVAE), trained in the same manner as the GAE.

• TO-GAE [34]: A GAE model training procedure which enables temporal offset reconstruction, where the model is trained on G_{t-2} to predict G_{t-1}. G_{t-1} is subsequently used as input and the ability to predict G_t is measured.

• TO-GVAE [34]: A GVAE model trained using the temporal offset reconstruction method.

• DynAE [19]: A non-convolutional graph embedding model, similar to SDNE [17], extended to temporal graphs by concatenating the rows of the past graphs together before being passed into the model.

• DynRNN [19]: A non-convolutional graph embedding model, where stacked LSTM units are used to encode the temporal graph directly. The approach also requires a decoder model, also comprised of stacked LSTM units, to reconstruct the next graph from the embedding.

• DynAERNN [19]: A combination of the previous two models, where a dense auto-encoder is used to learn a compressed representation which is passed to stacked LSTM units for temporal learning. It requires a large decoder, with both dense and LSTM layers, to predict the next graph. The E-LSTM-D approach [41] is also extremely similar to this model.

• D-GCN [20], [21]: A dynamic GCN, similar to the approaches proposed in [20] and [21]. Here, three stacked GCN layers are used to capture structural information, with an LSTM unit used to learn temporal information and produce the final embeddings. To directly predict the next graph, we use an inner-product decoder on the embedding matrix.

We attempted to compare with GCN-GAN [22] and GC-LSTM [23], but we were unable to get them to scale to the size of graphs we are using for our experimentation.

V. RESULTS
We evaluate our TNA approach using comparisons against state-of-the-art approaches and ablation studies using well-established datasets (Section IV-C).

A. Ablation Study
One of the major contributions of our work is highlighting how each component of our TNA model is crucial in producing good temporal embeddings. To highlight this, Table IV shows how adding components of the model sequentially affects the performance of predicting new edges in the final graph of the Bitcoina dataset. It is important to note that adding temporal information from both the first and second hop neighbourhood (Model TTV) lifts both AUC and AP scores by approximately 10% versus just first-hop temporal information (Model TGV). This supports our hypothesis that a vertex requires temporal information from more than just its first-order neighbourhood in order to predict future edges. The ablation study also demonstrates that, with a modest increase in the number of parameters, the temporal models are able to exploit the rich information available in the graph's past evolution to much more accurately predict future edges. In Table IV, |Θ| denotes the total number of learnable parameters in the model.

B. Next Graph Link Prediction
As the main focus of our model, we present results for predicting new edges in the next temporal graph, using the procedure detailed in Algorithm 1, in Table III. The table shows that TNA significantly outperforms the baseline approaches when predicting new edges in the next graph at all points along the time series. Compared with the Dyn* family of approaches, it is striking to note the significant number of parameters required by the models (often well over an order of magnitude more) and their poor performance in predicting new edges. We believe it is highly likely that this family of models is using the extra parameters to over-fit to the edges that do not change over time, resulting in poor predictive capability for the ones that do. It is also interesting to note that, compared with the D-GCN approach, TNA is better able to capture the dependencies needed for good long-term prediction. For two datasets, our model improves as the amount of past graph evolution data it has to learn from grows. This is demonstrated by the increasing AUC and AP scores for the Bitcoina and UCI datasets. However, all approaches struggle on the synthetic datasets due to their inherently random nature, as seen in Table V.

C. Full Graph Reconstruction
To measure the ability of the representations learned by the TNA model to serve as general purpose embeddings, we consider the problem of future graph reconstruction. Here, we measure the performance of the model at predicting the presence of edges in the full graph G_t (given G_1..G_{t-1}), highlighting that we do not sacrifice performance at predicting existing edges. This allows us to investigate the ability of the model to predict not only new edges, but also that existing edges have not been removed. As before, a new model is trained to predict the final graph in the sequence given all previous time points, with the final results presented as the mean over all graphs in the sequence. However, instead of predicting edges which have appeared since the last time point, here the results are for a balanced set of randomly sampled positive and negative edges in E_t, which may or may not include edges formed since the previous time point.
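The balanced sampling described above can be sketched as follows; the helper name and interface are hypothetical, but the logic (one random non-edge drawn for every true edge in E_t, rejecting self-loops and existing edges) matches the evaluation described:

```python
import numpy as np

def balanced_edge_sample(edges_t, n_vertices, rng):
    """Return the true edges of G_t together with an equal number of
    randomly sampled non-edges, for full-graph reconstruction scoring."""
    edge_set = {tuple(e) for e in edges_t}
    negatives = []
    while len(negatives) < len(edge_set):
        u, v = rng.integers(0, n_vertices, size=2)
        # reject self-loops and pairs that are already edges
        if u != v and (u, v) not in edge_set:
            negatives.append((u, v))
    return np.asarray(list(edge_set)), np.asarray(negatives)
```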
The results for this experiment are presented in Table VI, where for the sake of brevity we compare with only the temporal baselines. It is clear that many of the baselines, especially the Dyn* family of approaches, perform much better at predicting existing edges than new ones. This further suggests that they are utilising their larger set of parameters to, in some way, over-fit to edges which have been in the graph for a longer length of time, and which form the vast majority. However, despite this, our TNA approach still performs well at this task, displaying comparable performance with the baseline approaches and even outperforming them on the Wiki dataset. This further strengthens the argument that having recurrence at each hop in the neighbourhood aggregation produces a better representation, whilst requiring fewer parameters.

D. Future Graph Evolution
For our final experiment, we investigate how TNA performs when predicting new edges further into the future than the next graph. We train the models on 70% of the available temporal history, then predict new edges and compare with the remaining ground truth data. To achieve this, we feed the graph predicted by the model back in as the next graph in the sequence, which is then used to predict the graph after that. This is similar to using RNNs as generative models to produce text data [42] and can be seen as a combination of the two previous tasks. Figure 4 displays the results for this task, where we compare with the closest baseline from Section V-B. The results show how TNA is better able to predict new edges into the future, emphasising its capability to learn a good temporal representation for the vertices.
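The feedback loop described here can be sketched as a simple autoregressive rollout, where `model` is any callable mapping a graph sequence to predicted edge probabilities for the next time step (an assumed interface, not the exact TNA API):

```python
import numpy as np

def rollout(model, history, n_steps, threshold=0.5):
    """Predict n_steps graphs into the future by feeding each predicted
    graph back into the model as the newest element of the sequence."""
    graphs = list(history)          # copy so the true history is untouched
    predictions = []
    for _ in range(n_steps):
        probs = model(graphs)                       # edge probabilities for t+1
        a_next = (probs > threshold).astype(float)  # binarise into an adjacency matrix
        predictions.append(a_next)
        graphs.append(a_next)                       # predicted graph re-enters the input
    return predictions
```

As with RNN text generation, errors compound over successive steps, which is why performance is expected to degrade the further into the future the model predicts.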

VI. CONCLUSION
Many real-world graph datasets have rich and complex temporal information available which is disregarded by the majority of current approaches for creating vertex representations. In this paper, we have introduced the Temporal Neighbourhood Aggregation model for representation learning on large, complex temporal graphs. Our approach demonstrates excellent performance through extensive experimental evaluation, beating several competing temporal and static models when predicting future edges not seen in the training data. The TNA model can learn complex temporal patterns present at multiple depths within a vertex's neighbourhood, creating the final vertex representation via the use of variational sampling.
For future work, we will investigate replacing the GCN in our model with an approach designed for inductive learning [30] to allow for training on even larger graph datasets, as well as enabling vertex arrival to be modelled. We also plan to experiment with using the learned representations for additional tasks, such as temporal classification.