# lanczosnet_multiscale_deep_graph_convolutional_networks__6787d6c0.pdf

Published as a conference paper at ICLR 2019

LANCZOSNET: MULTI-SCALE DEEP GRAPH CONVOLUTIONAL NETWORKS

Renjie Liao1,2,3, Zhizhen Zhao4, Raquel Urtasun1,2,3, Richard S. Zemel1,3,5
University of Toronto1, Uber ATG Toronto2, Vector Institute3, University of Illinois at Urbana-Champaign4, Canadian Institute for Advanced Research5
{rjliao, urtasun, zemel}@cs.toronto.edu, zhizhenz@illinois.edu

We propose the Lanczos network (LanczosNet), which uses the Lanczos algorithm to construct low rank approximations of the graph Laplacian for graph convolution. Relying on the tridiagonal decomposition of the Lanczos algorithm, we not only efficiently exploit multi-scale information via fast approximated computation of matrix powers but also design learnable spectral filters. Being fully differentiable, LanczosNet facilitates both graph kernel learning and learning node embeddings. We show the connection between our LanczosNet and graph based manifold learning methods, especially diffusion maps. We benchmark our model against several recent deep graph networks on citation networks and the QM8 quantum chemistry dataset. Experimental results show that our model achieves state-of-the-art performance on most tasks.

1 INTRODUCTION

Graph-structured data is ubiquitous in real world applications, such as social networks, gene expression regulatory networks, protein-protein interactions, and many other physical systems. How to model such data using machine learning, especially deep learning, has become a central research question [1]. For supervised and semi-supervised tasks such as graph or node classification and regression, learning based models can be roughly categorized into two classes, formulated either in terms of graph convolutions [2] or recurrent neural networks [3]. Methods based on recurrent neural networks (RNN), especially graph neural networks (GNN) [3], repeatedly unroll a message passing process over the graph by exchanging information between the nodes. In theory, a GNN can have as large a model capacity as its convolutional counterpart. However, due to the instability of RNN dynamics and the difficulty of optimization, GNNs and their variants are generally slower and harder to train. In this paper we focus on graph convolution based methods. Built on top of graph signal processing (GSP) approaches [4], these methods extend convolution operators to graphs by leveraging spectral graph theory, graph wavelet theory, etc. Graph convolutions can be stacked and combined with nonlinear activation functions to build deep models, just as in regular convolutional neural networks (CNN). They often have large model capacity and achieve promising results. Also, graph convolution can be easily implemented with modern scientific computing libraries.

There are two main issues with current graph convolution approaches. First, it is not clear how to efficiently leverage multi-scale information except by directly stacking multiple layers. Having an effective multi-scale scheme is key for enabling the model to be invariant to scale changes and to capture many intrinsic regularities [5, 6]. Graph coarsening methods have been proposed to form a hierarchy of multi-scale graphs [7], but this coarsening process is fixed during both inference and learning, which may cause some bias. Alternatively, the graph signal can be multiplied by the exponentiated graph Laplacian, where the exponent indicates the scale of the diffusion process on the graph [8].
Unfortunately, the computation and memory cost increases linearly with the exponent, which prohibits the exploitation of long scale diffusion in practice. Other fast methods for computing matrix powers, such as exponentiation by squaring, are very memory intensive, even for moderately large graphs. Second, the spectral filters within current graph convolution based models are mostly fixed. In the context of image processing, using a Gaussian kernel along with a spectral filter $f(\lambda) = 2\lambda - \lambda^2$ corresponds to running the heat equation forward (blurring) followed by running it backwards (sharpening) [9]. The multi-scale kernels introduced in [10] extend the idea of a forward-backward diffusion process and can be represented as polynomials of matrices related to a Gaussian kernel. Learning the spectral filters is thus beneficial since it learns the stochastic processes on the graph which produce useful representations for particular tasks. However, how to learn spectral filters with large model capacity is largely underexplored.

In this paper, we propose the Lanczos network (LanczosNet) to overcome the aforementioned issues. First, based on the tridiagonal decomposition implied by the Lanczos algorithm, our model exploits a low rank approximation of the graph Laplacian. This approximation facilitates efficient computation of matrix powers, thus gathering multi-scale information easily. Second, we design learnable spectral filters based on the approximation which effectively increase model capacity. In scenarios where one wants to learn the graph kernel and/or node embeddings, we propose another variant, the adaptive Lanczos network (AdaLanczosNet), which back-propagates through the Lanczos algorithm. We show that our proposed model is closely related to graph based manifold learning approaches such as diffusion maps, which could potentially inspire more work at the intersection of deep graph networks and manifold learning. We benchmark against 9 recent deep graph networks, including both convolutional and RNN based methods, on citation networks and a quantum chemistry graph regression problem, and achieve state-of-the-art results in most tasks.

2 BACKGROUND

In this section, we introduce some background material. A graph $G$ with $N$ nodes is denoted as $G = (V, E, A)$, where $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix which could be either binary or real valued. $X \in \mathbb{R}^{N \times F}$ is the compact representation of node features (or the graph signal in the GSP literature). For any node $v \in V$, we denote its feature as a row vector $X_{v,:} \in \mathbb{R}^{1 \times F}$. We use $X_{:,i}$ to denote the $i$-th column of $X$.

**Graph Fourier Transform** Given input node features $X$, we now discuss how to perform a graph convolution. Based on the adjacency matrix $A$, we compute the graph Laplacian $L$, which can be defined in different ways: (1) $L = D - A$; (2) $L = I - D^{-1}A$; (3) $L = I - D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal degree matrix with $D_{i,i} = \sum_{j=1}^{N} A_{i,j}$. Definition (3) is often used in the GSP literature because it is real symmetric, positive semi-definite (PSD) and has eigenvalues lying in $[0, 2]$. In certain applications [11], it was found that adding self-loops, i.e., changing $A$ to $A + I$, and using the affinity matrix $S = D^{-1/2} A D^{-1/2}$ instead of $L$ gives better results. Since $S$ is real symmetric, by the spectral decomposition we have $S = U \Lambda U^{\top}$, where $U$ is an orthogonal matrix whose column vectors are the eigenvectors of $S$. The diagonal matrix $\Lambda$ contains the sorted eigenvalues, where $\Lambda_{i,i} = \lambda_i$ and $1 \ge \lambda_1 \ge \cdots \ge \lambda_N \ge -1$.
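To make the notation concrete, the following minimal NumPy sketch builds the affinity matrix $S$ for a small dense graph and computes its eigendecomposition; the function name and the self-loop flag are our own illustrative choices, not the paper's released code, and the full eigendecomposition appears here only to connect with the graph Fourier transform (LanczosNet itself avoids it).

```python
import numpy as np

def affinity_matrix(A, self_loops=True):
    """Build S = D^{-1/2} A D^{-1/2}, optionally after adding self-loops (A + I)."""
    if self_loops:
        A = A + np.eye(A.shape[0])
    d = A.sum(axis=1)                           # degree vector, D_ii = sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# The eigenvalues of S lie in [-1, 1]; the columns of U define the graph Fourier basis.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
S = affinity_matrix(A)
eigvals, U = np.linalg.eigh(S)                  # S = U diag(eigvals) U^T
```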
Based on this eigenbasis, we can define the graph Fourier transform $Y = U^{\top} X$ and its inverse transform $\hat{X} = U Y$ following [12]. Note that $L = I - D^{-1/2} A D^{-1/2}$ shares the same eigenvectors with $S = D^{-1/2} A D^{-1/2}$, and the eigenvalues of $L$ are $\mu_i = 1 - \lambda_i$. Therefore, $L$ and $S$ share the same graph Fourier transform, which justifies the usage of $S$. Different forms of filters can be further constructed in the spectral domain.

**Localized Polynomial Filter** A $\tau$-localized polynomial filter is typically adopted in the GSP literature [12], $g_w(\Lambda) = \sum_{t=0}^{\tau-1} w_t \Lambda^t$, where $w = [w_0, w_1, \dots, w_{\tau-1}] \in \mathbb{R}^{\tau \times 1}$ is the filter coefficient, i.e., the learnable parameter. The filter is $\tau$-localized in the sense that the filtering leverages information from nodes which are at most $\tau$ hops away. One prominent example of this class is the Chebyshev polynomial filter [7]. Here the graph Laplacian is modified to $\tilde{L} = 2L/\lambda_{\max} - I$ such that its eigenvalues fall into $[-1, 1]$. Then the Chebyshev polynomial recursion is applied: $X^{(t)} = 2 \tilde{L} X^{(t-1)} - X^{(t-2)}$, where $X^{(0)} = X$ and $X^{(1)} = \tilde{L} X$. For a pair of input and output channels $(i, j)$, the final filtering becomes $y_{i,j} = [X^{(0)}_{:,i}, \dots, X^{(\tau-1)}_{:,i}] w_{i,j}$, where $[\cdot]$ means concatenation along columns and $w_{i,j} \in \mathbb{R}^{\tau \times 1}$. Chebyshev polynomials provide two benefits: they form an orthogonal basis of $L^2([-1, 1], dy/\sqrt{1 - y^2})$, and one avoids the spectral decomposition of $\tilde{L}$ in the filtering. However, the functional form of the spectral filter is not learnable and cannot adapt to the data.

In this paper, instead of using the modified graph Laplacian $\tilde{L}$, we use the aforementioned $S$. Therefore, we can write the localized polynomial filtering in a more general form as

$$Y = \sum_{t=0}^{\tau-1} g_t(X, S, \dots, S^t) \, W_t, \qquad (1)$$

where $g_t$ is a function that takes node features $X$ and powers of the affinity matrix up to the $t$-th order as input and outputs an $N \times F$ matrix, $W_t \in \mathbb{R}^{F \times O}$ is the corresponding filter coefficient, and $Y \in \mathbb{R}^{N \times O}$ is the output. One can easily verify that in the Chebyshev polynomial filter, any $i$-th column of the corresponding $g_t(X, S, \dots, S^t)$ lies in the Krylov subspace $\mathcal{K}_{t+1}(S, X_{:,i}) \triangleq \mathrm{span}\{X_{:,i}, S X_{:,i}, \dots, S^t X_{:,i}\}$. This naturally motivates the usage of Krylov subspace methods, like the Lanczos algorithm [13], since it provides an orthonormal basis for the above Krylov subspace, thus making the filter coefficients compact.

3 LANCZOS NETWORKS

In this section, we first introduce the Lanczos algorithm, which approximates the affinity matrix $S$. We present our first model, called the Lanczos network (LanczosNet), in which we execute the Lanczos algorithm once per graph and fix the basis throughout inference and learning. Then we introduce the adaptive Lanczos network (AdaLanczosNet), in which we learn the graph kernel and/or node embeddings by back-propagating through the Lanczos algorithm.

Algorithm 1: Lanczos Algorithm
1: Input: $S$, $x$, $K$, $\epsilon$
2: Initialization: $\beta_0 = 0$, $q_0 = 0$, and $q_1 = x / \|x\|$
3: For $j = 1, 2, \dots, K$:
4:   $z = S q_j$
5:   $\gamma_j = q_j^{\top} z$
6:   $z = z - \gamma_j q_j - \beta_{j-1} q_{j-1}$
7:   $\beta_j = \|z\|_2$
8:   If $\beta_j < \epsilon$, quit
9:   $q_{j+1} = z / \beta_j$
10: End for
11: $Q = [q_1, q_2, \dots, q_K]$
12: Construct $T$ following Eq. (2)
13: Eigendecomposition $T = B R B^{\top}$
14: Return $V = Q B$ and $R$.

Algorithm 2: LanczosNet
1: Input: signal $X$, Lanczos outputs $V$ and $R$, scale index sets $\mathcal{S}$ and $\mathcal{I}$
2: Initialization: $Y_0 = X$
3: For $\ell = 1, 2, \dots, \ell_c$:
4:   $Z = Y_{\ell-1}$, $\mathcal{Z} = \{\}$
5:   For $j = 1, 2, \dots, \max(\mathcal{S})$:
6:     $Z = S Z$
7:     If $j \in \mathcal{S}$:
8:       $\mathcal{Z} = \mathcal{Z} \cup \{Z\}$
9:   For $i \in \mathcal{I}$:
10:     $\mathcal{Z} = \mathcal{Z} \cup \{V \hat{R}(\mathcal{I}_i) V^{\top} Y_{\ell-1}\}$
11:   $Y_{\ell} = \mathrm{concat}(\mathcal{Z}) W_{\ell}$
12:   If $\ell < \ell_c$:
13:     $Y_{\ell} = \mathrm{Dropout}(\sigma(Y_{\ell}))$
14: Return $Y_{\ell_c}$.
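For reference, the NumPy sketch below mirrors Alg. 1; the naming is ours, no re-orthogonalization is performed (cf. Sec. 8.2 of the appendix), and the routine simply returns $V = QB$ and the Ritz values so that $S \approx V \,\mathrm{diag}(R)\, V^{\top}$.

```python
import numpy as np

def lanczos(S, x, K, eps=1e-8):
    """K-step Lanczos iteration on a symmetric matrix S with start vector x.

    Returns V = Q B and the Ritz values r such that S is approximately
    V @ np.diag(r) @ V.T (Alg. 1 in the text; no re-orthogonalization)."""
    N = S.shape[0]
    Q = np.zeros((N, K))
    gamma, beta = np.zeros(K), np.zeros(K)
    q_prev, q, beta_prev = np.zeros(N), x / np.linalg.norm(x), 0.0
    for j in range(K):
        Q[:, j] = q
        z = S @ q                                   # line 4: the only matrix-vector product
        gamma[j] = q @ z                            # line 5
        z = z - gamma[j] * q - beta_prev * q_prev   # line 6
        beta[j] = np.linalg.norm(z)                 # line 7
        if beta[j] < eps:                           # line 8: early termination
            Q, gamma, beta = Q[:, :j + 1], gamma[:j + 1], beta[:j + 1]
            break
        q_prev, q, beta_prev = q, z / beta[j], beta[j]
    # Tridiagonal T of Eq. (2): gamma on the diagonal, beta on the off-diagonals.
    T = np.diag(gamma) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    ritz, B = np.linalg.eigh(T)                     # line 13: T = B diag(ritz) B^T
    return Q @ B, ritz                              # V = Q B and the Ritz values
```

For LanczosNet this routine is run once per graph with a random unit-norm starting vector and its outputs are cached; AdaLanczosNet would instead re-run it, in a differentiable framework, at every forward pass.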
3.1 LANCZOS ALGORITHM

Given the aforementioned affinity matrix $S$ (for a non-symmetric matrix, one can resort to the Arnoldi algorithm) and node features $x \in \mathbb{R}^{N \times 1}$, the $N$-step Lanczos algorithm computes an orthogonal matrix $Q$ and a symmetric tridiagonal matrix $T$ such that $Q^{\top} S Q = T$. We denote $Q = [q_1, \dots, q_N]$, where the column vector $q_i$ is the $i$-th Lanczos vector. $T$ is illustrated below:

$$T = \begin{bmatrix} \gamma_1 & \beta_1 & & \\ \beta_1 & \gamma_2 & \ddots & \\ & \ddots & \ddots & \beta_{N-1} \\ & & \beta_{N-1} & \gamma_N \end{bmatrix}. \qquad (2)$$

One can verify that $Q$ forms an orthonormal basis of the Krylov subspace $\mathcal{K}_N(S, x)$, and the first $K$ columns of $Q$ form an orthonormal basis of $\mathcal{K}_K(S, x)$. The Lanczos algorithm is shown in detail in Alg. 1. Intuitively, if we investigate the $j$-th column of the system $S Q = Q T$ and rearrange terms, we obtain $\beta_j q_{j+1} = S q_j - \beta_{j-1} q_{j-1} - \gamma_j q_j$, which clearly explains lines 4 to 6 of the pseudocode, i.e., it solves the system in an iterative manner. Note that the most expensive operation in the algorithm is the matrix-vector multiplication in line 4. After obtaining the tridiagonal matrix $T$, we can compute the Ritz values and Ritz vectors, which approximate the eigenvalues and eigenvectors of $S$, by diagonalizing $T$. We only add this step in LanczosNet, as we found that back-propagating through the eigendecomposition in AdaLanczosNet is not numerically stable.

3.2 LANCZOSNET

In this section, we first show the construction of the localized polynomial filter based on the Lanczos algorithm's output and discuss its limitations. Then we explain how to construct the spectral filter using a particular low rank approximation and how to further make the filter learnable. Finally, we elaborate on how to construct a multi-scale graph convolution and build a deep network.

**Localized Polynomial Filter** For ease of demonstrating the concept of the Krylov subspace, we consider a pair of input and output channels $(i, j)$. We denote the input as $X_{:,i} \in \mathbb{R}^{N \times 1}$ and the output as $Y_{:,j} \in \mathbb{R}^{N \times 1}$. Executing the Lanczos algorithm for $K$ steps with the normalized $X_{:,i}$ as the starting vector, one can obtain the orthonormal basis $Q$ of $\mathcal{K}_K(S, X_{:,i})$ and the corresponding tridiagonal matrix $T$. Recall that in the localized polynomial filtering, given the orthonormal basis of $\mathcal{K}_K(S, X_{:,i})$, one can write the graph convolution as

$$Y_{:,j} = Q w_{i,j}, \qquad (3)$$

where $Q \in \mathbb{R}^{N \times K}$ depends on $X_{:,i}$ and $w_{i,j} \in \mathbb{R}^{K \times 1}$ is the learnable parameter. This filter has the benefit that the corresponding learnable coefficients are compact thanks to the orthonormal basis. However, if one wants to stack multiple graph convolution layers, the dependency of $Q$ on $X_{:,i}$ implies that a separate run of the Lanczos algorithm is necessary for each graph convolution layer, which is computationally demanding.

**Spectral Filter** Ideally, we would like to compute Lanczos vectors only once during the inference of a deep graph convolutional network. Luckily, this can be achieved if we take an alternative view of the Lanczos algorithm. In particular, we can choose a random starting vector with unit norm and treat the $K$-step Lanczos output as the low rank approximation $S \approx Q T Q^{\top}$. Note that here $Q \in \mathbb{R}^{N \times K}$ has orthonormal columns and does not depend on the node features $X_{:,i}$, and $T$ is a $K \times K$ tridiagonal matrix. Following [14], we prove the theorem below to bound the approximation error.

Theorem 1. Let $U \Lambda U^{\top}$ be the eigendecomposition of an $N \times N$ symmetric matrix $S$ with $\Lambda_{i,i} = \lambda_i$, $\lambda_1 \ge \cdots \ge \lambda_N$ and $U = [u_1, \dots, u_N]$. Let $\mathcal{U}_j \triangleq \mathrm{span}\{u_1, \dots, u_j\}$. Assume the $K$-step Lanczos algorithm starts with vector $v$ and outputs the orthogonal $Q \in \mathbb{R}^{N \times K}$ and tridiagonal $T \in \mathbb{R}^{K \times K}$.
For any $j$ with $1 < j < N$ and $K > j$, we have

$$\| S - Q T Q^{\top} \|_F \le \sqrt{ \sum_{i=1}^{j} \left( \frac{ \sin\angle(v, \mathcal{U}_i) \prod_{k=1}^{i-1} (\lambda_k - \lambda_N)/(\lambda_k - \lambda_i) }{ \cos\angle(v, u_i) \, T_{K-i}(1 + 2\gamma_i) } \right)^{2} \lambda_i^{2} + \sum_{i=j+1}^{N} \lambda_i^{2} },$$

where $T_{K-i}(x)$ is the Chebyshev polynomial of degree $K - i$ and $\gamma_i = (\lambda_i - \lambda_{i+1})/(\lambda_{i+1} - \lambda_N)$.

We leave the proof to the appendix. Note that the term $(\sum_{i=j+1}^{N} \lambda_i^2)^{1/2}$ is the Frobenius norm of the error between $S$ and the best rank-$j$ approximation $S_j$. We decompose the tridiagonal matrix $T = B R B^{\top}$, where the $K \times K$ diagonal matrix $R$ contains the Ritz values and $B \in \mathbb{R}^{K \times K}$ is an orthogonal matrix. We thus have a low rank approximation of the affinity matrix $S \approx V R V^{\top}$, where $V = Q B$. Therefore, we can rewrite the graph convolution as

$$Y_{:,j} = [X_{:,i}, S X_{:,i}, \dots, S^{K-1} X_{:,i}] \, w_{i,j} \approx [X_{:,i}, V R V^{\top} X_{:,i}, \dots, V R^{K-1} V^{\top} X_{:,i}] \, w_{i,j}. \qquad (4)$$

The difference between Eq. (3) and Eq. (4) is that the former uses the orthonormal basis while the latter uses an approximation of the direct basis of $\mathcal{K}_K(S, X_{:,i})$. Since we explicitly operate on an approximation of the spectrum, i.e., the Ritz values, it is a spectral filter. Such a filtering form has significant computational benefits when considering long range/scale dependencies, due to the fact that the $t$-th power of $S$ can be approximated as $S^t \approx V R^t V^{\top}$, where we only need to raise the diagonal entries of $R$ to the power $t$.

**Learning the Spectral Filter** Following the previous filter, one can naturally design learnable spectral filters. Specifically, we use $K$ different spectral filters, of which the $k$-th output is $\hat{R}(k) = f_k([R, R^1, \dots, R^{K-1}])$, where $f_k$ is a multi-layer perceptron (MLP) and $R$ here denotes the diagonal vector of the corresponding diagonal matrix. We then construct a diagonal matrix $\hat{R}(k)$ based on the vector output of $f_k$. Therefore, we have the following filtering:

$$Y_{:,j} = [X_{:,i}, V \hat{R}(1) V^{\top} X_{:,i}, \dots, V \hat{R}(K-1) V^{\top} X_{:,i}] \, w_{i,j}. \qquad (5)$$

Note that it includes the polynomial filter as a special case. When positive semi-definiteness is a concern, one can apply an activation function like ReLU to the output of the MLPs.

**Multi-scale Graph Convolution** Using any of the above filters, one can construct a deep graph convolutional network which leverages multi-scale information. Taking the learnable spectral filter as an example, we can write one graph convolution layer compactly as

$$Y = \left[ S^{\mathcal{S}_1} X, \dots, S^{\mathcal{S}_M} X, \; V \hat{R}(\mathcal{I}_1) V^{\top} X, \dots, V \hat{R}(\mathcal{I}_E) V^{\top} X \right] W, \qquad (6)$$

where the weight $W \in \mathbb{R}^{(M+E)D \times O}$, $\mathcal{S}$ is a set of $M$ short scale parameters and $\mathcal{I}$ is a set of $E$ long scale parameters. We consider non-negative integers as scale parameters, e.g., $\mathcal{S} = \{0, 1, \dots, 5\}$, $\mathcal{I} = \{10, 20, \dots, 50\}$. Note that the convolution corresponding to short scales is similar to [8], where the number of matrix-vector multiplications is tied to the maximum scale in $\mathcal{S}$. In contrast, the convolution of long scales decouples the Lanczos step $K$ and the scale parameters $\mathcal{I}$, thus permitting great freedom in tuning scales as hyperparameters. One can choose $K$ properly to balance the computation cost and the accuracy of the low rank approximation. In our experiments, short scales are typically less than 10, which keeps the computation cost reasonable. Moreover, the short scale part can sometimes remedy cases where the low rank approximation is crude. We set the long scales to be no larger than 100 in our experiments. If the maximum eigenvalue of $S$ is 1, we can even raise the power to infinity, which corresponds to the equilibrium state of the diffusion process on the graph. To build a deep network, we can stack multiple such graph convolution layers where each layer has its own spectral filter weights.
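A possible PyTorch rendering of one such layer is sketched below; the class and argument names are ours, and, as a simplification, each learnable filter here acts on $R^t$ alone rather than on the full list of powers $[R, R^1, \dots, R^{K-1}]$ used in Eq. (5).

```python
import torch
import torch.nn as nn

class LanczosConvLayer(nn.Module):
    """One multi-scale graph convolution in the spirit of Eq. (6) (illustrative sketch).

    short_scales: small powers t applied directly as S^t X.
    long_scales:  large powers t handled through the low rank form S^t ~= V R^t V^T.
    """
    def __init__(self, in_dim, out_dim, short_scales, long_scales, lanczos_k, hidden=128):
        super().__init__()
        self.short_scales = list(short_scales)
        self.long_scales = list(long_scales)
        n_terms = len(self.short_scales) + len(self.long_scales)
        self.weight = nn.Linear(n_terms * in_dim, out_dim, bias=False)
        # One small MLP per long scale, mapping the K Ritz values to a filtered spectrum.
        self.filters = nn.ModuleList(
            nn.Sequential(nn.Linear(lanczos_k, hidden), nn.ReLU(), nn.Linear(hidden, lanczos_k))
            for _ in self.long_scales)

    def forward(self, X, S, V, ritz):
        feats = []
        if 0 in self.short_scales:              # S^0 X = X
            feats.append(X)
        Z = X
        for t in range(1, max(self.short_scales, default=0) + 1):
            Z = S @ Z                           # reuse the previous power to get S^t X
            if t in self.short_scales:
                feats.append(Z)
        proj = V.t() @ X                        # project once onto the Lanczos basis
        for f, t in zip(self.filters, self.long_scales):
            r_hat = f(ritz.pow(t))              # learnable spectral filter applied to R^t
            feats.append(V @ (r_hat.unsqueeze(-1) * proj))
        return self.weight(torch.cat(feats, dim=-1))
```

Stacking several such layers with nonlinearities and dropout in between, plus a task-specific head on top, recovers the inference procedure of Alg. 2.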
Nonlinear activation functions, e.g., ReLU, and/or Dropout can be added between layers. The inference algorithm of such a deep network is shown in Alg. 2. With the top layer representation, one can use a softmax to perform classification or a fully connected layer to perform regression. The Lanczos algorithm is run beforehand once per graph to construct the network and is not invoked during inference and learning.

3.3 ADALANCZOSNET

In this section, we explain another variant which back-propagates through the Lanczos algorithm. This facilitates learning the graph kernel and/or node embeddings.

**Graph Kernel** Assume we are given node features $X$ and a graph $G$. We are interested in learning a graph kernel function with the hope that it can capture the intrinsic geometry of node representations. Given data points $x_i, x_j \in X$, we define the anisotropic graph kernel $k : X \times X \mapsto \mathbb{R}$ as

$$k(x_i, x_j) = \exp\!\left( - \frac{\| f_{\theta}(x_i) - f_{\theta}(x_j) \|^2}{\epsilon} \right), \qquad (7)$$

where $f_{\theta}$ is an MLP. This class of anisotropic kernels is very expressive and includes the self-tuning kernel [15] and the Gaussian kernel with Mahalanobis distances [16]. Moreover, for different kernel functions, the resulting graph Laplacians converge to different limiting operators asymptotically. For example, even for isotropic Gaussian kernels, the graph Laplacian can converge pointwise to the Laplace-Beltrami operator, the Fokker-Planck operator, or the heat kernel under different normalizations [17, 18]. In practice, we notice that choosing $\epsilon = \sum_{(p,q) \in E} \| f_{\theta}(x_p) - f_{\theta}(x_q) \|^2 / |E|$ helps normalize the pairwise distances, thus avoiding the vanishing gradient issue caused by the exponential function. This type of learnable anisotropic diffusion is useful in two ways. First, it increases model capacity, thus potentially gaining better performance. Second, it can better adapt to the non-uniform density of the data points on the manifold, or to nonlinear measurements of the underlying data points on a manifold. We can construct an adjacency matrix $A$ such that $A_{i,j} = k(x_i, x_j)$ if $(i, j) \in E$ and $A_{i,j} = 0$ otherwise. Then we can obtain the affinity matrix $S = D^{-1/2} A D^{-1/2}$.

**Node Embedding** In some applications, we do not observe the node features $X$ but only the graph $G$ itself, so we may need to learn an embedding vector per node. For example, this scenario applies to the quantum chemistry tasks, where a node, i.e., an atom within a molecule, rarely has observed features. We can still use the above graph kernel to construct the affinity matrix, which results in the same form except that $f_{\theta}$ is discarded. Learning the embedding $X$ naturally amounts to learning the similarities between nodes.

**Tridiagonal Decomposition** Although all operations in LanczosNet are differentiable, we empirically observe that back-propagation through the eigendecomposition of the tridiagonal matrix is numerically unstable. The situation would be even worse if multiple eigenvalues are numerically close or one takes a large power in Eq. (6). Therefore, we instead directly leverage the approximate tridiagonal decomposition $S \approx Q T Q^{\top}$, which is obtained by running the Lanczos algorithm for $K$ steps. Then we can rewrite the graph convolution layer with learnable spectral filters as

$$Y = \left[ S^{\mathcal{S}_1} X, \dots, S^{\mathcal{S}_M} X, \; Q f_1(T^{\mathcal{I}_1}) Q^{\top} X, \dots, Q f_E(T^{\mathcal{I}_E}) Q^{\top} X \right] W, \qquad (8)$$

where $f_i$ is a learnable spectral filter. Each $f$ is constructed from a separate MLP, denoted $g$, which takes $T \in \mathbb{R}^{K \times K}$ as input and outputs a matrix of the same size. To ensure $f$ outputs a symmetric matrix, we define $f(T) = g(T) + g(T)^{\top}$.
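As an illustration of Eq. (7), the snippet below builds a differentiable affinity matrix from a learnable kernel in PyTorch; the function name, the small numerical constants, and the symmetrization step are our own assumptions rather than details specified in the paper.

```python
import torch

def anisotropic_affinity(X, edge_index, f_theta):
    """Differentiable affinity matrix S built from the kernel of Eq. (7) (sketch).

    X:          (N, F) node features or learnable node embeddings.
    edge_index: (2, |E|) long tensor listing the graph's edges.
    f_theta:    an MLP; gradients flow back into its parameters and into X.
    """
    H = f_theta(X)
    src, dst = edge_index
    sq_dist = (H[src] - H[dst]).pow(2).sum(dim=-1)   # ||f(x_p) - f(x_q)||^2 per edge
    eps = sq_dist.mean() + 1e-8                      # the normalizer epsilon from the text
    w = torch.exp(-sq_dist / eps)                    # kernel value per edge
    N = X.shape[0]
    A = torch.zeros(N, N, device=X.device)
    A[src, dst] = w
    A = torch.maximum(A, A.t())                      # symmetrize if edges are listed once
    d_inv_sqrt = A.sum(dim=1).clamp(min=1e-12).rsqrt()
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # S = D^{-1/2} A D^{-1/2}
```

The resulting $S$ can then be passed to a differentiable Lanczos routine; since every operation above is differentiable, the training loss can back-propagate to $\theta$ (graph kernel learning) or, with $f_{\theta}$ dropped, directly to the node embeddings $X$.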
With the above parameterization of the graph Laplacian and the tridiagonal decomposition, we can back-propagate the loss through the Lanczos algorithm to either the graph kernel parameters $\theta$ or the node embeddings $X$. The overall model is similar to LanczosNet except that the Lanczos algorithm needs to be invoked for each inference pass.

4 LANCZOS NETWORK AND DIFFUSION MAPS

In this section, we highlight the relationship between LanczosNet and an important example of graph based manifold learning algorithms, diffusion maps [17].

**Diffusion Maps** In diffusion maps, the weights in the adjacency matrix define a discrete random walk over the graph, where the Markov transition matrix $P = D^{-1} A$ gives the transition probabilities in a single time step. Therefore, $P^t_{i,j}$ sums the probability of all paths of length $t$ that start at node $i$ and end at node $j$. It is shown in [17] that $P$ can be used to define an inner product in a Hilbert space. Specifically, we use the eigenvalues and right eigenvectors $\{\lambda_l, \psi_l\}_{l=1}^{N}$ of $P$ to define a diffusion mapping $\Phi_t$ as

$$\Phi_t(i) = \left[ \lambda_1^t \psi_1(i), \lambda_2^t \psi_2(i), \dots, \lambda_N^t \psi_N(i) \right], \qquad (9)$$

where $\psi_l(i)$ is the $i$-th entry of the eigenvector $\psi_l$. Since the row stochastic matrix $P$ is similar to $S$, i.e., $P = D^{-1/2} S D^{1/2}$, we have $\psi_l = D^{-1/2} u_l$. The mapping $\Phi_t$ satisfies $\sum_{k=1}^{N} P^t_{i,k} P^t_{j,k} / D_{k,k} = \langle \Phi_t(i), \Phi_t(j) \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product over Euclidean space. The diffusion distance between $i$ and $j$, $d^2_{\mathrm{DM},t}(i, j) = \| \Phi_t(i) - \Phi_t(j) \|^2 = \sum_{k=1}^{N} (P^t_{i,k} - P^t_{j,k})^2 / D_{k,k}$, is the weighted-$\ell_2$ proximity between the probability clouds of random walkers starting at $i$ and ending at $j$ after $t$ steps. Since all eigenvalues of $S$ reside in the interval $[-1, 1]$, for some large $t$, $\lambda_l^t$ in Eq. (9) is close to zero, and $d_{\mathrm{DM},t}$ can be well approximated by using only a few of the largest eigenvalues and their eigenvectors.

**Connection to Graph Convolution** Apart from using diffusion maps to embed node features $X$ at different time scales, one can use them to compute frequency representations of $X$ as

$$\hat{X} = \Lambda^t U^{\top} X, \qquad (10)$$

where $U$ are the eigenvectors of $S$ and define the graph Fourier transform. The frequency representation $\hat{X}$ is weighted by the powers of the eigenvalues $\lambda_l^t$, suppressing entries corresponding to eigenvalues of small magnitude. Recall that in the convolution layer Eq. (5) of LanczosNet, we use multiple such frequency representations with different scales $t$ and replace the eigenvalues $\Lambda$ in Eq. (10) with their approximations, i.e., the Ritz values. Therefore, in LanczosNet, spectral filters are actually applied to frequency representations which are obtained by projecting the node features $X$ onto multiple diffusion maps with different scales.

5 RELATED WORK

We can roughly categorize the applications of machine learning, especially deep learning, to graph structured data into supervised/semi-supervised and unsupervised scenarios. For the former, a majority of work focuses on node/graph classification and regression [19, 20, 21, 1]. For the latter, unsupervised node/graph embedding learning [22, 23] is common. Recently, generative models for graphs, e.g., for molecule generation, have drawn some attention [24, 25].

**Graph Convolution Based Models** The first class of learning models on graphs stems from graph signal processing (GSP) [12, 4], which tries to generalize convolution operators from traditional signal processing to graphs.
Relying on spectral graph theory [26] and graph wavelet theory [27], several definitions of frequency representations of graph signals have been proposed [4]. Among these, the spectral graph theory based one is popular, where the graph Fourier transform and its inverse are defined based on the eigenbasis of the graph Laplacian. Following this line, many graph convolution based deep network models have emerged. [2, 28] are among the first to explore Laplacian based graph convolution within the context of deep networks. Meanwhile, [29] performs graph convolution directly based on the adjacency matrix to predict molecule fingerprints. [30] proposes a strategy to form same sized local neighborhoods and then apply convolution as in regular CNNs. Chebyshev polynomials are exploited by [7] to construct localized polynomial filters for graph convolution, which are later simplified in graph convolutional networks (GCN) [11]. Further accelerations for GCN based on importance sampling and control variate techniques have been proposed by [31, 32]. Several attention mechanisms have been introduced in [33, 34] to learn the weights over edges for GCNs. Notably, [8] proposes diffusion convolutional neural networks (DCNN), which use a diffusion operator for graph convolution. The Lanczos method has been explored for graph convolution in [35] for the purpose of acceleration. Specifically, they only consider the localized polynomial filter case of our LanczosNet variant and do not explore the low rank decomposition, learnable spectral filters, or graph kernel/node embedding learning as we do.

**Recurrent Neural Networks based Models** The second class of models dates back to recursive neural networks [36], which recurrently apply neural networks to trees following a particular order. Graph neural networks (GNN) [3] generalize recursive neural networks to arbitrary graphs and exploit a synchronous schedule to propagate information on graphs. [37] later proposes gated graph neural networks (GGNN), which improve GNN by adding gated recurrent units and training the network with back-propagation through time. [38] learns graph embeddings by unrolling variational inference algorithms over a graph as an RNN. [39] introduces random subgraph sampling and explores different aggregation functions to scale GNN to large graphs. [40] proposes asynchronous propagation schedules based on graph partitions to improve GNN. Moreover, many applications have recently emerged for GNNs, including community detection [41], situation recognition [42], RGBD semantic segmentation [43], few-shot learning [21], probabilistic inference [44], continuous control in reinforcement learning [45, 46], and so on.

**Graph based Manifold Learning** Non-linear dimensionality reduction methods, such as locally linear embedding (LLE) [47], ISOMAP [48], Hessian LLE [49], Laplacian eigenmaps [50], and diffusion maps [17], assume that the high-dimensional data lie on or close to a low dimensional manifold and use the local affinities in the weighted graph to learn the global features of the data. They are invaluable tools for embedding complex data in a low dimensional space and regressing functions over graphs. Spectral clustering [51, 52], semi-supervised learning [53], and out-of-sample extension [54] share similar geometric considerations of the associated graphs. Anisotropic graph kernels are useful in many applications.
For example, [15] improves spectral clustering results with a self-tuning diffusion kernel that takes into account the local variance at each node in the Gaussian kernel function. Similarly, [55] uses the anisotropic Gaussian kernel defined by local Mahalanobis distances to extract independent components from nonlinear measurements of independent stochastic Itô processes. Manifold learning with anisotropic kernels is also useful for data-driven dynamical system analysis, for example, detecting intrinsically slow variables of a stochastic dynamical system [56], filtering dynamical processes [57], and long range climate forecasting [58, 59]. Anisotropic diffusion is able to use the local statistics of the measurements to convey geometric information on the underlying factors rather than the specific realization or measurements at hand [60, 61].

Cora
| Split | GCN-FP | GGNN | DCNN | ChebyNet | GCN | MPNN | GraphSAGE | GAT | LNet | AdaLNet |
|---|---|---|---|---|---|---|---|---|---|---|
| Public | 74.6 ± 0.7 | 77.6 ± 1.7 | 79.7 ± 0.8 | 78.0 ± 1.2 | 80.5 ± 0.8 | 78.0 ± 1.1 | 74.5 ± 0.8 | 82.6 ± 0.7 | 79.5 ± 1.8 | 80.4 ± 1.1 |
| 3% | 71.7 ± 2.4 | 73.1 ± 2.3 | 76.7 ± 2.5 | 62.1 ± 6.7 | 74.0 ± 2.8 | 72.0 ± 4.6 | 64.2 ± 4.0 | 56.8 ± 7.9 | 76.3 ± 2.3 | 77.7 ± 2.4 |
| 1% | 59.6 ± 6.5 | 60.5 ± 7.1 | 66.4 ± 8.2 | 44.2 ± 5.6 | 61.0 ± 7.2 | 56.7 ± 5.9 | 49.0 ± 5.8 | 48.6 ± 8.0 | 66.1 ± 8.2 | 67.5 ± 8.7 |
| 0.5% | 50.5 ± 6.0 | 48.2 ± 5.7 | 59.0 ± 10.7 | 33.9 ± 5.0 | 52.9 ± 7.4 | 46.5 ± 7.5 | 37.5 ± 5.4 | 41.4 ± 6.9 | 58.1 ± 8.2 | 60.8 ± 9.0 |

Citeseer
| Split | GCN-FP | GGNN | DCNN | ChebyNet | GCN | MPNN | GraphSAGE | GAT | LNet | AdaLNet |
|---|---|---|---|---|---|---|---|---|---|---|
| Public | 61.5 ± 0.9 | 64.6 ± 1.3 | 69.4 ± 1.3 | 70.1 ± 0.8 | 68.1 ± 1.3 | 64.0 ± 1.9 | 67.2 ± 1.0 | 72.2 ± 0.9 | 66.2 ± 1.9 | 68.7 ± 1.0 |
| 1% | 54.3 ± 4.4 | 56.0 ± 3.4 | 62.2 ± 2.5 | 59.4 ± 5.4 | 58.3 ± 4.0 | 54.3 ± 3.5 | 51.0 ± 5.7 | 46.5 ± 9.3 | 61.3 ± 3.9 | 63.3 ± 1.8 |
| 0.5% | 43.9 ± 4.2 | 44.3 ± 3.8 | 53.1 ± 4.4 | 45.3 ± 6.6 | 47.7 ± 4.4 | 41.8 ± 5.0 | 33.8 ± 7.0 | 38.2 ± 7.1 | 53.2 ± 4.0 | 53.8 ± 4.7 |
| 0.3% | 38.4 ± 5.8 | 36.5 ± 5.1 | 44.3 ± 5.1 | 39.3 ± 4.9 | 39.2 ± 6.3 | 36.0 ± 6.1 | 25.7 ± 6.1 | 30.9 ± 6.9 | 44.4 ± 4.5 | 46.7 ± 5.6 |

Pubmed
| Split | GCN-FP | GGNN | DCNN | ChebyNet | GCN | MPNN | GraphSAGE | GAT | LNet | AdaLNet |
|---|---|---|---|---|---|---|---|---|---|---|
| Public | 76.0 ± 0.7 | 75.8 ± 0.9 | 76.8 ± 0.8 | 69.8 ± 1.1 | 77.8 ± 0.7 | 75.6 ± 1.0 | 76.8 ± 0.6 | 76.7 ± 0.5 | 78.3 ± 0.3 | 78.1 ± 0.4 |
| 0.1% | 70.3 ± 4.7 | 70.4 ± 4.5 | 73.1 ± 4.7 | 55.2 ± 6.8 | 73.0 ± 5.5 | 67.3 ± 4.7 | 65.4 ± 6.2 | 59.6 ± 9.5 | 73.4 ± 5.1 | 72.8 ± 4.6 |
| 0.05% | 63.2 ± 4.7 | 63.3 ± 4.0 | 66.7 ± 5.3 | 48.2 ± 7.4 | 64.6 ± 7.5 | 59.6 ± 4.0 | 53.0 ± 8.0 | 50.4 ± 9.7 | 68.8 ± 5.6 | 66.0 ± 4.5 |
| 0.03% | 56.2 ± 7.7 | 55.8 ± 7.7 | 60.9 ± 8.2 | 45.3 ± 4.5 | 57.9 ± 8.1 | 53.9 ± 6.9 | 45.4 ± 5.5 | 50.9 ± 8.8 | 60.4 ± 8.6 | 61.0 ± 8.7 |

Table 1: Test accuracy with 10 runs on citation networks. The public splits in Cora, Citeseer and Pubmed contain 5.2%, 3.6% and 0.3% training examples respectively.

6 EXPERIMENTS

In this section, we compare our two model variants against 9 recent graph networks, including graph convolution networks for fingerprints (GCN-FP) [29], gated graph neural networks (GGNN) [37], diffusion convolutional neural networks (DCNN) [8], Chebyshev networks (ChebyNet) [7], graph convolutional networks (GCN) [11], message passing neural networks (MPNN) [62], graph sample and aggregate (GraphSAGE) [39], graph partition neural networks (GPNN) [40], and graph attention networks (GAT) [33]. We test them on two sets of tasks: (1) semi-supervised document classification on 3 citation networks [63], and (2) supervised regression of molecule properties on the QM8 quantum chemistry dataset [64]. For fair comparison, we only tune model-related hyperparameters in all our experiments and share the others, e.g., using the same batch size. We carefully tune hyperparameters based on cross-validation and report the best performance of each competitor. Please refer to the appendix for more details on hyperparameters.
We implement all methods using PyTorch [65] and release the code at https://github.com/lrjconan/LanczosNetwork.

6.1 CITATION NETWORKS

The three citation networks used in this experiment are Cora, Citeseer and Pubmed. For each network, nodes are documents which are connected based on their citation links. Each node is associated with a bag-of-words feature vector. We use the same pre-processing procedure and follow the transductive setting as in [63]. In particular, given a portion of nodes and their labeled content categories, e.g., history, science, the task is to predict the category of the other unlabeled nodes within the same graph. The statistics of these datasets are summarized in the appendix. All experiments are repeated 10 times with different random seeds. During each run, all methods share the same random seed. We first experiment with the public data split and observe severe overfitting for almost all algorithms. To mitigate overfitting and test the robustness of models, we then increase the difficulty of the task by reducing the portion of training examples to several levels and randomly splitting the data. Experimental results and the exact portions of training examples are shown in Table 1. We use the reported best hyperparameters when available for the public split and do cross-validation otherwise. Hyperparameters are reported in the appendix.

From the table, we see that for random splits with different portions of training examples, since each run of the experiment uses a separate random split, the overall variance is larger than for the public counterpart. We see that GAT achieves the best performance on the public split but performs poorly on random splits with different portions of training examples. This is partly due to the fact that GAT uses dropout in multiple places throughout the model, which helps only if there is overfitting. We can see that either LanczosNet or AdaLanczosNet achieves state-of-the-art accuracy on the difficult random splits and performs closely to GAT on the public splits. This may be attributed to the fact that with fewer training examples, the model requires longer scale schemes to spread supervised information over the graph. Our model provides an efficient way of leveraging such long scale information.

| Methods | Validation MAE (× 1.0e-3) | Test MAE (× 1.0e-3) |
|---|---|---|
| GCN-FP [29] | 15.06 ± 0.04 | 14.80 ± 0.09 |
| GGNN [37] | 12.94 ± 0.05 | 12.67 ± 0.22 |
| DCNN [8] | 10.14 ± 0.05 | 9.97 ± 0.09 |
| ChebyNet [7] | 10.24 ± 0.06 | 10.07 ± 0.09 |
| GCN [11] | 11.68 ± 0.09 | 11.41 ± 0.10 |
| MPNN [62] | 11.16 ± 0.13 | 11.08 ± 0.11 |
| GraphSAGE [39] | 13.19 ± 0.04 | 12.95 ± 0.11 |
| GPNN [40] | 12.81 ± 0.80 | 12.39 ± 0.77 |
| GAT [33] | 11.39 ± 0.09 | 11.02 ± 0.06 |
| LanczosNet | 9.65 ± 0.19 | 9.58 ± 0.14 |
| AdaLanczosNet | 10.10 ± 0.22 | 9.97 ± 0.20 |

Table 2: Mean absolute error on the QM8 dataset.

6.2 QUANTUM CHEMISTRY

We then benchmark all algorithms on the QM8 quantum chemistry dataset, which comes from a recent study on modeling quantum mechanical calculations of electronic spectra and excited state energies of small molecules [64]. The setup of QM8 is as follows. Atoms are treated as nodes and are connected to each other following the structure of the corresponding molecule. Each edge is labeled with a chemical bond. Note that two atoms in one molecule can have multiple edges belonging to different chemical bonds. Therefore, a molecule is actually modeled as a multigraph. We also use explicit hydrogens in molecule graphs, as suggested in [62].
Since some models cannot easily leverage features on edges, we use the molecule graph itself as the only input information for all models so that the comparison is fair. As demonstrated in our ablation studies, learning node embeddings for atoms is very helpful. Therefore, we augment all competitors and our models with this component. The task is to predict 16 different quantities of electronic spectra and energy per molecule graph, which boils down to a regression problem. There are 21,786 molecule graphs in total, of which the average numbers of nodes and edges are around 16 and 21. There are 6 different chemical bonds and 70 different atoms throughout the dataset. We use the split provided by DeepChem (https://deepchem.io/), which has 17,428, 2,179 and 2,179 graphs for training, validation and testing respectively. Following [62, 66], we use mean squared error (MSE) as the loss for training and weighted mean absolute error (MAE) as the evaluation metric. We repeat all experiments 3 times with different random seeds and report the average performance and standard deviation. The same random seed is shared by all methods per run. Hyperparameters are reported in the appendix. The validation and test MAE of all methods are shown in Table 2. As the table shows, LanczosNet and AdaLanczosNet achieve better performance than all other competitors. Note that DCNN also achieves good performance with carefully chosen scale parameters, since it is somewhat similar to our model in terms of leveraging multi-scale information.

6.3 ABLATION STUDY

We also perform a thorough ablation study of our modeling components on the validation set of QM8.

Multi-Scale Graph Convolution: We first study the effect of multi-scale graph convolution. In order to rule out the impact of other factors, we use LanczosNet, do not employ the learnable spectral filter, and use the one-hot encoding as the node embedding. The results are shown in the first block of Table 3. Using long scales for graph convolution clearly helps on this task. Combining both short and long scales further improves results.

Lanczos Step: We then investigate the Lanczos step, since it has an impact on the accuracy of the low rank approximation induced by the Lanczos algorithm. The results are shown in the second block of Table 3. We can see that the performance is better with a relatively small Lanczos step like 10 or 20, which makes sense since the average number of nodes in this dataset is around 16.

Learning Spectral Filter: We then study whether learning the spectral filter helps improve performance. The results are shown in the third block of Table 3.
Adding a 3-layer MLP does help reduce the error compared to not using any learnable spectral filter. Note that the MLP consists of 128 hidden units per layer and uses ReLU as the nonlinearity. However, using a deeper MLP does not seem to be helpful, which might be caused by challenges in optimization.

Graph Kernel/Node Embedding: At last, we study the usefulness of adding the graph kernel and node embeddings. We first fix the node embedding as the one-hot encoding and learn a 3-layer MLP, which is the function $f_{\theta}$ in Eq. (7). Next, we learn the node embeddings directly. Intuitively, learning embeddings amounts to learning a separate function $f$ per node, whereas our graph kernel learning enforces that $f$ is shared by all nodes, thus being more restrictive. As shown in the third and fourth blocks of the table, learning node embeddings significantly improves the performance for both LanczosNet and AdaLanczosNet and is more effective than learning graph kernels. Also, tuning the scale parameters further boosts the performance.

| Model | Graph Kernel | Node Embedding | Spectral Filter | Short Scales | Long Scales | Lanczos Step | Validation MAE (× 1.0e-3) |
|---|---|---|---|---|---|---|---|
| LanczosNet | | one-hot | | {1, 2, 3} | | | 10.71 |
| LanczosNet | | one-hot | | {3, 5, 7} | | | 10.60 |
| LanczosNet | | one-hot | | | {10, 20, 30} | 20 | 10.54 |
| LanczosNet | | one-hot | | {3, 5, 7} | {10, 20, 30} | 20 | 10.41 |
| LanczosNet | | one-hot | | | {10, 20, 30} | 5 | 10.49 |
| LanczosNet | | one-hot | | | {10, 20, 30} | 10 | 10.44 |
| LanczosNet | | one-hot | | | {10, 20, 30} | 20 | 10.54 |
| LanczosNet | | one-hot | | | {10, 20, 30} | 40 | 10.49 |
| LanczosNet | | one-hot | 3-MLP | {3, 5, 7} | {10, 20, 30} | 20 | 10.44 |
| LanczosNet | | one-hot | 5-MLP | {3, 5, 7} | {10, 20, 30} | 20 | 10.54 |
| LanczosNet | | learned | 3-MLP | {3, 5, 7} | {10, 20, 30} | 20 | 10.26 |
| LanczosNet | | learned | 3-MLP | | {1, 2, 3, 5, 7, 10, 20, 30} | 20 | 9.56 |
| AdaLanczosNet | learned | one-hot | 3-MLP | {3, 5, 7} | {10, 20, 30} | 20 | 10.99 |
| AdaLanczosNet | | learned | 3-MLP | {3, 5, 7} | {10, 20, 30} | 20 | 10.20 |
| AdaLanczosNet | | learned | 3-MLP | {1, 2, 3} | {5, 7, 10, 20, 30} | 20 | 9.96 |

Table 3: Ablation study on the QM8 dataset. An empty cell means that the component is neither used nor applicable. X-MLP means an MLP with X hidden layers. "one-hot" means the node embedding is fixed as the one-hot encoding throughout learning and inference; "learned" means the corresponding component is learned.

7 CONCLUSION

In this paper, we propose LanczosNet, which leverages the Lanczos algorithm to construct a low rank approximation of the graph Laplacian. It not only provides an efficient way to gather multi-scale information for graph convolution but also enables learning spectral filters. Additionally, we propose a model variant, AdaLanczosNet, which facilitates graph kernel and node embedding learning. We show that our model has a close relationship with graph based manifold learning, especially diffusion maps. Experimental results demonstrate that our model outperforms a range of other graph networks on challenging graph problems. We are currently exploring customized eigendecomposition methods for tridiagonal matrices, which could potentially further improve our AdaLanczosNet. Overall, work in this direction holds promise for allowing deep learning to scale up to very large graph problems.

ACKNOWLEDGMENTS

RL thanks Roger Grosse for introducing the Lanczos algorithm to him. RL was supported by Connaught International Scholarships. RL, RU and RZ were supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S.
Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

REFERENCES

[1] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. ICLR, 2014.
[3] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE TNN, 2009.
[4] Antonio Ortega, Pascal Frossard, Jelena Kovačević, José MF Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 2018.
[5] Andrew P Witkin. Scale-space filtering. In IJCAI, 1983.
[6] Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods. PNAS, 2005.
[7] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
[8] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NIPS, 2016.
[9] Amit Singer, Yoel Shkolnisky, and Boaz Nadler. Diffusion interpretation of nonlocal neighborhood filters for signal denoising. SIAM Journal on Imaging Sciences, 2009.
[10] Neta Rabin and Dalia Fishelov. Multi-scale kernels for Nyström based extension schemes. Applied Mathematics and Computation, 319:165-177, 2018.
[11] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[12] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 2013.
[13] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
[14] Horst D Simon and Hongyuan Zha. Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM Journal on Scientific Computing, 2000.
[15] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In NIPS, 2005.
[16] Kilian Q Weinberger and Gerald Tesauro. Metric learning for kernel regression. In AISTATS, 2007.
[17] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 2006.
[18] Amit Singer. From graph to manifold Laplacian: The convergence rate. Applied and Computational Harmonic Analysis, 2006.
[19] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. JMLR, 2010.
[20] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 2017.
[21] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.
[22] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
[23] Alberto Garcia Duran and Mathias Niepert. Learning graph representations with embedding propagation. In NIPS, 2017.
[24] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. In ICLR Workshop, 2018.
[25] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In ICLR, 2018.
[26] Fan RK Chung. Spectral graph theory. American Mathematical Soc., 1997.
[27] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 2011.
[28] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
[29] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
[30] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016.
[31] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
[32] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. In ICML, 2018.
[33] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
[34] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In CVPR, 2018.
[35] Ana Susnjara, Nathanael Perraudin, Daniel Kressner, and Pierre Vandergheynst. Accelerated filtering on graphs using Lanczos method. arXiv preprint arXiv:1509.04537, 2015.
[36] Jordan B Pollack. Recursive distributed representations. Artificial Intelligence, 1990.
[37] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016.
[38] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
[39] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
[40] Renjie Liao, Marc Brockschmidt, Daniel Tarlow, Alexander L Gaunt, Raquel Urtasun, and Richard Zemel. Graph partition neural networks for semi-supervised classification. In ICLR Workshop, 2018.
[41] Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
[42] Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, and Sanja Fidler. Situation recognition with graph neural networks. In ICCV, 2017.
[43] Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. 3D graph neural networks for RGBD semantic segmentation. In ICCV, 2017.
[44] KiJung Yoon, Renjie Liao, Yuwen Xiong, Lisa Zhang, Ethan Fetaya, Raquel Urtasun, Richard Zemel, and Xaq Pitkow. Inference in probabilistic graphical models by graph neural networks. arXiv preprint arXiv:1803.07710, 2018.
[45] Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler.
NerveNet: Learning structured policy with graph neural networks. In ICLR, 2018.
[46] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.
[47] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.
[48] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.
[49] David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. PNAS, 2003.
[50] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.
[51] Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In NIPS, 2006.
[52] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[53] Xiaojin Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2006.
[54] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 2006.
[55] Amit Singer and Ronald R Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 2008.
[56] Amit Singer, Radek Erban, Ioannis G Kevrekidis, and Ronald R Coifman. Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps. PNAS, 2009.
[57] Ronen Talmon and Ronald R Coifman. Empirical intrinsic geometry for nonlinear modeling and time series filtering. PNAS, 2013.
[58] Dimitrios Giannakis. Dynamics-adapted cone kernels. SIAM Journal on Applied Dynamical Systems, 2015.
[59] Zhizhen Zhao and Dimitrios Giannakis. Analog forecasting with dynamics-adapted kernels. Nonlinearity, 2016.
[60] John Lafferty and Guy Lebanon. Diffusion kernels on statistical manifolds. JMLR, 2005.
[61] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry. American Mathematical Soc., 2007.
[62] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
[63] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.
[64] Raghunathan Ramakrishnan, Mia Hartmann, Enrico Tapavicza, and O Anatole von Lilienfeld. Electronic spectra from TDDFT and machine learning in chemical space. The Journal of Chemical Physics, 2015.
[65] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.
[66] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 2018.
[67] Beresford N Parlett. The Symmetric Eigenvalue Problem. SIAM, 1980.
[68] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
8.1 LOW RANK APPROXIMATION

We first state the following lemma from [67] without proof and then prove our Theorem 1 following [14].

Lemma 1. Let $A \in \mathbb{R}^{N \times N}$ be symmetric and $v$ an arbitrary vector. Define the Krylov subspace $\mathcal{K}_m \triangleq \mathrm{span}\{v, Av, \dots, A^{m-1}v\}$. Let $A = U \Lambda U^{\top}$ be the eigendecomposition of $A$ with $\Lambda_{i,i} = \lambda_i$ and $\lambda_1 \ge \cdots \ge \lambda_N$. Denoting $U = [u_1, \dots, u_N]$ and $\mathcal{U}_j = \mathrm{span}\{u_1, \dots, u_j\}$, then

$$\tan\angle(u_j, \mathcal{K}_m) \le \frac{\sin\angle(v, \mathcal{U}_j) \prod_{k=1}^{j-1} (\lambda_k - \lambda_N)/(\lambda_k - \lambda_j)}{\cos\angle(v, u_j) \, T_{m-j}(1 + 2\gamma)},$$

where $T_{m-j}(x)$ is the Chebyshev polynomial of degree $m - j$ and $\gamma = (\lambda_j - \lambda_{j+1})/(\lambda_{j+1} - \lambda_N)$.

Theorem 1. Let $U \Lambda U^{\top}$ be the eigendecomposition of an $N \times N$ symmetric matrix $S$ with $\Lambda_{i,i} = \lambda_i$, $\lambda_1 \ge \cdots \ge \lambda_N$ and $U = [u_1, \dots, u_N]$. Let $\mathcal{U}_j \triangleq \mathrm{span}\{u_1, \dots, u_j\}$. Assume the $K$-step Lanczos algorithm starts with vector $v$ and outputs the orthogonal $Q \in \mathbb{R}^{N \times K}$ and tridiagonal $T \in \mathbb{R}^{K \times K}$. For any $j$ with $1 < j < N$ and $K > j$, we have

$$\| S - Q T Q^{\top} \|_F \le \sqrt{ \sum_{i=1}^{j} \left( \frac{ \sin\angle(v, \mathcal{U}_i) \prod_{k=1}^{i-1} (\lambda_k - \lambda_N)/(\lambda_k - \lambda_i) }{ \cos\angle(v, u_i) \, T_{K-i}(1 + 2\gamma_i) } \right)^{2} \lambda_i^{2} + \sum_{i=j+1}^{N} \lambda_i^{2} },$$

where $T_{K-i}(x)$ is the Chebyshev polynomial of degree $K - i$ and $\gamma_i = (\lambda_i - \lambda_{i+1})/(\lambda_{i+1} - \lambda_N)$.

Proof. From the Lanczos algorithm, we have $SQ = QT$. Therefore,

$$\| S - Q T Q^{\top} \|_F^2 = \| S - S Q Q^{\top} \|_F^2 = \| S (I - Q Q^{\top}) \|_F^2. \qquad (11)$$

Let $P_Q^{\perp} \triangleq I - Q Q^{\top}$, the orthogonal projection onto the orthogonal complement of the subspace $\mathrm{span}\{Q\}$. Relying on the eigendecomposition, we have

$$\| S - Q T Q^{\top} \|_F^2 = \| U \Lambda U^{\top} (I - Q Q^{\top}) \|_F^2 = \| \Lambda U^{\top} (I - Q Q^{\top}) \|_F^2 = \| (I - Q Q^{\top}) U \Lambda \|_F^2 = \| [\lambda_1 P_Q^{\perp} u_1, \dots, \lambda_N P_Q^{\perp} u_N] \|_F^2, \qquad (12)$$

where we use the facts that $\| R A \|_F^2 = \| A \|_F^2$ for any orthogonal matrix $R$ and $\| A^{\top} \|_F^2 = \| A \|_F^2$. Note that for any $j$ we have

$$\| [\lambda_1 P_Q^{\perp} u_1, \dots, \lambda_N P_Q^{\perp} u_N] \|_F^2 = \sum_{i=1}^{N} \lambda_i^2 \| P_Q^{\perp} u_i \|^2 \le \sum_{i=1}^{j} \lambda_i^2 \| P_Q^{\perp} u_i \|^2 + \sum_{i=j+1}^{N} \lambda_i^2, \qquad (13)$$

where we use the fact that for any $i$, $\| P_Q^{\perp} u_i \|^2 = \| u_i \|^2 - \| u_i - P_Q^{\perp} u_i \|^2 \le \| u_i \|^2 = 1$. Note that we have $\mathrm{span}\{Q\} = \mathrm{span}\{v, Sv, \dots, S^{K-1}v\} \triangleq \mathcal{K}_K$ from the Lanczos algorithm. Therefore, we have

$$\| P_Q^{\perp} u_i \| = | \sin\angle(u_i, \mathcal{K}_K) | \le | \tan\angle(u_i, \mathcal{K}_K) |. \qquad (14)$$

Applying the above lemma with $A = S$, we finish the proof.

| Dataset | #Nodes | #Edges | #Classes | #Features | S0 | S1 | S2 | S3 |
|---|---|---|---|---|---|---|---|---|
| Citeseer | 3,327 | 4,732 | 6 | 3,703 | 3.6% | 1% | 0.5% | 0.3% |
| Cora | 2,708 | 5,429 | 7 | 1,433 | 5.2% | 3% | 1% | 0.5% |
| Pubmed | 19,717 | 44,338 | 3 | 500 | 0.3% | 0.1% | 0.05% | 0.03% |

Table 4: Dataset statistics. S0 is the portion of training examples in the public split. S1 to S3 are the portions of the 3 random splits generated by us.

8.2 LANCZOS ALGORITHM

Using exact arithmetic, the Lanczos vectors are orthogonal to each other. However, in floating point arithmetic, it is well known that round-off error will make the Lanczos vectors lose orthogonality as the iteration proceeds. One could apply a full Gram-Schmidt (GS) process $z = z - \sum_{i=1}^{j-1} (z^{\top} q_i) q_i$ after line 6 of Alg. 1 to ensure orthogonality. Other partial or selective re-orthogonalization schemes could also be explored. However, since we found that the orthogonality issue does not hurt overall performance with a small iteration number, e.g., $K = 20$, and the full GS process is computationally expensive, we do not add such a step. Although some customized eigendecomposition methods, e.g., implicit QL [68], exist for tridiagonal matrices, we leave them for future exploration due to their complicated implementation.

8.3 EXPERIMENTS

For ChebyNet, we do not use graph coarsening in any experiment due to its demanding computation cost for large graphs. Also, for small molecule graphs, coarsening generally does not help since it loses information compared to directly stacking another layer on the original graph.

**Citation Networks** The statistics of the three citation networks are summarized in Table 4.
We now report the important hyperparameters chosen via cross-validation for each method. All methods are trained with Adam with learning rate 1.0e-2 and weight decay 5.0e-4. The maximum number of epochs is set to 200. Early stopping with window size 10 is also adopted. We tune hyperparameters using Cora alone and fix them for Citeseer and Pubmed. For convolution based methods, we found 2 layers work the best. In GCN-FP, we set the hidden dimension to 64 and dropout to 0.5. In GGNN, we set the hidden dimension to 64, the propagation step to 2 and the aggregation function to summation. In DCNN, we set the hidden dimension to 64, dropout to 0.5 and use diffusion scales {1, 2, 5}. In ChebyNet, we set the polynomial order to 5, the hidden dimension to 64 and dropout to 0.5. In GCN, we set the hidden dimension to 64 and dropout to 0.5. In MPNN, we use a GRU as the update function and set the hidden dimension to 64 and dropout to 0.5. No edge embedding is used as there is just one edge type. In GraphSAGE, we set the number of sampled neighbors to 500, the hidden dimension to 64, dropout to 0.5 and the aggregation function to average. In GAT, we set the numbers of heads per layer to 8 and 1, the hidden dimension per head to 8 and dropout to 0.6. In LanczosNet, we set the short and long diffusion scales to {1, 2, 5, 7} and {10, 20, 30} respectively. The hidden dimension is 64 and dropout is 0.5. The Lanczos step is 20. A 1-layer MLP with 128 hidden units and ReLU nonlinearity is used as the spectral filter. In AdaLanczosNet, we set the short and long diffusion scales to {1, 2, 5} and {10, 20} respectively. The hidden dimension is 64 and dropout is 0.5. The Lanczos step is 20. A 1-layer MLP with 128 hidden units and ReLU nonlinearity is used as the spectral filter.

**Quantum Chemistry** We now report the important hyperparameters chosen via cross-validation for each method. All methods are trained with Adam with learning rate 1.0e-4 and no weight decay. The maximum number of epochs is set to 200. Early stopping with window size 10 is also adopted. For convolution based methods, we found 7 layers work the best. We augment all methods with a 64-dimensional node embedding and add edge types by either feeding a multi-channel graph Laplacian matrix or directly adding a separate message function per edge type. For all methods, no dropout is used since it slightly hurts performance. In GCN-FP, we set the hidden dimension to 128. In GGNN, we set the hidden dimension to 128, the propagation step to 15 and the aggregation function to average. In DCNN, we set the hidden dimension to 128 and use diffusion scales {3, 5, 7, 10, 20, 30}. In ChebyNet, we set the polynomial order to 5 and the hidden dimension to 128. In GCN, we set the hidden dimension to 128. In MPNN, we use a GRU as the update function, set the number of propagation steps to 7, set the hidden dimension to 128, use a 1-layer MLP with 1024 hidden units and ReLU nonlinearity as the message function, and set the number of unroll steps of Set2Vec to 10. In GraphSAGE, we set the number of sampled neighbors to 40, the hidden dimension to 128 and the aggregation function to average. In GAT, we set the number of heads of all 7 layers to 8 and the hidden dimension per head to 16. In LanczosNet, we do not use short diffusion scales and set the long ones to {1, 2, 3, 5, 7, 10, 20, 30}. The hidden dimension is 128. The Lanczos step is 20. A 1-layer MLP with 128 hidden units and ReLU nonlinearity is used as the spectral filter.
In AdaLanczosNet, we set the short and long diffusion scales to {1, 2, 3} and {5, 7, 10, 20, 30} respectively. The hidden dimension is 128. The Lanczos step is 20. A 3-layer MLP with 4096 hidden units and ReLU nonlinearity is used as the spectral filter.