Published in Transactions on Machine Learning Research (12/2024)

Sparse Neural Architectures via Deterministic Ramanujan Graphs

Suryam Arnav Kalra suryamkalra35@gmail.com
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur

Arindam Biswas arin.math@gmail.com
Polynom, Paris, France

Pabitra Mitra pabitra@gmail.com
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur

Biswajit Basu BASUB@tcd.ie
School of Engineering, Trinity College Dublin, Dublin 2, Ireland

Reviewed on OpenReview: https://openreview.net/forum?id=x8wscCAJ2m

Abstract

We present a method to construct sparse neural networks using the theory of expander graphs. Expanders are sparse but well connected graph structures that are used for designing resilient networks. A Ramanujan graph is an extremal expander in terms of the spectral gap of its eigenvalues. In this work, bipartite Ramanujan expanders are deterministically constructed and used as connection structures of the convolutional and fully connected layers of a neural network. The Ramanujan graphs occur either as Cayley graphs of certain algebraic groups or as Ramanujan r-coverings of the full (k, l) biregular bipartite graph on k + l vertices. The proposed sparse networks are found to provide comparable performance to a fully dense network on benchmark datasets while achieving an extremely low network density.

1 Introduction

Sparse neural networks are structures containing significantly fewer connection weights than a fully connected network. They have advantages in terms of inference time, memory requirements and generalization capabilities. A sparse neural network should be trainable to achieve a high prediction accuracy. The goal of our work is to develop a theoretically justified strategy for constructing sparse neural networks that can be subsequently trained to achieve a good prediction performance.

Expander graphs are networks that are simultaneously sparse and highly connected (Hoory et al., 2006). Expanders are resilient in the sense that a large number of their edges needs to be deleted to disconnect parts of the graph. A higher spectral gap between the first and the second eigenvalues of a graph adjacency matrix points towards a better expansion property. Ramanujan graphs (Lubotzky et al., 1988) are a class of regular expander graphs with maximally high spectral gaps.

Path connectedness and regularity are desired characteristics of a neural network both for forward propagation and gradient flow. These properties help even a sparse network to achieve a good prediction performance. It has been empirically shown that the expansion property is strongly correlated with the performance of sparse neural networks (Prabhu et al., 2018; Pal et al., 2022). This motivates the use of Ramanujan graph construction algorithms to design sparse neural networks.

In our approach, bipartite Ramanujan graphs are constructed for each of the convolutional and fully connected layers. These are stacked to form the entire network. The network is then randomly initialized with weight values and trained using backpropagation. Sparsity can be controlled by varying the parameters of the Ramanujan graph.
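Concretely, once a bipartite Ramanujan graph has been chosen for a layer, it enters the network as a fixed binary mask on that layer's weight matrix: the surviving weights are randomly initialized and trained as usual, while masked weights stay at zero. The following is a minimal sketch of this idea, assuming PyTorch; the module name MaskedLinear and the toy circulant mask are our own illustrative choices, not part of the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Fully connected layer whose connectivity is fixed by a 0/1 mask."""

    def __init__(self, mask: torch.Tensor):
        super().__init__()
        out_features, in_features = mask.shape
        self.linear = nn.Linear(in_features, out_features)
        # The mask is a buffer: saved with the model, never trained.
        self.register_buffer("mask", mask.float())
        # Prune the randomly initialized weights once at construction time.
        with torch.no_grad():
            self.linear.weight.mul_(self.mask)

    def forward(self, x):
        # Re-applying the mask keeps pruned weights at zero during training,
        # since gradient updates would otherwise make them non-zero again.
        return nn.functional.linear(x, self.linear.weight * self.mask,
                                    self.linear.bias)

# Hypothetical toy example: an 8x8 layer with a 3-regular bipartite mask
# (not one of the paper's Ramanujan masks).
mask = torch.zeros(8, 8)
for i in range(8):
    for k in (-1, 0, 1):          # connect i to i-1, i, i+1 (mod 8)
        mask[i, (i + k) % 8] = 1.0
layer = MaskedLinear(mask)
print(layer(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```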
The Ramanujan graphs are constructed either as Cayley graphs of certain algebraic groups or as Ramanujan r-coverings of the full (k, l) biregular bipartite graph on k + l vertices. Prior approaches to using Ramanujan expander graphs have relied on iterative weight pruning techniques or random graph generation, which often leads to the formation of random expander graphs. Our approach of constructing a deterministic Ramanujan network circumvents this problem. The construction is also data independent and non-iterative. Experimental results on benchmark image classification data sets show that Ramanujan sparse network initialization provides comparable performance to fully dense networks.

The paper is organized as follows. We present a brief literature survey in Section 1.1. Contributions of the paper are highlighted in Section 2. The properties and mathematical formulation of deterministic Ramanujan graphs are then presented in Section 3. The proposed construction techniques for sparse neural network layers are described in Section 4. Finally, the experimental results are outlined in Section 5.

1.1 Related Work

Sparse neural networks have been previously studied in terms of their graph theoretic properties (Liu et al., 2023). Connectivity properties have been used by several authors to define a sparse network topology (Vysogorets & Kempe, 2023; You et al., 2020; Chen et al., 2022; 2023a). Tam and Dunson (Tam & Dunson, 2023) gave a technique for regularization of a neural network taking into account the connectivity property of the underlying graph. L0 regularization based training, which encourages zero weights, has been extensively studied previously (Louizos et al., 2018). More sophisticated regularization techniques have also been used (Refael et al., 2024).

Existence of sparse trainable high performing subnetworks of a fully connected network is suggested by the well known lottery ticket hypothesis (Frankle & Carbin, 2019). These subnetworks are often identified using weight pruning techniques (Orseau et al., 2020). Initial research works in this direction were based on applying established pruning algorithms on a partially trained network (Renda et al., 2020; Fischer & Burkholz, 2022). Recently, a number of approaches have been suggested to obtain a sparse mask for pruning at initialization (Frankle et al., 2020; Pham et al., 2023; Wang et al., 2021; Sreenivasan et al., 2022; Cheng et al., 2023). Robust methods of pruning at initialization have been proposed in (Hayou et al., 2021).

Expander based sparse network generation has been studied in (Stewart et al., 2023). The methodology is based on generating random d-regular graphs for the bipartite layers and then filtering based on their expansion property. However, it is still a conjecture that random d-regular graphs are Ramanujan. A deep expander sparse network, the X-Net, is presented in (Prabhu et al., 2018). It is constructed by sampling d-left regular graphs from the space of all bipartite graphs. RadiX-Net (Kepner & Robinett, 2019) is a related deterministic sparse neural architecture defined over diverse layer structures. Ramanujan graph based sparse networks are proposed in (Esguerra et al., 2023). Spectral sparsification is a general method of obtaining expander like neural networks (Laenen, 2023). Expander networks used for this purpose are obtained by first generating random bipartite graphs for each layer, and then selecting the ones with a large spectral gap.
However, these approaches often favor random networks, which are sensitive to reinitialization (Ma et al., 2021; Hoang et al., 2023). Spectral measures have been proposed in (Hoang et al., 2023) to prevent these effects.

2 Research Gap and Contributions

Most of the popular techniques for sparse network construction rely on pruning a partially trained or an untrained network. They usually consider certain heuristic weight importance functions. Expander graph networks are more theoretically justified. Existing expander graph construction methods are based on random graph generation. This does not always guarantee the path-connectedness and regularity properties that are crucial for neural networks. The proposed deterministic Ramanujan graph sparse neural network construction technique circumvents these challenges.

Figure 1: An example of two consecutive (edge) neural layers where each layer is a bipartite Ramanujan graph. Note that one obtains something stronger than layer-wise path-connectivity. From each vertex of (vertex) layer I one has at least $d_1 d_2$ pathways to reach layer K, where $d_1, d_2$ denote the regularities of layers I, J respectively. For instance, to reach layer K from i1 there are the following 9 pathways: i1-j1-k2, i1-j1-k3, i1-j1-k4, i1-j2-k3, i1-j2-k4, i1-j2-k5, i1-j3-k1, i1-j3-k4, i1-j3-k5.

The principal contributions of our paper are:
1. A theoretically justified methodology for generating sparse neural networks using Ramanujan graphs.
2. The method is deterministic and is guaranteed to construct well connected and regular expander graphs. It is also non-iterative and data independent.
3. We experimentally show that even extremely sparse networks can be trained to achieve comparable accuracy to a fully dense network.

3 Properties of the Constructed Networks

Path connectedness is a crucial property of neural network graph structures. It is necessary both for forward propagation during inference as well as for gradient flow during backpropagation (Sporns, 2003; Alford et al., 2019). Symmetry and regularity of the layer-wise adjacency matrices is another important property, necessary for fast computation and memory efficiency (Chen et al., 2023b; Mao et al., 2017). We elaborate on them in the context of Ramanujan graphs.

Path-connectedness: The fact that each layer of the bipartite graphs is either regular or biregular with regularity at least 3 ensures that the entire architecture remains path-connected, i.e., starting from any node in the first layer we can reach a node in the last layer by a connected path. A proof of this is direct. In Figure 1, suppose there are 3 layers I, J, K. We wish to reach layer K starting from any point in layer I by a connected path. Pick any ir, r in {1, 2, 3, 4, 5}. Use the fact that there is at least one edge going out from ir to reach some js, and from js again use the fact that the number of outgoing edges is at least 1 to reach a point in layer K. The general case follows by induction on the number of layers.

High-symmetry: The adjacency matrices of Cayley graphs and of covers of Cayley graphs have much more symmetry than those of general regular graphs. For instance they are vertex transitive. Computations are often optimised to use such symmetry, as in the case of the software GAP (Group, 2022), for example. In the future we wish to explore this direction of using the underlying symmetry to obtain fast computations on these sparse expander networks. This is also one of the reasons why we prefer deterministic constructions over random ones, as the latter lose symmetry.
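The path-count claim in Figure 1 can be verified mechanically: if $B_1$ and $B_2$ are the bi-adjacency matrices of layers I to J and J to K, then entry $(i, k)$ of $B_1 B_2$ counts the i-j-k pathways, and each row of $B_1 B_2$ sums to $d_1 d_2$. A small numpy sketch with illustrative degree-3 circulant layers (not necessarily the paper's exact matrices):

```python
import numpy as np

# Bi-adjacency of layer I -> J: 5 left vertices, each of degree d1 = 3.
B1 = np.array([[1, 1, 1, 0, 0],
               [0, 1, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [1, 0, 0, 1, 1],
               [1, 1, 0, 0, 1]])
# Bi-adjacency of layer J -> K: again degree d2 = 3 on the left side.
B2 = (np.roll(np.eye(5, dtype=int), 1, axis=1)
      + np.eye(5, dtype=int)
      + np.roll(np.eye(5, dtype=int), -1, axis=1))

paths = B1 @ B2          # paths[i, k] = number of i -> j -> k pathways
print(paths.sum(axis=1))  # every row sums to d1 * d2 = 9
```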
In the graph of Figure 1, the adjacency matrix of the first layer, which can be represented as a Cayley graph on the group $G = \mathbb{Z}_2 \times \mathbb{Z}_5 = \{(x, y) : 0 \le x \le 1,\ 0 \le y \le 4\}$ (here $\mathbb{Z}_n$ for $n > 1$ denotes the group of integers under addition modulo $n$ and $\times$ is the Cartesian product of groups) with generating set $S = \{(1, 0), (1, 1), (1, 4)\}$, is

$$\mathrm{Adj} = \begin{pmatrix} 0_{5 \times 5} & B_{5 \times 5} \\ B^T_{5 \times 5} & 0_{5 \times 5} \end{pmatrix}, \qquad B_{5 \times 5} = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 \end{pmatrix}.$$

Our goal is to furnish an efficient sparse neural network architecture. If we can describe the architecture for each bipartite layer of neurons then we are done. The adjacency matrix of a graph captures the structure of the graph. So if we can describe the adjacency matrix, or the way vertices of each bipartite layer are connected to other vertices, then it is sufficient to achieve our goal.

Previous approaches based on random network initialization and pruning strategies suffer from the issue of irregularity and are not guaranteed to be rigorously Ramanujan. For instance, application of the work of Hoory (Hoory, 2005), as mentioned in (Pal et al., 2022; Hoang et al., 2023), depends on the important fact that the minimal degrees of the base bipartite graphs need to be at least 2 for the graphs to be Ramanujan. Our architecture, based on deterministic regular Ramanujan graphs of degree at least 3, ensures that the initialized networks remain Ramanujan, are path-connected and are highly symmetric, being either Cayley graphs of certain algebraic groups to replace the balanced dense bipartite graphs, or the Ramanujan r-coverings of full biregular bipartite graphs to replace the unbalanced dense bipartite graphs.

4 Formulation of Sparse Neural Ramanujan Graphs

We construct a set of bipartite Ramanujan graphs of specified sparsity and then stack them as convolutional and fully connected layers of the sparse neural network. In this section we present the mathematical framework which allows us to construct these bipartite Ramanujan graphs in a deterministic manner. This forms the basis of our strategy for sparse network construction. The section proceeds as follows:
1. We shall first present the base graphs on which our sparse neural architectures will be modelled. These are the bipartite Ramanujan graphs. See Definition 4.1.
2. Next we shall give the method of construction of these graphs. These graphs are used to define the neural network layers.

Recall that a Ramanujan graph is an extremal expander graph in the sense that its spectral gap is almost as large as possible. Here, we shall be concerned with bipartite Ramanujan graphs. Recall that a bipartite graph is said to be balanced if the number of vertices in each of the partitions is the same, and it is said to be unbalanced otherwise.

Definition 4.1 (Bipartite Ramanujan graphs). Let $\Gamma = (V, E)$ be a $d$-regular ($d \ge 3$) balanced bipartite graph. Let the eigenvalues of its adjacency matrix be $\lambda_n \le \lambda_{n-1} \le \ldots \le \lambda_2 \le \lambda_1$. Then $\Gamma$ is said to be Ramanujan iff $|\lambda_i| \le 2\sqrt{d-1}$ for $i = 2, \ldots, (n-1)$. For an unbalanced $(d_1, d_2)$ biregular bipartite graph ($d_1, d_2 \ge 3$), the condition of being Ramanujan changes to $|\lambda_i| \le \sqrt{d_1 - 1} + \sqrt{d_2 - 1}$ for $i = 2, \ldots, (n-1)$. We see that when $d_1 = d_2$, it reduces to the usual definition. For a representation of an unbalanced biregular bipartite Ramanujan network, see Figure 2.
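Definition 4.1 is easy to check numerically for a given layer: assemble the adjacency matrix from the bi-adjacency matrix, compute its eigenvalues, and compare all non-trivial ones against the bound. A sketch, assuming numpy; the helper name is ours:

```python
import numpy as np

def is_bipartite_ramanujan(B: np.ndarray, d1: int, d2: int) -> bool:
    """Check the Ramanujan bound for a (d1, d2)-biregular bipartite graph
    given its bi-adjacency matrix B (use d1 == d2 for the regular case)."""
    m, n = B.shape
    adj = np.block([[np.zeros((m, m)), B],
                    [B.T, np.zeros((n, n))]])
    eig = np.sort(np.abs(np.linalg.eigvalsh(adj)))[::-1]
    # Drop the two trivial eigenvalues +/- sqrt(d1*d2) (= +/- d when d1 == d2).
    nontrivial = eig[2:]
    bound = np.sqrt(d1 - 1) + np.sqrt(d2 - 1)
    return bool(np.all(nontrivial <= bound + 1e-9))

# The 3-regular bipartite layer written out above:
B = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1],
              [1, 1, 0, 0, 1]])
print(is_bipartite_ramanujan(B, 3, 3))  # True: |lambda_i| <= 2*sqrt(2)
```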
Note that we are considering undirected graphs, so the adjacency matrix is a 0-1 symmetric matrix and the eigenvalues are all real. A bipartite graph has adjacency eigenvalues symmetric around 0. A detailed description of Ramanujan graphs can be found in (Hoory et al., 2006, sec. 5.3).

Figure 2: An example of two consecutive neural layers where each layer is a biregular bipartite Ramanujan graph. This can be checked by computing the adjacency eigenvalues of each and comparing with the Ramanujan bound for biregular bipartite graphs. The biregularity of the first layer is (2, 4) and that of the second is (4, 3). Note that if we have an unbalanced biregular bipartite network with $(n_1, n_2)$ vertices and biregularity $(d_1, d_2)$, then they must be related by the equation $n_1 d_1 = n_2 d_2$.

Ramanujan graphs are excellent spectral expanders. They are also extremely difficult to construct. In fact, even the question of existence of (infinite families of) Ramanujan graphs is a non-trivial one, and it is not yet fully resolved for the non-bipartite case. For the bipartite case it has been resolved by the recent works of Marcus, Spielman and Srivastava (Marcus et al., 2015; 2018) and Gribinski and Marcus (Gribinski & Marcus, 2021). The first such constructions of graphs are due to Lubotzky, Phillips and Sarnak (LPS) (Lubotzky et al., 1988), and independently to Margulis (Margulis, 1988). We shall define our pruned network according to these constructions.

4.1 Theoretical framework

Let us begin this section by recalling the Cheeger constant and the Cheeger inequality, which will help to shed light on why good spectral expanding networks are closely related with pruned networks. In the following, a graph $\Gamma = (V, E)$ is a tuple consisting of a vertex set $V$ and an edge set $E$ which is a subset of $V \times V$.

4.1.1 Combinatorial Expansion

Definition 4.2 (Expander and Cheeger constant). A graph $\Gamma = (V, E)$ is an $\epsilon$-vertex expander if for every non-empty subset $X \subseteq V$ with $|X| \le \frac{|V|}{2}$, we have $\frac{|\delta(X)|}{|X|} \ge \epsilon$, where $\delta(X)$ denotes the outer vertex boundary of $X$, i.e., the set of vertices in $\Gamma$ which are connected to a vertex in $X$ but do not lie in $X$. As $X$ runs over all subsets of $V$ satisfying the conditions above, the infimum of $\frac{|\delta(X)|}{|X|}$ is known as the vertex Cheeger constant and is denoted by $h_V(\Gamma)$.

Similar to the above, when we consider the edge boundary, i.e., the set of edges which have one vertex in $X$ and the other outside of $X$, we obtain the edge Cheeger constant $h_E(\Gamma)$. The vertex Cheeger constant $h_V(\Gamma)$ and the edge Cheeger constant $h_E(\Gamma)$ are related by the following equivalence
$$\frac{h_V(\Gamma)}{D} \le h_E(\Gamma) \le h_V(\Gamma),$$
where $D$ denotes the maximum degree of the graph. The equivalence allows us to speak about vertex expansion and edge expansion interchangeably. In the literature, having a high Cheeger constant is also known as having high combinatorial expansion.

Intuitively, given a graph with high vertex (or edge) Cheeger constant, it is more difficult to separate any subset of the vertices from the rest of the graph. This allows for free flow of information throughout the network which the graph defines. Ideally, we want our underlying graphs on which the neural layers are based to have high Cheeger constants. In other words, we are targeting for the base graphs of the sparse neural architectures to have layerwise high Cheeger constants.
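For intuition, the Cheeger constant of Definition 4.2 can be computed on a small graph by brute-force enumeration of vertex subsets; the exponential cost of doing so is precisely what motivates the spectral route taken next. A sketch, assuming numpy, using the edge-boundary normalisation $|\delta(X)|/|X|$ of Definition 4.2:

```python
import itertools
import numpy as np

def edge_cheeger(adj: np.ndarray) -> float:
    """Brute-force edge Cheeger constant: min |boundary edges| / |X| over all
    non-empty X with |X| <= n/2 (normalisation as in Definition 4.2)."""
    n = adj.shape[0]
    best = np.inf
    for size in range(1, n // 2 + 1):
        for X in itertools.combinations(range(n), size):
            X = list(X)
            comp = [v for v in range(n) if v not in X]
            boundary = adj[np.ix_(X, comp)].sum()   # edges leaving X
            best = min(best, boundary / len(X))
    return best

# 6-cycle: a weak expander, since cutting two opposite edges splits it.
cycle = np.zeros((6, 6), dtype=int)
for i in range(6):
    cycle[i, (i + 1) % 6] = cycle[(i + 1) % 6, i] = 1
print(edge_cheeger(cycle))   # 2/3: the worst cut takes 3 consecutive vertices
```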
However, computing the Cheeger constants of graphs is in general an NP-hard problem. To overcome this issue we shall use spectral techniques.

4.1.2 Spectral Expansion

Given a finite undirected graph $\Gamma$, the eigenvalues $\lambda_n \le \ldots \le \lambda_1$ of its adjacency matrix are all real and $\lambda_1 \le D$, with equality iff the graph is $D$-regular (a graph is said to be $d$-regular if there are exactly $d$ edges attached to each vertex). Thus, a $d$-regular bipartite graph is a graph which has the same number of vertices in each partition and every vertex of each partition has exactly $d$ edges attached to it. A graph $\Gamma = (V, E)$ is said to be a spectral expander if the quantities $\{|\lambda_1| - |\lambda_2|,\ |\lambda_1| - |\lambda_k|\}$ are both strictly positive, where $k = n - 1$ if the graph is bipartite and $k = n$ otherwise.

Question 4.3. Why is there a need to establish strong spectral expansion?

The answer to this lies in the fact that spectral expansion implies combinatorial expansion via the discrete Cheeger-Buser inequality (see Appendix A.2.3), and the bigger the spectral expansion, the more combinatorially expanding the base network graphs are. In general, combinatorial expansion (computing the values of the vertex or the edge Cheeger constants) is an NP-hard problem. The natural question then arises whether there is a limit to the spectral expansion or whether it can become as large as possible. This leads us to Ramanujan graphs, which are the optimal spectral expanders. See the appendix for the Alon-Boppana theorem, Thm A.3.

Now, we proceed to the construction of the deterministic Ramanujan networks. We shall need the following notions from arithmetic.

Definition 4.4 (Quadratic residue and Legendre symbol). An integer $q$ is called a quadratic residue modulo $n$ if it is congruent to a perfect square modulo $n$, i.e., if there exists an integer $x$ such that $x^2 \equiv q \pmod{n}$. Otherwise, $q$ is called a quadratic non-residue modulo $n$. Let $p$ be an odd prime number and $a$ be an integer. The Legendre symbol of $a$ and $p$ is defined as
$$\left(\frac{a}{p}\right) = \begin{cases} 1 & \text{if } a \text{ is a quadratic residue modulo } p \text{ and } a \not\equiv 0 \pmod{p}, \\ -1 & \text{if } a \text{ is a quadratic non-residue modulo } p, \\ 0 & \text{if } a \equiv 0 \pmod{p}. \end{cases}$$
Given a prime $a$, there are infinitely many primes $p$ such that the Legendre symbol of $a$ and $p$ is $-1$ (and also infinitely many primes $p$ such that it is $+1$).

Motivation: The bipartite Ramanujan networks will be constructed as Cayley graphs of certain matrix groups over finite fields. In other words, the vertices will be matrices, while the coefficients of the matrices will be elements of certain finite fields. How each matrix element is connected to the other matrix elements (essentially the same thing as saying how each vertex in the graph network is connected to any other vertex) will depend on the quadratic residues and the Legendre symbols defined above. We shall now see which matrix groups we will be working with.

Definition 4.5 ($\mathrm{PGL}_2(K)$, projective linear group over a field $K$). Let $K$ be any field. Let us denote by $\mathrm{GL}_2(K)$ the group of invertible 2-by-2 matrices with coefficients in $K$, i.e., the matrices with non-zero determinant,
$$\mathrm{GL}_2(K) = \left\{ \begin{pmatrix} a & b \\ c & d \end{pmatrix} : ad - bc \neq 0 \right\}, \qquad Z(K) = \left\{ \begin{pmatrix} a & 0 \\ 0 & a \end{pmatrix} \right\},$$
and let $\mathrm{PGL}_2(K)$ be the quotient group $\mathrm{PGL}_2(K) = \mathrm{GL}_2(K)/Z(K)$. In these definitions it is assumed that all entries are elements of the field $K$.

Remark: If $K = \mathbb{F}_q$ (the finite field of $q$ elements), then a simple computation gives the size of $\mathrm{PGL}_2(\mathbb{F}_q)$ to be $|\mathrm{PGL}_2(\mathbb{F}_q)| = q(q^2 - 1)$.
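Both arithmetic ingredients above are straightforward to compute: the Legendre symbol via Euler's criterion $a^{(p-1)/2} \bmod p$, and the order of $\mathrm{PGL}_2(\mathbb{F}_q)$ by counting invertible 2-by-2 matrices up to non-zero scalars. A sketch in pure Python (the function names are ours, and the brute-force count is only meant for small prime $q$):

```python
from itertools import product

def legendre(a: int, p: int) -> int:
    """Legendre symbol (a/p) for an odd prime p, via Euler's criterion."""
    a %= p
    if a == 0:
        return 0
    return 1 if pow(a, (p - 1) // 2, p) == 1 else -1

def pgl2_order(q: int) -> int:
    """|PGL2(F_q)| for a prime q, by counting invertible 2x2 matrices over F_q
    and dividing by the q - 1 non-zero scalar matrices."""
    gl2 = sum(1 for a, b, c, d in product(range(q), repeat=4)
              if (a * d - b * c) % q != 0)
    return gl2 // (q - 1)

print(legendre(5, 17))                 # -1: 5 is not a square modulo 17
print(pgl2_order(5), 5 * (5**2 - 1))   # 120 120, i.e. q(q^2 - 1)
```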
In Section 4.3, we shall use this property to construct bipartite $\frac{q(q^2-1)}{2} \times \frac{q(q^2-1)}{2}$ Ramanujan networks. For more details see Appendix A.3.

4.2 Regular Ramanujan graphs

Let $p, q \equiv 1 \pmod{4}$ be distinct odd primes (the condition of $1 \pmod 4$ can be removed at the cost of making the analysis more technical and complicated; we shall mention later how this is achieved). The graph $X^{p,q}$ is constructed using the following general method (Lubotzky et al., 1988).

1. Choice of group: It will be a Cayley graph (for the notion of Cayley graphs, see Appendix A.1) on $\mathrm{PGL}_2(\mathbb{F}_q)$. Thus the vertices of the graph will be elements of $\mathrm{PGL}_2(\mathbb{F}_q)$. Next, we need to take care of the edges.

2. Method of selecting certain special matrices within $\mathrm{PGL}_2(\mathbb{F}_q)$: Consider the equation $a_0^2 + a_1^2 + a_2^2 + a_3^2 = p$. Jacobi's four square theorem states that there are $p + 1$ solutions to this equation with $a_0 > 0$ odd (i.e., $a_0 \equiv 1 \pmod 2$) and $a_1, a_2, a_3$ even. The role of the finite field $\mathbb{F}_q$ will now come into play. We have a set of solutions $\{(a_0, a_1, a_2, a_3)\}$ (in the integers) of size $p + 1$ to the equation $a_0^2 + a_1^2 + a_2^2 + a_3^2 = p$. For each such solution $(a_0, a_1, a_2, a_3)$ consider the matrix
$$\begin{pmatrix} a_0 + i a_1 & a_2 + i a_3 \\ -a_2 + i a_3 & a_0 - i a_1 \end{pmatrix},$$
where $i$ is some fixed integer solution to $i^2 \equiv -1 \pmod q$. We will obtain a set of $(p + 1)$ matrices, each of which belongs to $\mathrm{PGL}_2(\mathbb{F}_q)$ (as can be checked from the definition of $\mathrm{PGL}_2(\mathbb{F}_q)$).

For example, suppose we take $q = 5$, $p = 13$, and $i = 3$ (where $i^2 \equiv -1 \pmod 5$). Let us choose the solution $a_0 = 3$, $a_1 = 0$, $a_2 = 0$, $a_3 = 2$ to the equation $a_0^2 + a_1^2 + a_2^2 + a_3^2 = 13$. Now, we construct the matrix
$$\begin{pmatrix} a_0 + i a_1 & a_2 + i a_3 \\ -a_2 + i a_3 & a_0 - i a_1 \end{pmatrix} = \begin{pmatrix} 3 + 3 \cdot 0 & 0 + 3 \cdot 2 \\ -0 + 3 \cdot 2 & 3 - 3 \cdot 0 \end{pmatrix} = \begin{pmatrix} 3 & 6 \\ 6 & 3 \end{pmatrix}.$$
Next, we reduce the entries modulo 5 (since $q = 5$). Thus, an example of such a chosen matrix is
$$\begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}.$$
This matrix is one of the $p + 1 = 14$ matrices generated from the solutions to the equation $a_0^2 + a_1^2 + a_2^2 + a_3^2 = 13$ using the process described.

3. Form the generating set $S$ of the Cayley graph (see Appendix A.1) to be the set of the $(p + 1)$ matrices created above. The vertices are the elements of $\mathrm{PGL}_2(\mathbb{F}_q)$ and the edges are given by $\{(x, xs) : x \in \mathrm{PGL}_2(\mathbb{F}_q), s \in S\}$. In other words, $X^{p,q} = \mathrm{Cay}(\mathrm{PGL}_2(\mathbb{F}_q), S)$, cf. A.1.

The graphs are bipartite iff $p$ is not a quadratic residue modulo $q$, or in other words the Legendre symbol $\left(\frac{p}{q}\right) = -1$. The bipartite graphs $X^{p,q}$ will be $(p + 1)$-regular, of size $\frac{q(q^2-1)}{2} \times \frac{q(q^2-1)}{2}$, and are Ramanujan. For a proof see (Lubotzky et al., 1988).

Remark: If $p \equiv 3 \pmod 4$, then a similar strategy is employed, except that in this case one looks at solutions of $a_0^2 + a_1^2 + a_2^2 + a_3^2 = p$ with $a_0 \equiv 0 \pmod 2$. See (Musitelli & de la Harpe, 2006, sec. 2).

4.3 Construction for the fully connected layers

For the fully connected layers consisting of balanced bipartite graphs of size $m \times m$ and regularity $m$, where $m$ is the number of vertices on each side of the graph, we prune them at initialization in accordance with the Ramanujan graph structure of Lubotzky-Phillips-Sarnak (Lubotzky et al., 1988).

1. Select the largest prime $q \equiv 1 \pmod 4$ such that $\frac{q(q^2-1)}{2} \le m$.
2. Select a prime $p \equiv 1 \pmod 4$ such that the Legendre symbol (Definition 4.4) $\left(\frac{p}{q}\right) = -1$ (note that this choice is always possible, as given a prime $q$ there are infinitely many primes $p$ satisfying this property).
3. Follow the construction using these values of $p$ and $q$ as given in Section 4.2.

Selecting the minimum possible value of $p$ will give us the sparsest possible Ramanujan graph.
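Steps 1 and 2 above are a few lines of code. A sketch, in pure Python with hypothetical helper names, which for a 4096-wide layer returns the pair (p, q) = (5, 17) used in the example that follows:

```python
def is_prime(n: int) -> bool:
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

def legendre(a: int, p: int) -> int:
    a %= p
    return 0 if a == 0 else (1 if pow(a, (p - 1) // 2, p) == 1 else -1)

def choose_fc_params(m: int):
    """Steps 1-2 of Section 4.3 for an m x m fully connected layer."""
    # Largest prime q = 1 (mod 4) with q(q^2 - 1)/2 <= m.
    q = max(q for q in range(5, m + 1)
            if q % 4 == 1 and is_prime(q) and q * (q * q - 1) // 2 <= m)
    # Smallest prime p = 1 (mod 4), p != q, that is a non-residue mod q,
    # so that the LPS graph X^{p,q} is bipartite.
    p = next(p for p in range(5, 100 * q)
             if p % 4 == 1 and is_prime(p) and p != q and legendre(p, q) == -1)
    return p, q

print(choose_fc_params(4096))   # (5, 17): a 2448 x 2448 layer of regularity 6
```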
For example, considering a 4096 × 4096 original fully connected layer, our choice of $(p, q) = (5, 17)$ gives rise to bipartite sparse Ramanujan graphs of size 2448 × 2448 with regularity 6.

Note: Here we have taken $p \equiv 1 \pmod 4$, but we could have also chosen $p \equiv 3 \pmod 4$ or even $p \equiv 2 \pmod 4$ (see the construction of cubic Ramanujan graphs (Chiu, 1992)), resulting in even sparser networks.

4.4 Biregular Ramanujan graphs

A bipartite graph is said to be $(d_1, d_2)$ biregular if each bi-partition has fixed regularity $d_1$, $d_2$ respectively. Note that a simple computation reveals that if $(n_1, n_2)$ are the bi-partition sizes, then $n_1 d_1 = n_2 d_2$. Thus three parameters are needed to specify these types of graphs. One way to construct biregular Ramanujan graphs is the following, see (Burnwal et al., 2021):

1. Select a prime $q$ and create a $q \times q$ cyclic shift permutation matrix $P = [P]_{ij}$ with $[P]_{ij} = 1$ if $j \equiv i - 1 \pmod q$ and $0$ otherwise.
2. Select any value of $l \le q$ and create the bi-adjacency matrix
$$B = \begin{pmatrix} I_q & I_q & \cdots & I_q \\ I_q & P & \cdots & P^{l-1} \\ I_q & P^2 & \cdots & P^{2(l-1)} \\ \vdots & \vdots & & \vdots \\ I_q & P^{q-1} & \cdots & P^{(q-1)(l-1)} \end{pmatrix},$$
where $I_q$ is the $q \times q$ identity matrix and $P$ is as above.
3. The bi-adjacency matrix $B$ can be used to generate the adjacency matrix of an $m \times n$ bipartite graph as
$$\mathrm{Adj} = \begin{pmatrix} 0_{m \times m} & B_{m \times n} \\ B^T_{m \times n} & 0_{n \times n} \end{pmatrix}.$$
Here $B$ is a $q^2 \times lq$ matrix and the bipartite graph is of size $q^2 \times lq$ with biregularity $(l, q)$.

The graphs whose bi-adjacency matrices are represented as $B$ are Ramanujan. These graphs are explicit realisations of the Ramanujan r-coverings of the full (k, l) biregular bipartite graph on k + l vertices, as shown in (Hall et al., 2018, cor 2.2).

Note: The condition $l \le q$ is not a necessary requirement. The reason behind this flexibility lies in the specific properties of the bi-adjacency matrix $B$, which has dimensions $q^2 \times lq$. The critical insight is that if $B$ creates a Ramanujan graph, then its transpose, denoted $B^T$ (with dimensions $lq \times q^2$), also creates a Ramanujan graph.

4.5 Construction for the convolutional layers

The convolutional layer can be thought of as a matrix of dimensions $|N_{out}| \times |N_{in}| \times |K_w| \times |K_h|$, where $|N_{out}|$ is the number of output channels, $|N_{in}|$ is the number of input channels, $|K_w|$ is the kernel width and $|K_h|$ is the kernel height. This is considered to be a bipartite graph with $|V_{left}| = |N_{in}| \cdot |K_w| \cdot |K_h|$ and $|V_{right}| = |N_{out}|$, where each vertex of $V_{left}$ has an edge with each vertex of $V_{right}$.

1. Select the largest prime $q$ such that $q^2 \le |V_{left}|$.
2. Select the largest value of $l$ such that $l \le q$ and $lq \le |V_{right}|$.
3. Follow the construction as given in Section 4.4.

The pruning mask thus obtained is a matrix of size $q^2 \times lq$, with the effective number of connections being equal to $q^2 \cdot l$. The original pruning mask of the convolutional layer has size $|N_{out}| \times [|N_{in}| \cdot |K_w| \cdot |K_h|]$. By construction the obtained Ramanujan graph is actually a subgraph of the original pruning mask, and thus the entries in the original mask not part of the constructed Ramanujan graph are set to 0.

Note: The above construction assumes $|V_{left}| \le |V_{right}|$. If this is not the case, we can simply swap the roles of $q$ and $l$, since from the construction (4.4) we know $B^T$ also creates Ramanujan graphs.
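The bi-adjacency matrix $B$ of Section 4.4 and the per-layer choice of $(q, l)$ in Section 4.5 are both mechanical. The sketch below builds $B$ from the cyclic shift matrix $P$ and, for a convolutional layer, fits $q^2$ into the smaller side and $lq$ into the larger side (the swapped orientation of the note above, which is the relevant case when $|V_{left}| > |V_{right}|$); with this choice it reproduces the $(q, l)$ values of Table 3, e.g. q = 11, l = 52 for the 128 x 64 x 3 x 3 layer. The helper names are ours.

```python
import numpy as np

def is_prime(n: int) -> bool:
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

def biadjacency(q: int, l: int) -> np.ndarray:
    """q^2 x lq bi-adjacency matrix of Section 4.4: block (i, j) is P^(i*j)."""
    P = np.roll(np.eye(q, dtype=np.int8), -1, axis=1)   # cyclic shift matrix
    blocks = [[np.linalg.matrix_power(P, i * j) for j in range(l)]
              for i in range(q)]
    return np.block(blocks)

def conv_layer_ql(n_out: int, n_in: int, kw: int = 3, kh: int = 3):
    """Pick (q, l) for an n_out x (n_in*kw*kh) convolutional pruning mask:
    q^2 must fit in the smaller side and l*q in the larger one."""
    small, large = sorted((n_out, n_in * kw * kh))
    q = max(q for q in range(2, small + 1) if is_prime(q) and q * q <= small)
    l = large // q
    return q, l

q, l = conv_layer_ql(128, 64)          # the [128 x 64 x 3 x 3] layer
print(q, l)                            # 11 52, as in Table 3
B = biadjacency(q, l)
print(B.shape, B.sum(axis=1)[0], B.sum(axis=0)[0])   # (121, 572) 52 11
```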
4.6 Time Complexity of Construction

The time complexity for constructing Ramanujan graphs for fully connected layers following Section 4.3 is mainly dominated by the creation of the $\mathrm{PGL}_2$ group, in which we first need to create the generator matrices and then find the equivalence classes, which takes time $O(q^4 \cdot q)$. The solution to the four square problem has complexity $O(p^4)$, giving an overall complexity of $O(q^5 + p^4)$. Similarly, for constructing Ramanujan graphs for convolutional layers as given in Section 4.5, the time complexity comes out to be $O(q^2 \cdot lq)$, since we need to create the bi-adjacency matrix $B$.

4.7 General bipartite networks

In the case of bipartite networks with arbitrary sizes, one can achieve as sparse Ramanujan graphs as possible. It has recently been proven by Marcus, Spielman and Srivastava that for the regular case, for each degree $d \ge 3$, infinite families of bipartite Ramanujan graphs exist. This is also true for the biregular case, for each pair $(d_1, d_2)$ with $d_1, d_2 \ge 3$, $d_2 = k d_1$, $k \ge 2$. Further, they showed the existence of these types of graphs of all sizes. Their method of proof is probabilistic and existential in nature; it does not give explicit families of bipartite Ramanujan graphs. However, there now exist polynomial time algorithms, (Cohen, 2016) for the regular case and (Gribinski & Marcus, 2021) for the biregular case, with which we can extract explicit Ramanujan graphs. For the regular case, we fix an integer $n \ge 3$ and a degree $d \ge 3$ and in the output we obtain a $d$-regular $n \times n$ bipartite Ramanujan graph, while for the biregular case, we fix three integers $n, k, d$ with $n > 2$, $d > 2$, $k \ge 2$ and obtain a $(d, kd)$ biregular Ramanujan graph of size $kn \times n$.

5 Experimental Methodology and Results

The goal of our experiments is to study the effectiveness of deterministic Ramanujan graph based sparse network construction.

5.1 Datasets and architectures

The datasets used for the experiments are CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). The experiments are performed over a variety of architectures including VGG13, VGG16, VGG19 (Simonyan & Zisserman, 2014), AlexNet (Krizhevsky et al., 2012), ResNet18 and ResNet34 (He et al., 2016) to show the robustness of our method. We proceed in two parts. In the first part, we prune the fully connected layers by replacing them with sparse Ramanujan graphs, which is applicable for the VGG13, VGG19 and AlexNet architectures. In the second part, we prune the whole network including the convolutional layers and the fully connected layers, which is applicable for all the architectures considered in our experiment. The performance of the fully dense and the pruned networks is compared in each case. Finally, we compare the performance of our method against various sparse neural networks obtained by state-of-the-art pruning at initialisation algorithms for the VGG16 and ResNet34 architectures. Training parameters for all of the architectures are the same and are summarized in Table 1. We report accuracy on a randomly split 16% test set for all the experiments.

Table 1: Training parameters for the experiments
  Epochs: 200
  Train batch size: 256
  Test batch size: 128
  Learning rate: 0.1
  LR decay, epoch: 10x at [100, 150]
  Optimizer: SGD
  Weight decay: 0.0005
  Momentum: 0.9
  Weight initialization: Kaiming Uniform
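Table 1 corresponds to a standard SGD recipe; a minimal sketch of the equivalent optimizer, schedule and weight initialization, assuming PyTorch and a hypothetical model object:

```python
import torch
import torch.nn as nn

def configure_training(model: nn.Module):
    """Optimizer, schedule and weight init matching Table 1 (SGD, lr 0.1,
    momentum 0.9, weight decay 5e-4, 10x LR decay at epochs 100 and 150)."""
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            nn.init.kaiming_uniform_(m.weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[100, 150], gamma=0.1)
    return optimizer, scheduler
```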
5.2 Methods compared

The performance of the pruned networks is compared with that of a fully dense network having a similar number of nodes in each layer. We also compare our method with other sparse neural networks obtained by state-of-the-art pruning at initialization techniques. Like our approach, these methods generate sparse connectivity structures which are initialized to random weight values and then trained using backpropagation. The pruning methods compared include Random (Liu et al., 2022), ERK (Evci et al., 2020; Mocanu et al., 2018), GraSP (Wang et al., 2020), and SynFlow (Tanaka et al., 2020). The number of iterations used for SynFlow is 100, and for ERK and GraSP it is 1, keeping the rest of the hyperparameters the same as in Table 1.

5.3 Network construction parameters

Experiments are conducted in two parts: 1) considering sparse fully connected layers only, and 2) considering the entire network to be sparse, including the convolutional and the fully connected layers.

For the first part, we consider sparse fully connected layers using the construction of Ramanujan graphs as given in Section 4.3. We have used a dedicated fully connected layer of size 4096 × 4096 for the VGG13, VGG19 and AlexNet architectures, and the values used for $p$ and $q$ (Section 4.3) are given in Table 2. This results in the fully connected layer becoming of size $q(q^2-1)/2 \times q(q^2-1)/2$, with the effective number of connections in the layer being equal to $q(q^2-1)/2 \cdot (p+1)$.

Table 2: Values of p and q used to generate the sparse fully connected layer
  Model: VGG13, VGG19, AlexNet
  FC layer size: 4096 × 4096
  p: 29, 109
  q: 17

For the second part, we consider sparse convolutional layers using the construction of Ramanujan graphs as given in Section 4.5, in addition to the fully connected layers. The sizes of the convolutional layers being pruned and the values of $l$ and $q$ used (Section 4.5) for the VGG16 and ResNet34 architectures are given in Table 3, while those for the rest of the architectures are given in Table 7. The network density (ratio of the number of edges in a sparse network to that in a fully dense network) reported in Section 5.4 is calculated by dividing the number of connections, which is equal to the sum of $q(q^2-1)/2 \cdot (p+1)$ (number of connections in the fully connected layer) and $q^2 \cdot l$ (number of connections in each of the convolutional layers), by the total number of connections present in the fully dense network with a similar number of nodes.

Table 3: Values of q and l used to generate Ramanujan graphs for the layers of VGG16 and ResNet34

VGG16 (convolutional layer size, count: q, l):
  [128 × 64 × 3 × 3] (×1): q = 11, l = 52
  [128 × 128 × 3 × 3] (×1): q = 11, l = 104
  [256 × 128 × 3 × 3] (×1): q = 13, l = 88
  [256 × 256 × 3 × 3] (×2): q = 13, l = 177
  [512 × 256 × 3 × 3] (×1): q = 19, l = 121
  [512 × 512 × 3 × 3] (×5): q = 19, l = 242

ResNet34 (convolutional layer size, count: q, l):
  [64 × 64 × 3 × 3] (×6): q = 7, l = 82
  [128 × 64 × 3 × 3] (×1): q = 11, l = 52
  [128 × 128 × 3 × 3] (×7): q = 11, l = 104
  [256 × 128 × 3 × 3] (×1): q = 13, l = 88
  [256 × 256 × 3 × 3] (×11): q = 13, l = 177
  [512 × 256 × 3 × 3] (×1): q = 19, l = 121
  [512 × 512 × 3 × 3] (×5): q = 19, l = 242
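The density bookkeeping described above is simple arithmetic; the sketch below tallies kept versus dense connections for a hypothetical mix of one pruned convolutional layer and the dedicated fully connected layer (the numbers are illustrative, not a full per-architecture count):

```python
def conv_density(layers):
    """layers: list of (n_out, n_in, kw, kh, q, l, count) tuples.
    Returns (kept_connections, dense_connections) for the pruned conv layers."""
    kept = dense = 0
    for n_out, n_in, kw, kh, q, l, count in layers:
        kept += q * q * l * count            # connections in the Ramanujan mask
        dense += n_out * n_in * kw * kh * count
    return kept, dense

def fc_density(p, q, m):
    """Sparse FC layer of Section 4.3 inside an m x m dense layer."""
    return q * (q * q - 1) // 2 * (p + 1), m * m

# Illustrative example: one conv layer plus the dedicated FC layer.
kept_c, dense_c = conv_density([(128, 64, 3, 3, 11, 52, 1)])
kept_f, dense_f = fc_density(29, 17, 4096)
density = (kept_c + kept_f) / (dense_c + dense_f)
print(f"{density:.2%}")
```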
5.4 Results and discussion

We study the accuracy of sparse networks obtained by our technique for various architectures and datasets. The accuracy is compared with that of a fully dense network with a similar number of nodes. Results for the first part of the experiment, where only the intermediate fully connected layer is sparse, are summarized in Table 4. By definition, network density is the ratio of the number of edges in the sparse network to the number of edges in a fully dense network. It can be observed that the Ramanujan graph construction allows us to sparsify the fully connected layer to a network density as low as 0.43% while still having accuracy comparable to that of the similar fully dense model.

Table 4: Accuracy of VGG and AlexNet when only the FC layer is sparsified (per model: 4096 × 4096 unpruned / 2448 × 2448 with regularity 110, density 1.6% / 2448 × 2448 with regularity 30, density 0.43%)

Dataset: CIFAR-10
  VGG13: 92% / 91% / 91%
  VGG19: 92% / 92% / 92%
  AlexNet: 86% / 84% / 86%

Dataset: CIFAR-100
  VGG13: 66% / 66% / 63%
  VGG19: 66% / 67% / 63%
  AlexNet: 67% / 66% / 66%

For the second part of the experiment, where we sparsify the complete network including the convolutional layers and the fully connected layer, we could achieve an overall network density of 2% to 5% for VGG, 2.3% for AlexNet and about 5% for the ResNet architectures. The accuracy of the models on the CIFAR-10 and CIFAR-100 datasets is summarized in Table 5. A small accuracy drop is observed as compared to the fully dense network.

Table 5: Accuracy of various architectures when the complete network is sparsified, including the convolutional and the FC layers (per model: unpruned accuracy / pruned accuracy / network density)

Dataset: CIFAR-10
  VGG13: 92% / 90% / 1.7%
  VGG16: 93% / 91% / 5.3%
  VGG19: 92% / 89% / 2.4%
  AlexNet: 86% / 82% / 2.3%
  ResNet18: 87% / 86% / 5.6%
  ResNet34: 88% / 86% / 5.2%

Dataset: CIFAR-100
  VGG16: 70% / 66% / 5.3%
  ResNet18: 55% / 54% / 5.6%
  ResNet34: 57% / 56% / 5.2%

Dataset: Tiny-ImageNet
  VGG16: 43% / 40% / 5.3%
  ResNet34: 56% / 48% / 5.2%

Finally, we compare the performance of the proposed Ramanujan sparse network initialization with sparse networks obtained by state-of-the-art pruning at initialization techniques. The comparison of accuracy between the various pruning at initialization techniques at a network density of about 5% is shown for the VGG16 and ResNet34 architectures in Table 6.

Table 6: Comparison of our method against sparse neural networks obtained by pruning at initialization

Dataset: CIFAR-10
  VGG16 (network density 5.3%): Unpruned 93%, Our Method 91%, Random 89%, ERK 91%, SynFlow 92%
  ResNet34 (network density 5.2%): Unpruned 88%, Our Method 86%, Random 81%, ERK 86%, GraSP 86%

Dataset: CIFAR-100
  VGG16 (network density 5.3%): Unpruned 70%, Our Method 66%, Random 60%, ERK 62%, SynFlow 65%
  ResNet34 (network density 5.2%): Unpruned 57%, Our Method 56%, Random 50%, ERK 56%, GraSP 56%

Dataset: Tiny-ImageNet
  VGG16 (network density 5.3%): Unpruned 43%, Our Method 40%, Random 35%, ERK 39%, SynFlow 40%
  ResNet34 (network density 5.2%): Unpruned 56%, Our Method 48%, Random 45%, ERK 47%, GraSP 29%

We can observe that the proposed sparse neural network can achieve comparable accuracy to other sparse networks obtained by iterative pruning at initialization techniques. It also significantly outperforms the randomly connected sparse network. The accuracy of the Ramanujan graph based sparse networks remains close to that of a similar fully dense network even at a low network density.

6 Conclusion and Future Work

We presented a technique for constructing sparse neural networks that can be trained to a high accuracy. The method is based on a deterministic Ramanujan graph construction technique using Cayley graphs and Ramanujan coverings. The construction is used for generating sparse and regular bipartite graphs which are stacked to obtain fully connected and convolutional layers for CNNs and ResNets.
Experimental results on benchmark data sets demonstrate that even a very sparse Ramanujan network structure can achieve comparable accuracy to that of a fully dense network with a similar number of layers and nodes. With the success of sparse deterministic Ramanujan neural networks, our further direction of work is to implement these in the case of transformers and to study sparse Ramanujan transformer networks. The proposed deterministic construction technique is expected to significantly reduce the number of parameters and training time while maintaining high accuracy.

References

Simon Alford, Ryan Robinett, Lauren Milechin, and Jeremy Kepner. Training behavior of sparse neural network topologies. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), 2019. doi: 10.1109/HPEC.2019.8916385.

Noga Alon. Eigenvalues and expanders. Combinatorica, 6(2):83-96, Jun 1986. ISSN 1439-6912. doi: 10.1007/BF02579166. URL https://doi.org/10.1007/BF02579166.

Noga Alon and Vitali D Milman. λ1, isoperimetric inequalities for graphs, and superconcentrators. Journal of Combinatorial Theory, Series B, 38(1):73-88, 1985.

Goulnara Arzhantseva and Arindam Biswas. Logarithmic girth expander graphs of SLn(Fp). Journal of Algebraic Combinatorics, 56(3):691-723, Nov 2022. ISSN 1572-9192. doi: 10.1007/s10801-022-01128-z. URL https://doi.org/10.1007/s10801-022-01128-z.

Arindam Biswas. On a Cheeger type inequality in Cayley graphs of finite groups. European Journal of Combinatorics, 81:298-308, October 2019. doi: 10.1016/j.ejc.2019.06.009. URL https://doi.org/10.1016/j.ejc.2019.06.009.

Arindam Biswas and Jyoti Prakash Saha. A Cheeger type inequality in finite Cayley sum graphs. Algebraic Combinatorics, 4(3):517-531, 2021. doi: 10.5802/alco.166. URL https://alco.centre-mersenne.org/articles/10.5802/alco.166/.

Arindam Biswas and Jyoti Prakash Saha. A spectral bound for vertex-transitive graphs and their spanning subgraphs. Algebraic Combinatorics, 6(3):689-706, 2023. doi: 10.5802/alco.278. URL https://alco.centre-mersenne.org/articles/10.5802/alco.278/.

Emmanuel Breuillard, Ben Green, Robert Guralnick, and Terence Tao. Expansion in finite simple groups of Lie type. Journal of the European Mathematical Society, 17(6):1367-1434, 2015. doi: 10.4171/jems/533. URL https://doi.org/10.4171/jems/533.

Shantanu Prasad Burnwal, Kaneenika Sinha, and Mathukumalli Vidyasagar. New and explicit constructions of unbalanced Ramanujan bipartite graphs. The Ramanujan Journal, 57(3):1043-1069, April 2021. URL https://doi.org/10.1007/s11139-021-00384-0.

Tianlong Chen, Xuxi Chen, Xiaolong Ma, Yanzhi Wang, and Zhangyang Wang. Coarsening the granularity: Towards structurally sparse lottery tickets. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 3025-3039. PMLR, 17-23 Jul 2022.

Zhuangzhi Chen, Jingyang Xiang, Yao Lu, Qi Xuan, Zhen Wang, Guanrong Chen, and Xiaoniu Yang. RGP: Neural network pruning through regular graph with edges swapping. IEEE Transactions on Neural Networks and Learning Systems, pp. 1-13, 2023a. doi: 10.1109/TNNLS.2023.3280899.

Zhuangzhi Chen, Jingyang Xiang, Yao Lu, Qi Xuan, Zhen Wang, Guanrong Chen, and Xiaoniu Yang. RGP: Neural network pruning through regular graph with edges swapping. IEEE Transactions on Neural Networks and Learning Systems, 2023b. doi: 10.1109/TNNLS.2023.3280899.
Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: taxonomy, comparison, analysis, and recommendations. arXiv preprint arXiv:2308.06767, 2023.

Patrick Chiu. Cubic Ramanujan graphs. Combinatorica, 12:275-285, 1992.

Fan Chung. A generalized Alon-Boppana bound and weak Ramanujan graphs. The Electronic Journal of Combinatorics, 23(3), July 2016. doi: 10.37236/5933. URL https://doi.org/10.37236/5933.

Michael B Cohen. Ramanujan graphs in polynomial time. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 276-281. IEEE, 2016.

Giuliana Davidoff, Peter Sarnak, and Alain Valette. Elementary Number Theory, Group Theory and Ramanujan Graphs. London Mathematical Society Student Texts. Cambridge University Press, 2003.

Jozef Dodziuk. Difference equations, isoperimetric inequality and transience of certain random walks. Transactions of the American Mathematical Society, 284(2):787-794, 1984.

Kiara Esguerra, Muneeb Nasir, Tong Boon Tang, Afidalina Tumian, and Eric Tatt Wei Ho. Sparsity-aware orthogonal initialization of deep neural networks. IEEE Access, 2023.

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. arXiv, 2020.

Jonas Fischer and Rebekka Burkholz. Plant 'n' seek: Can you find the winning ticket? In International Conference on Learning Representations, 2022.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? In International Conference on Learning Representations, 2020.

Aurelien Gribinski and Adam W. Marcus. Existence and polynomial time construction of biregular, bipartite Ramanujan graphs of all degrees, 2021.

The GAP Group. GAP - Groups, Algorithms, and Programming, Version 4.12.2, 2022. URL https://www.gap-system.org.

Chris Hall, Doron Puder, and William F. Sawin. Ramanujan coverings of graphs. Advances in Mathematics, 323:367-410, 2018. ISSN 0001-8708. doi: 10.1016/j.aim.2017.10.042. URL https://www.sciencedirect.com/science/article/pii/S0001870817303146.

Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=vXj_ucZQ4hA.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Duc Hoang, Shiwei Liu, Radu Marculescu, and Zhangyang Wang. Revisiting pruning at initialization through the lens of Ramanujan graph. In International Conference on Learning Representations, 2023.

Shlomo Hoory. A lower bound on the spectral radius of the universal cover of a graph. Journal of Combinatorial Theory, Series B, 93(1):33-43, January 2005. doi: 10.1016/j.jctb.2004.06.001. URL https://doi.org/10.1016/j.jctb.2004.06.001.

Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439-561, 2006.
J. Kepner and R. Robinett. RadiX-Net: Structured sparse matrices for deep neural networks. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 268-274. IEEE Computer Society, 2019.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

Steinar Laenen. One-shot neural network pruning via spectral graph sparsification. In TAGML Workshop, ICML, 2023.

Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, and Mykola Pechenizkiy. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. In International Conference on Learning Representations, 2022.

Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Kumar Jaiswal, and Zhangyang Wang. Sparsity may cry: Let us fail (current) sparse neural networks together! In International Conference on Learning Representations, 2023.

Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.

A. Lubotzky, R. Phillips, and P. Sarnak. Ramanujan graphs. Combinatorica, 8:261-277, 1988.

Xiaolong Ma, Geng Yuan, Xuan Shen, Tianlong Chen, Xuxi Chen, Xiaohan Chen, Ning Liu, Minghai Qin, Sijia Liu, Zhangyang Wang, and Yanzhi Wang. Sanity checks for lottery tickets: Does your winning ticket really win the jackpot? In Advances in Neural Information Processing Systems, volume 34, pp. 12749-12760, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/6a130f1dc6f0c829f874e92e5458dced-Paper.pdf.

Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. Exploring the regularity of sparse structure in convolutional neural networks, 2017. URL https://arxiv.org/abs/1705.08922.

Adam W. Marcus, Daniel A. Spielman, and Nikhil Srivastava. Interlacing families I: Bipartite Ramanujan graphs of all degrees. Annals of Mathematics, 182:307-325, 2015.

Adam W. Marcus, Daniel A. Spielman, and Nikhil Srivastava. Interlacing families IV: Bipartite Ramanujan graphs of all sizes. SIAM Journal on Computing, 47(6):2488-2509, 2018. doi: 10.1137/16M106176X. URL https://doi.org/10.1137/16M106176X.

G. A. Margulis. Explicit constructions of graphs without short cycles and low density codes. Combinatorica, 2(1):71-78, Mar 1982. ISSN 1439-6912. doi: 10.1007/BF02579283. URL https://doi.org/10.1007/BF02579283.

Grigorii Aleksandrovich Margulis. Explicit group-theoretical constructions of combinatorial schemes and their application to the design of expanders and concentrators. Problemy Peredachi Informatsii, 24(1):51-60, 1988.

Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 2018.

M. Morgenstern. Existence and explicit constructions of q + 1 regular Ramanujan graphs for every prime power q. Journal of Combinatorial Theory, Series B, 62(1):44-62, 1994. ISSN 0095-8956. doi: 10.1006/jctb.1994.1054. URL https://www.sciencedirect.com/science/article/pii/S0095895684710549.

Antoine Musitelli and Pierre de la Harpe. Expanding graphs, Ramanujan graphs, and 1-factor perturbations. Bulletin of the Belgian Mathematical Society - Simon Stevin, 13(4):673-680, 2006.
A. Nilli. On the second eigenvalue of a graph. Discrete Mathematics, 91(2):207-210, 1991. ISSN 0012-365X. doi: 10.1016/0012-365X(91)90112-F. URL https://www.sciencedirect.com/science/article/pii/0012365X9190112F.

Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

Bithika Pal, Arindam Biswas, Sudeshna Kolay, Pabitra Mitra, and Biswajit Basu. A study on the Ramanujan graph property of winning lottery tickets. In International Conference on Machine Learning, pp. 17186-17201, 2022.

Hoang Pham, The Anh Ta, Shiwei Liu, Lichuan Xiang, Dung Le, Hongkai Wen, and Long Tran-Thanh. Towards data-agnostic pruning at initialization: What makes a good sparse mask? In Advances in Neural Information Processing Systems, pp. 80044-80065, 2023.

Ameya Prabhu, Girish Varma, and Anoop Namboodiri. Deep expander networks: Efficient deep networks from graph theory. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 20-35, 2018.

Yehonathan Refael, Iftach Arbel, and Wasim Huleihel. Learning k-level structured sparse neural networks using group envelope regularization. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=XPLXYr7NlR.

Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1gSj0NKvB.

Peter Sarnak. Some Applications of Modular Forms. Cambridge Tracts in Mathematics. Cambridge University Press, 1990.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.

Olaf Sporns. Graph Theory Methods for the Analysis of Neural Connectivity Patterns, pp. 171-185. Springer US, 2003. ISBN 978-1-4615-1079-6. doi: 10.1007/978-1-4615-1079-6_12. URL https://doi.org/10.1007/978-1-4615-1079-6_12.

Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, and Dimitris Papailiopoulos. Rare gems: Finding lottery tickets at initialization. In Advances in Neural Information Processing Systems, volume 35, pp. 14529-14540, 2022.

James Stewart, Umberto Michieli, and Mete Ozay. Data-free model pruning at initialization via expanders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4518-4523, 2023.

Edric Tam and David Dunson. Spectral gap regularization of neural networks. arXiv e-prints, arXiv:2304.03096, April 2023. doi: 10.48550/arXiv.2304.03096.

Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc.

Artem Vysogorets and Julia Kempe. Connectivity matters: Neural network pruning through the lens of effective sparsity. J. Mach. Learn. Res., 24, 2023.

Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.
Huan Wang, Can Qin, Yue Bai, Yulun Zhang, and Yun Fu. Recent advances on neural network pruning at initialization. arXiv preprint arXiv:2103.06460, 2021.

Jiaxuan You, Jure Leskovec, Kaiming He, and Saining Xie. Graph structure of neural networks. In International Conference on Machine Learning, volume 119, pp. 10881-10891, 2020.

A.1 Cayley graph

Let $G$ be a group and let $S$ be a subset of $G$ that is closed under inversion, i.e., $S = S^{-1}$. The corresponding Cayley graph $C(G, S)$ is a graph with vertex set the elements of $G$ and edge set $\{(x, xs) : x \in G, s \in S\}$. As an example of a Cayley graph of a non-abelian group, one can take the group $G = D_4$, the dihedral group of order 8 with elements $\{e, r, r^2, r^3, s, sr, sr^2, sr^3\}$ and generating set $S = \{r, s, r^{-1}\}$. Here $r$ denotes rotation by $\pi/2$ and $s$ is reflection, so $r^4 = 1$, $s^2 = 1$ and $sr = r^{-1}s$.

[Figure: the Cayley graph $C(D_4, \{s, r, r^3\})$.]

A.2 Expander graphs and Ramanujan graphs

An expander graph is a structurally sparse graph that has strong connectivity properties. The connectivity can be quantified in different ways, which give rise to different notions of expanders such as vertex expanders, edge expanders and spectral expanders. These notions are actually interrelated. In the following, a graph $\Gamma = (V, E)$ is a tuple consisting of a vertex set $V$ and an edge set $E$ which is a subset of $V \times V$.

A.2.1 Combinatorial expansion

Definition A.1 (vertex Cheeger constant). The infimum of the quantity $\frac{|\delta(X)|}{|X|}$, where $\delta(X)$ denotes the outer vertex boundary of $X$ (i.e., the set of vertices in $\Gamma$ which are connected to a vertex in $X$ but do not lie in $X$), as $X$ runs over all non-empty subsets of $V$ with $|X| \le \frac{|V|}{2}$, is known as the vertex Cheeger constant and is denoted by $h_V(\Gamma)$.

Definition A.2 (edge Cheeger constant). The edge boundary of a set $S$, denoted $\delta S$, is the set of edges going out from $S$ to its complement. The edge Cheeger constant of $\Gamma$, denoted by $h_E(\Gamma)$, is defined as
$$h_E(\Gamma) = \min_{\substack{S \neq \emptyset \\ |S| \le \frac{n}{2}}} \frac{|\delta S|}{D |S|},$$
where $D$ is the maximum degree of the graph $\Gamma$.

The vertex Cheeger constant $h_V(\Gamma)$ and the edge Cheeger constant $h_E(\Gamma)$ are related by the following equivalence
$$\frac{h_V(\Gamma)}{D} \le h_E(\Gamma) \le h_V(\Gamma),$$
where $D$ denotes the maximum degree of the graph (the degree of each vertex is the number of edges going out from the vertex). This allows one to speak about vertex expansion and edge expansion interchangeably. Having high combinatorial expansion means having a high Cheeger constant, a desirable property for our case.

A.2.2 Spectral expansion

Given a finite undirected graph $\Gamma$, the eigenvalues $\lambda_n \le \ldots \le \lambda_1$ of its adjacency matrix are all real and $\lambda_1 \le D$, with equality iff the graph is $D$-regular. The spectrum, i.e., the distribution of the eigenvalues, conveys a lot of information about the structure of the graph. For instance, the quantity $\lambda_1 - \lambda_2$ (also known in the literature as the one sided spectral gap) quantifies the connectivity and the combinatorial expansion of the graph via the discrete Cheeger-Buser inequality, discovered independently by (Dodziuk, 1984) and by (Alon & Milman, 1985). A graph $\Gamma = (V, E)$ is said to be a spectral expander if the quantities $\{|\lambda_1| - |\lambda_2|,\ |\lambda_1| - |\lambda_k|\}$ are both bounded away from zero, where $k = n - 1$ if the graph is bipartite and $k = n$ otherwise.

A.2.3 Discrete Cheeger-Buser inequality

The discrete Cheeger-Buser inequality, discovered independently by (Dodziuk, 1984) and by (Alon & Milman, 1985), allows one to pass from spectral expansion to combinatorial expansion.
The inequality states that
$$\frac{\alpha_2}{2} \le h_E(\Gamma) \le \sqrt{2 \alpha_2},$$
where $\alpha_2$ denotes the second smallest eigenvalue of the normalised Laplacian matrix of $\Gamma$ and is related to the eigenvalues of the adjacency matrix via
$$\frac{\lambda_i}{D} \le 1 - \alpha_i \le \frac{\lambda_i}{d}, \qquad i = 1, 2, \ldots, n,$$
where $D$ and $d$ denote the maximal and minimal degrees respectively. See (Chung, 2016) for details. From the above, it is easy to check that a high $|\lambda_1| - |\lambda_2|$ ensures a high $h_E(\Gamma)$ and vice-versa. Thus, the two notions of expansion are inter-connected, and every spectral expander remains a combinatorial expander. They are actually equivalent for some classes of graphs, for instance bipartite graphs (as the adjacency spectrum is symmetric about the origin) and variants of algebraic graphs, e.g., see (Breuillard et al., 2015; Biswas, 2019; Biswas & Saha, 2021; 2023).

A.2.4 Ramanujan graph bounds, Alon-Boppana Theorem

A $d$-regular graph is said to be a Ramanujan graph if $\max\{|\lambda_2|, |\lambda_k|\} \le 2\sqrt{d-1}$. In the case of bipartite graphs, $\lambda_n = -\lambda_1$ and $\lambda_{n-1} = -\lambda_2$, hence the previous expression reduces to $|\lambda_2| \le 2\sqrt{d-1}$. For fixed degree, with the sizes of the graphs growing larger and larger, these are the best possible expanders, as given by the Alon-Boppana bound (Alon, 1986; Nilli, 1991).

Theorem A.3 (Alon-Boppana). For every $d$-regular graph on $n$ vertices,
$$\lambda_2 \ge 2\sqrt{d-1} - o_n(1).$$
The $o_n(1)$ term is a quantity that tends to zero for every fixed $d$ as $n \to \infty$.

The bound requires a bit of technical detail. However, if one relaxes the right hand side of the above bound a bit, then one can easily show the following:
$$\lambda \ge \sqrt{d}\,(1 - o_n(1)), \qquad \text{where } \lambda = \max\{|\lambda_2|, |\lambda_n|\}.$$
The proof goes as follows. Let $A$ be the adjacency matrix of $G$; then $\mathrm{trace}(A^k)$ is the number of all walks of length $k$ in $G$ that start and end in the same vertex. In particular, all the diagonal entries in $A^2$ are $d$. Thus, $\mathrm{trace}(A^2) = nd$. On the other hand,
$$\mathrm{trace}(A^2) = \sum_i \lambda_i^2 \le d^2 + (n-1)\lambda^2.$$
Thus, $(n-1)\lambda^2 \ge dn - d^2$, which implies that
$$\lambda^2 \ge d \cdot \frac{n-d}{n-1}.$$
See (Hoory et al., 2006) for details.

A.2.5 Expanders and Ramanujan graphs from finite simple groups

The existence and construction of expanders is a deep question, and those of Ramanujan graphs are even deeper. Most of the constructions of expanders are based on Cayley graphs of finite simple groups of Lie type. The first construction is due to Margulis (Margulis, 1982). Later Lubotzky, Phillips and Sarnak (Lubotzky et al., 1988) constructed Ramanujan graphs from SL2(Fp). Till 2014 these graphs, and variants thereof by (Chiu, 1992) and (Morgenstern, 1994), were the only known constructions of Ramanujan graphs. Recent works of Marcus, Spielman and Srivastava (Marcus et al., 2015) have shown the existence of bipartite Ramanujan graphs of all degrees and sizes. Recently, there have also been new research directions on the topic of constructing expanders satisfying other desirable properties, such as a bounded diameter-by-girth ratio, for instance (Arzhantseva & Biswas, 2022). It will be interesting to see whether these special expander networks play important roles as architectures for neural networks or not.

Question A.4. Why are the LPS graphs (Lubotzky et al., 1988) (Section 4.2) and the graphs presented in Section 4.4 Ramanujan?

For the LPS graphs, the proof is quite involved; a nice exposition of the proof can be found in (Davidoff et al., 2003), Chapter 4, and also in (Sarnak, 1990), Chapter 3. For the graphs presented in Section 4.4, a direct computation of the eigenvalues using the properties of cyclic shift permutation matrices is enough. See the proof of Theorem 3 in (Burnwal et al., 2021).
A.3 Projective Linear Groups Over Finite Fields

The projective linear group $\mathrm{PGL}_n(\mathbb{F}_q)$ is a matrix group defined as the group of linear transformations on the projective space over $\mathbb{F}_q$, modulo scalar transformations. Let $\mathbb{F}_q$ be a finite field with $q$ elements and let $\mathrm{GL}_n(\mathbb{F}_q)$ denote the general linear group of $n \times n$ invertible matrices over $\mathbb{F}_q$. The projective linear group $\mathrm{PGL}_n(\mathbb{F}_q)$ is defined as
$$\mathrm{PGL}_n(\mathbb{F}_q) = \mathrm{GL}_n(\mathbb{F}_q) / Z(\mathrm{GL}_n(\mathbb{F}_q)),$$
where $Z(\mathrm{GL}_n(\mathbb{F}_q))$ is the center of $\mathrm{GL}_n(\mathbb{F}_q)$, which consists of scalar matrices.

Matrix form of $\mathrm{PGL}_2(\mathbb{F}_q)$: The projective general linear group $\mathrm{PGL}_2(\mathbb{F}_q)$ is the group of projective transformations of the projective line over $\mathbb{F}_q$. It can be expressed in terms of matrices as follows. The general linear group $\mathrm{GL}_2(\mathbb{F}_q)$ consists of all invertible $2 \times 2$ matrices over $\mathbb{F}_q$; a matrix $A \in \mathrm{GL}_2(\mathbb{F}_q)$ is given by
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$
where $a, b, c, d \in \mathbb{F}_q$ and $\det(A) = ad - bc \neq 0$.

Center of $\mathrm{GL}_2(\mathbb{F}_q)$: The center of $\mathrm{GL}_2(\mathbb{F}_q)$ consists of the scalar matrices
$$Z(\mathrm{GL}_2(\mathbb{F}_q)) = \{\lambda I \mid \lambda \in \mathbb{F}_q^{\times}\}, \qquad \lambda I = \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix},$$
where $I$ is the identity matrix and $\mathbb{F}_q^{\times}$ is the multiplicative group of the field $\mathbb{F}_q$.

The projective general linear group $\mathrm{PGL}_2(\mathbb{F}_q)$ is defined as the quotient of $\mathrm{GL}_2(\mathbb{F}_q)$ by its center:
$$\mathrm{PGL}_2(\mathbb{F}_q) = \mathrm{GL}_2(\mathbb{F}_q) / Z(\mathrm{GL}_2(\mathbb{F}_q)).$$
In matrix terms, an element of $\mathrm{PGL}_2(\mathbb{F}_q)$ can be represented by a matrix $A \in \mathrm{GL}_2(\mathbb{F}_q)$, where matrices are considered equivalent if they differ by a scalar multiple. Thus, $\mathrm{PGL}_2(\mathbb{F}_q)$ is represented by equivalence classes of matrices
$$\left\{ \begin{pmatrix} a & b \\ c & d \end{pmatrix} \ \middle|\ a, b, c, d \in \mathbb{F}_q,\ ad - bc \neq 0 \right\}$$
modulo the scalar matrices. The distinct elements of $\mathrm{PGL}_2(\mathbb{F}_2)$ are represented by the six invertible $2 \times 2$ matrices over $\mathbb{F}_2$ (the center is trivial), while $\mathrm{PGL}_2(\mathbb{F}_3)$ consists of 24 elements.

A.4 Experimental methodology

The $q$ and $l$ values used by the Ramanujan graph construction for the convolutional layers, as mentioned in Section 4.5, for the various architectures are provided in Table 7.

Table 7: Values of q and l used to generate Ramanujan graphs for the layers of VGG, AlexNet and ResNet

VGG13 (convolutional layer size, count: q, l):
  [256 × 256 × 3 × 3] (×1): q = 13, l = 177
  [512 × 256 × 3 × 3] (×1): q = 19, l = 121
  [512 × 512 × 3 × 3] (×3): q = 19, l = 242
  Conv to Linear, 2448 × 25088: q = 47, l = 533

VGG19 (convolutional layer size, count: q, l):
  [256 × 256 × 3 × 3] (×3): q = 13, l = 177
  [512 × 256 × 3 × 3] (×1): q = 19, l = 121
  [512 × 512 × 3 × 3] (×7): q = 19, l = 242
  Conv to Linear, 2448 × 25088: q = 47, l = 533

AlexNet (convolutional layer size, count: q, l):
  [384 × 256 × 3 × 3] (×1): q = 19, l = 121
  [384 × 384 × 3 × 3] (×1): q = 19, l = 181
  [256 × 384 × 3 × 3] (×1): q = 13, l = 265

ResNet18 (convolutional layer size, count: q, l):
  [64 × 64 × 3 × 3] (×4): q = 7, l = 82
  [128 × 64 × 3 × 3] (×1): q = 11, l = 52
  [128 × 128 × 3 × 3] (×3): q = 11, l = 104
  [256 × 128 × 3 × 3] (×1): q = 13, l = 88
  [256 × 256 × 3 × 3] (×3): q = 13, l = 177
  [512 × 256 × 3 × 3] (×1): q = 19, l = 121
  [512 × 512 × 3 × 3] (×3): q = 19, l = 242

Conv to Linear, 2448 × 25088: q = 47, l = 533