# glad_learning_sparse_graph_recovery__97badfce.pdf Published as a conference paper at ICLR 2020 GLAD: LEARNING SPARSE GRAPH RECOVERY Harsh Shrivastava1 Xinshi Chen2 Binghong Chen1 Guanghui Lan3 Srinivas Aluru1 Han Liu4 Le Song1,5 {1School of Computational Science & Engineering, 2School of Mathematics, 3School of Industrial and Systems Engineering }at Georgia Institute of Technology, 4Computer Science Department at Northwestern University, 5Ant Financial Services Group Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an ℓ1 regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data. 1 INTRODUCTION Recovering sparse conditional independence graphs from data is a fundamental problem in high dimensional statistics and time series analysis, and it has found applications in diverse areas. In computational biology, a sparse graph structure between gene expression data may be used to understand gene regulatory networks; in finance, a sparse graph structure between financial timeseries may be used to understand the relationship between different financial assets. A popular formulation of the problem is an ℓ1 regularization log-determinant estimation of the precision matrix. Based on this convex formulation, many algorithms have been designed to solve this problem efficiently, and one can formally prove that under a list of conditions, the solution of the optimization problem is guaranteed to recover the graph structure with high probability. However, convex optimization based approaches have their own limitations. The hyperparameters, such as the regularization parameters and learning rate, may depend on unknown constants, and need to be tuned carefully to achieve the recovery results. Furthermore, the formulation uses a single regularization parameter for all entries in the precision matrix, which may not be optimal. It is intuitive that one may obtain better recovery results by allowing the regularization parameters to vary across the entries in the precision matrix. However, such flexibility will lead to a quadratic increase in the number of hyperparameters, but it is hard for traditional approaches to search over a large number of hyperparameters. Thus, a new paradigm may be needed for designing more effective sparse recovery algorithms. Recently, there has been a surge of interest in a new paradigm of algorithm design, where algorithms are augmented with learning modules trained directly with data, rather than prescribing every step of the algorithms. 
This is meaningful because very often a family of optimization problems needs to be solved again and again, similar in structures but different in data. A data-driven algorithm may be able to leverage this distribution of problem instances, and learn an algorithm which performs better than traditional convex formulation. In our case, the sparse graph recovery problem may also need to be solved again and again, where the underlying graphs are different but have similar degree distribution, the magnitude of the precision matrix entries, etc. For instance, gene regulatory networks may be rewiring depending on the time and conditions, and we want to estimate them from gene Published as a conference paper at ICLR 2020 expression data. Company relations may evolve over time, and we want to estimate their graph from stock data. Thus, we will also explore data-driven algorithm design in this paper. Given a task (e.g. an optimization problem), an algorithm will solve it and provide a solution. Thus we can view an algorithm as a function mapping, where the input is the task-specific information (i.e. the sample covariance matrix in our case) and the output is the solution (i.e. the estimated precision matrix in our case). However, it is very challenging to design a data-driven algorithm for precision matrix estimation. First, the input and output of the problem may be large. A neural network parameterization of direct mapping from the input covariance matrix to the output precision matrix may require as many parameters as the square of the number of dimensions. Second, there are many structure constraints in the output. The resulting precision matrix needs to be positive definite and sparse, which is not easy to enforce by a simple deep learning architecture. Third, direct mapping may result in a model with lots of parameters, and hence may require lots of data to learn. Thus a data-driven algorithm needs to be designed carefully to achieve a better bias-variance trade-off and satisfy the output constraints. In this paper, we propose a deep learning model GLAD with following attributes: Uses an unrolled Alternating Minimization (AM) algorithm as an inductive bias. The regularization and the square penalty terms are parameterized as entry-wise functions of intermediate solutions, allowing GLAD to learn to perform entry-wise regularization update. Furthermore, this data-driven algorithm is trained with a collection of problem instances in a supervised fashion, by directly comparing the algorithm outputs to the ground truth graphs. In our experiments, we show that the AM architecture provides very good inductive bias, allowing the model to learn very effective sparse graph recovery algorithm with a small amount of training data. In all cases, the learned algorithm can recover sparse graph structures with much fewer data points from a new problem, and it also works well in recovering gene regulatory networks based on realistic gene expression data generators. Related works. Belilovsky et al. (2017) considers CNN based architecture that directly maps empirical covariance matrices to estimated graph structures. Previous works have parameterized optimization algorithms as recurrent neural networks or policies in reinforcement learning. For instance, Andrychowicz et al. (2016) considered directly parameterizing optimization algorithm as an RNN based framework for learning to learn. 
Li & Malik (2016) approach the problem of automating algorithm design from a reinforcement learning perspective and represent any particular optimization algorithm as a policy. Khalil et al. (2017) learn combinatorial optimization over graphs via deep Q-learning. These works did not consider the structures of our sparse graph recovery problem. Another interesting line of approach is to develop deep neural networks based on unfolding an iterative algorithm (Gregor & LeCun, 2010; Chen et al., 2018; Liu et al., 2018). Liu et al. (2018) developed ALISTA, which is based on unrolling the Iterative Shrinkage Thresholding Algorithm (ISTA). Sun et al. (2016) developed ADMM-Net for compressive sensing of MRI data. Though these seminal works were primarily developed for compressive sensing applications, they alluded to the general theme of using unrolled algorithms as inductive biases. We thus identify a suitable unrolled algorithm and leverage its inductive bias to solve the sparse graph recovery problem.

2 SPARSE GRAPH RECOVERY PROBLEM AND CONVEX FORMULATION

Given m observations of a d-dimensional multivariate Gaussian random variable X = [X_1, . . . , X_d], the sparse graph recovery problem aims to estimate its covariance matrix Σ* and precision matrix Θ* = (Σ*)^{-1}. The ij-th component of Θ* is zero if and only if X_i and X_j are conditionally independent given the other variables {X_k}_{k≠i,j}. Therefore, it is popular to impose an ℓ1 regularization for the estimation of Θ* to increase its sparsity and lead to easily interpretable models. Following Banerjee et al. (2008), the problem is formulated as the ℓ1-regularized maximum likelihood estimation

$$\widehat{\Theta} = \arg\min_{\Theta \in S^{d}_{++}} -\log(\det \Theta) + \mathrm{tr}(\widehat{\Sigma}\Theta) + \rho \|\Theta\|_{1,\mathrm{off}}, \qquad (1)$$

where $\widehat{\Sigma}$ is the empirical covariance matrix based on m samples, $S^{d}_{++}$ is the space of d × d symmetric positive definite (SPD) matrices, and $\|\Theta\|_{1,\mathrm{off}} = \sum_{i \neq j} |\Theta_{ij}|$ is the off-diagonal ℓ1 regularizer with regularization parameter ρ. This estimator is sensible even for non-Gaussian X, since it is minimizing an ℓ1-penalized log-determinant Bregman divergence (Ravikumar et al., 2011). The sparse precision matrix estimation problem in Eq. (1) is a convex optimization problem which can be solved by many algorithms. We give a few canonical and advanced examples which are compared in our experiments:

G-ISTA. G-ISTA is a proximal gradient method which updates the precision matrix iteratively

$$\Theta_{k+1} \leftarrow \eta_{\xi_k \rho}\big(\Theta_k - \xi_k(\widehat{\Sigma} - \Theta_k^{-1})\big), \quad \text{where } [\eta_{\rho}(X)]_{ij} := \mathrm{sign}(X_{ij})(|X_{ij}| - \rho)_{+}. \qquad (2)$$

The step size ξ_k is determined by line search such that Θ_{k+1} is an SPD matrix (Rolfs et al., 2012).

ADMM. Alternating direction methods of multipliers (Boyd et al., 2011) transform the problem into an equivalent constrained form, decouple the log-determinant term and the ℓ1 regularization term, and result in the following augmented Lagrangian form with a penalty parameter λ:

$$-\log(\det \Theta) + \mathrm{tr}(\widehat{\Sigma}\Theta) + \rho \|Z\|_1 + \langle \beta, \Theta - Z \rangle + \tfrac{1}{2}\lambda \|Z - \Theta\|_F^2. \qquad (3)$$

Taking U := β/λ as the scaled dual variable, the update rules for the ADMM algorithm are

$$\Theta_{k+1} \leftarrow \tfrac{1}{2}\big({-Y} + (Y^{\top}Y + \tfrac{4}{\lambda} I)^{1/2}\big), \quad \text{where } Y = \widehat{\Sigma}/\lambda - Z_k + U_k, \qquad (4)$$

$$Z_{k+1} \leftarrow \eta_{\rho/\lambda}(\Theta_{k+1} + U_k), \qquad U_{k+1} \leftarrow U_k + \Theta_{k+1} - Z_{k+1}. \qquad (5)$$

BCD. Block-coordinate descent methods (Friedman et al., 2008) update each column (and the corresponding row) of the precision matrix iteratively by solving a sequence of lasso problems. The algorithm is very efficient for large scale problems involving thousands of variables.
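As a concrete illustration of the ADMM updates in Eqs. (3)-(5), the following is a minimal NumPy sketch with a fixed penalty λ and regularization ρ; the function and variable names are illustrative, and this is not the authors' released implementation.

```python
import numpy as np

def soft_threshold(X, thr):
    """Entry-wise soft-thresholding: sign(X) * max(|X| - thr, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - thr, 0.0)

def admm_graphical_lasso(Sigma_hat, rho=0.05, lam=1.0, n_iters=100):
    d = Sigma_hat.shape[0]
    Theta = np.linalg.inv(Sigma_hat + 0.1 * np.eye(d))  # crude SPD initialization
    Z = Theta.copy()
    U = np.zeros_like(Theta)
    for _ in range(n_iters):
        # Theta-update: closed form via a matrix square root, Eq. (4)
        Y = Sigma_hat / lam - Z + U
        w, V = np.linalg.eigh(Y.T @ Y + (4.0 / lam) * np.eye(d))
        sqrt_term = V @ np.diag(np.sqrt(w)) @ V.T
        Theta = 0.5 * (-Y + sqrt_term)
        # Z- and U-updates, Eq. (5)
        Z = soft_threshold(Theta + U, rho / lam)
        U = U + Theta - Z
    return Theta, Z
```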
Apart from various algorithms, rigorous statistical analysis has also been provided for the optimal solution of the convex formulation in Eq. (1). Ravikumar et al. (2011) established consistency of the estimator Θ̂ in Eq. (1) in terms of both Frobenius and spectral norms, at a rate scaling roughly as $\|\widehat{\Theta} - \Theta^*\| = O\big(((d + s) \log d / m)^{1/2}\big)$ with high probability, where s is the number of nonzero entries in Θ*. This statistical analysis also reveals certain limitations of the convex formulation: the established consistency is based on a set of carefully chosen conditions, including a lower bound on the sample size, the sparsity level of Θ*, the degree of the graph, the magnitude of the entries in the covariance matrix, and the strength of interaction between edge and non-edge entries in the precision matrix (or mutual incoherence on the Hessian $\Gamma^* := \Sigma^* \otimes \Sigma^*$). In practice, it may be hard for a problem to satisfy these recovery conditions. Therefore, it seems that there is still room for improving the above convex optimization algorithms for recovering the true graph structure.

Prior to the data-driven paradigm for sparse recovery, since the target parameter Θ* is unknown, the best one could do for precision matrix recovery was to resort to a surrogate objective function (for instance, Eq. (1)). Optimally tuning the unknown parameter ρ is a very challenging problem in practice. Instead, we can leverage a large amount of simulation or real data and design a learning algorithm that directly optimizes the loss in Eq. (9). Furthermore, since the log-determinant estimator in Eq. (1) is NOT directly optimizing the recovery objective $\|\widehat{\Theta} - \Theta^*\|_F^2$, there is also a mismatch between the optimization objective and the final evaluation objective (refer to the first experiment in Section 5.1). This increases the hope that one may improve the results by directly optimizing the recovery objective with algorithms learned from data.

3 LEARNING DATA-DRIVEN ALGORITHM FOR GRAPH RECOVERY

In the remainder of the paper, we will present a data-driven method to learn an algorithm for precision matrix estimation, and we call the resulting algorithm GLAD (which stands for Graph recovery Learning Algorithm using Data-driven training). We ask the question: given a family of precision matrices, is it possible to improve recovery results for sparse graphs by learning a data-driven algorithm?

More formally, suppose we are given n precision matrices $\{\Theta^{*(i)}\}_{i=1}^{n}$ from a family G of graphs and m samples $\{x^{(i,j)}\}_{j=1}^{m}$ associated with each $\Theta^{*(i)}$. These samples can be used to form n sample covariance matrices $\{\widehat{\Sigma}^{(i)}\}_{i=1}^{n}$. We are interested in learning an algorithm for precision matrix estimation by solving a supervised learning problem, $\min_f \frac{1}{n}\sum_{i=1}^{n} L(\mathrm{GLAD}_f(\widehat{\Sigma}^{(i)}), \Theta^{*(i)})$, where f is the set of parameters in GLAD(·) and the output of $\mathrm{GLAD}_f(\widehat{\Sigma}^{(i)})$ is expected to be a good estimate of $\Theta^{*(i)}$ in terms of the evaluation metric L of interest. The benefit is that it can directly optimize the final evaluation metric, which is related to the desired structure or graph properties of a family of problems. However, it is a challenging task to design a good parameterization of $\mathrm{GLAD}_f$ for this graph recovery problem. We will explain the challenges below and then present our solution.

3.1 CHALLENGES IN DESIGNING LEARNING MODELS

Figure 1: A recurrent unit GLADcell.
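To make the supervised formulation above concrete, the following is a minimal sketch of how the training pairs $(\widehat{\Sigma}^{(i)}, \Theta^{*(i)})$ and the empirical risk could be assembled; `glad_f` and `loss_fn` are placeholders for any parameterized estimator and evaluation metric L, not names from the authors' code.

```python
import numpy as np

def make_training_pair(Theta_star, m, rng):
    """Draw m samples from N(0, Theta*^{-1}) and return (Sigma_hat, Theta*)."""
    d = Theta_star.shape[0]
    Sigma_star = np.linalg.inv(Theta_star)
    x = rng.multivariate_normal(np.zeros(d), Sigma_star, size=m)
    Sigma_hat = (x.T @ x) / m
    return Sigma_hat, Theta_star

def supervised_risk(glad_f, pairs, loss_fn):
    """Empirical risk (1/n) * sum_i L(GLAD_f(Sigma_hat^(i)), Theta*^(i))."""
    return np.mean([loss_fn(glad_f(S), T) for S, T in pairs])
```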
In the literature on learning data-driven algorithms, most models are designed using traditional deep learning architectures, such as fully connected DNNs or recurrent neural networks. But for graph recovery problems, directly using these architectures does not work well, for the following reasons.

First, using a fully connected neural network is not practical. Since both the input and the output of graph recovery problems are matrices, the number of parameters scales at least quadratically in d. Such a large number of parameters will need many input-output training pairs to provide a decent estimation. Thus some structure needs to be imposed in the network to reduce the number of parameters and the sample complexity. Second, structured models such as convolutional neural networks (CNNs) have been applied to learn a mapping from Σ̂ to Θ̂ (Belilovsky et al., 2017). Due to the structure of CNNs, the number of parameters can be much smaller than in fully connected networks. However, a recovered graph should be permutation invariant with respect to the matrix rows/columns, and this invariance is very hard for CNNs to learn unless there are lots of samples. Also, the structure of a CNN is a bias imposed on the model, and there is no guarantee that this structure will work. Third, the intermediate results produced by both fully connected networks and CNNs are not interpretable, making it hard to diagnose the learned procedures and to progressively output increasingly improved precision matrix estimators. Fourth, the SPD constraint is hard to impose in traditional deep learning architectures.

The above limitations do suggest a list of desiderata when designing learning models: small model size; minimalist learning; interpretable architecture; progressive improvement; and SPD output. These desiderata motivate the design of our deep architecture using unrolled algorithms.

3.2 GLAD: DEEP LEARNING MODEL BASED ON UNROLLED ALGORITHM

To take into account the above desiderata, we will use an unrolled algorithm as the template for the architecture design of GLAD. The unrolled algorithm already incorporates some problem structures, such as permutation invariance and interpretable intermediate results; but this unrolled algorithm does not traditionally have a learning component, and is typically not directly suitable for gradient-based approaches. We will leverage this inductive bias in our architecture design, augment the unrolled algorithm with suitable and flexible learning components, and then train these embedded models with stochastic gradient descent.

The GLAD model is based on a reformulation of the original optimization problem in Eq. (1) with a squared penalty term, and an alternating minimization (AM) algorithm for it. More specifically, we consider a modified optimization with a quadratic penalty parameter λ:

$$\widehat{\Theta}_\lambda, \widehat{Z}_\lambda := \arg\min_{\Theta, Z \in S^{d}_{++}} -\log(\det \Theta) + \mathrm{tr}(\widehat{\Sigma}\Theta) + \rho \|Z\|_1 + \tfrac{1}{2}\lambda \|Z - \Theta\|_F^2 \qquad (6)$$

and the alternating minimization (AM) method for solving it:

$$\Theta^{AM}_{k+1} \leftarrow \tfrac{1}{2}\big({-Y} + (Y^{\top}Y + \tfrac{4}{\lambda} I)^{1/2}\big), \quad \text{where } Y = \tfrac{1}{\lambda}\widehat{\Sigma} - Z^{AM}_k; \qquad (7)$$

$$Z^{AM}_{k+1} \leftarrow \eta_{\rho/\lambda}(\Theta^{AM}_{k+1}), \qquad (8)$$

where $\eta_{\rho/\lambda}(\theta) := \mathrm{sign}(\theta)\max(|\theta| - \rho/\lambda, 0)$. The derivation of these steps is given in Appendix A. We replace the penalty constants (ρ, λ) by problem dependent neural networks, ρnn and Λnn. These neural networks are minimalist in terms of the number of parameters: the input dimensions are merely {3, 2} for {ρnn, Λnn}, and each outputs a single value. Algorithm 1 summarizes the update equations for our unrolled AM based model, GLAD.
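For reference, a minimal NumPy sketch of the AM iterations in Eqs. (7)-(8) with fixed scalar penalties (ρ, λ) is given below; GLAD replaces these fixed scalars with the small networks ρnn and Λnn described above. The names are illustrative, not from the released code.

```python
import numpy as np

def eta(theta, thr):
    """Soft-threshold: sign(theta) * max(|theta| - thr, 0)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - thr, 0.0)

def sqrtm_psd(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ V.T

def am_step(Sigma_hat, Z, rho, lam):
    """One alternating minimization step for the objective in Eq. (6)."""
    d = Sigma_hat.shape[0]
    Y = Sigma_hat / lam - Z                                       # Eq. (7)
    Theta = 0.5 * (-Y + sqrtm_psd(Y.T @ Y + (4.0 / lam) * np.eye(d)))
    Z_new = eta(Theta, rho / lam)                                 # Eq. (8)
    return Theta, Z_new
```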
In addition to the parameters in ρnn and Λnn, the constant t for initialization is also a learnable scalar parameter. This unrolled algorithm with neural network augmentation can be viewed as a highly structured recurrent architecture, as illustrated in Figure 1.

There are many traditional algorithms for solving graph recovery problems. We choose AM as our basis because: First, empirically, we tried models built upon other algorithms including G-ISTA, ADMM, etc., but the AM-based model gives consistently better performance. Appendix C.10 & C.11 discuss the different parameterizations we tried. Second, and more importantly, the AM-based architecture has the nice property of maintaining Θ_{k+1} as an SPD matrix throughout the iterations as long as λ_k < ∞. Third, as we prove later in Section 4, the AM algorithm has a linear convergence rate, allowing us to use a fixed small number of iterations and still achieve small error margins.

3.3 TRAINING ALGORITHM

Algorithm 1: GLAD
Function GLADcell(Σ̂, Θ, Z, λ):
    λ ← Λnn(‖Z − Θ‖²_F, λ)
    Y ← λ⁻¹Σ̂ − Z
    Θ ← ½(−Y + (YᵀY + 4λ⁻¹I)^{1/2})
    For all i, j do
        ρ_ij ← ρnn(Θ_ij, Σ̂_ij, Z_ij)
        Z_ij ← η_{ρ_ij}(Θ_ij)
    return Θ, Z, λ
Function GLAD(Σ̂):
    Θ₀ ← (Σ̂ + tI)⁻¹, λ₀ ← 1
    For k = 0 to K − 1 do
        Θ_{k+1}, Z_{k+1}, λ_{k+1} ← GLADcell(Σ̂, Θ_k, Z_k, λ_k)
    return Θ_K, Z_K

To learn the parameters in the GLAD architecture, we will directly optimize the recovery objective function rather than using the log-determinant objective. A nice property of our deep learning architecture is that each iteration of our model will output a valid precision matrix estimate. This allows us to add auxiliary losses to regularize the intermediate results of our GLAD architecture, guiding it to learn parameters which can generate a smooth solution trajectory. Specifically, we will use the Frobenius norm in our experiments, and design an objective which has some resemblance to the discounted cumulative reward in reinforcement learning:

$$\min_f \; \mathrm{loss}_f := \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \gamma^{K-k}\,\big\|\Theta^{(i)}_k - \Theta^{*(i)}\big\|_F^2, \qquad (9)$$

where $(\Theta^{(i)}_k, Z^{(i)}_k, \lambda^{(i)}_k) = \mathrm{GLADcell}_f(\widehat{\Sigma}^{(i)}, \Theta^{(i)}_{k-1}, Z^{(i)}_{k-1}, \lambda^{(i)}_{k-1})$ is the output of the recurrent unit GLADcell at the k-th iteration, K is the number of unrolled iterations, and γ ≤ 1 is a discounting factor.

We will use a stochastic gradient descent algorithm to train the parameters f in the GLADcell. A key step in the gradient computation is to propagate gradients through the matrix square root in the GLADcell. To do this efficiently, we make use of the property of an SPD matrix that $X = X^{1/2}X^{1/2}$, and the product rule of derivatives, to obtain

$$dX = d(X^{1/2})\,X^{1/2} + X^{1/2}\,d(X^{1/2}). \qquad (10)$$

The above equation is a Sylvester's equation for $d(X^{1/2})$. Since the derivative dX for X is easy to obtain, the derivative $d(X^{1/2})$ can then be obtained by solving the Sylvester's equation in (10).

The objective function in Eq. (9) should be understood in a similar way as in Gregor & LeCun (2010); Belilovsky et al. (2017); Liu et al. (2018), where deep architectures are designed to directly produce the sparse outputs. For the GLAD architecture, a collection of input covariance matrix and ground truth sparse precision matrix pairs are available during training, either coming from simulated or real data. Thus the objective function in Eq. (9) is formed to directly compare the output of GLAD with the ground truth precision matrix. The goal is to train the deep architecture so that it performs well for a family/distribution of input covariance matrix and ground truth sparse precision matrix pairs.
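The following is a minimal, illustrative PyTorch sketch of a GLADcell-style step and the discounted loss in Eq. (9). The layer sizes follow the description in Appendix C.5 (3 hidden units, tanh hidden layers, sigmoid output), but the exact featurization, initialization, and training details are assumptions, not the authors' released code; in particular, the gradient through the matrix square root is obtained here by autograd through an eigendecomposition rather than the explicit Sylvester-equation solve described above.

```python
import torch
import torch.nn as nn

def sqrtm_spd(A):
    # Differentiable matrix square root via eigendecomposition of a symmetric matrix.
    w, V = torch.linalg.eigh(A)
    return V @ torch.diag(torch.clamp(w, min=0.0).sqrt()) @ V.T

class GLADCell(nn.Module):
    def __init__(self, hidden=3):
        super().__init__()
        # Lambda_nn takes (||Z - Theta||_F^2, lambda); rho_nn takes (Theta_ij, Sigma_ij, Z_ij).
        self.lambda_nn = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                       nn.Linear(hidden, 1), nn.Sigmoid())
        self.rho_nn = nn.Sequential(nn.Linear(3, hidden), nn.Tanh(),
                                    nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, Sigma_hat, Theta, Z, lam):
        d = Sigma_hat.shape[0]
        lam = self.lambda_nn(torch.stack([((Z - Theta) ** 2).sum(), lam]).view(1, 2)).squeeze()
        Y = Sigma_hat / lam - Z
        Theta_new = 0.5 * (-Y + sqrtm_spd(Y.T @ Y + (4.0 / lam) * torch.eye(d, dtype=Sigma_hat.dtype)))
        feats = torch.stack([Theta_new, Sigma_hat, Z], dim=-1)   # entry-wise features, shape (d, d, 3)
        rho = self.rho_nn(feats).squeeze(-1)
        Z_new = torch.sign(Theta_new) * torch.relu(Theta_new.abs() - rho)  # entry-wise soft-threshold
        return Theta_new, Z_new, lam

def glad_loss(cell, Sigma_hat, Theta_star, K=30, gamma=0.9, t=1.0):
    # Discounted auxiliary loss of Eq. (9) for a single (Sigma_hat, Theta*) pair.
    d = Sigma_hat.shape[0]
    Theta = torch.linalg.inv(Sigma_hat + t * torch.eye(d, dtype=Sigma_hat.dtype))
    Z, lam = Theta.clone(), torch.tensor(1.0, dtype=Sigma_hat.dtype)
    loss = 0.0
    for k in range(1, K + 1):
        Theta, Z, lam = cell(Sigma_hat, Theta, Z, lam)
        loss = loss + gamma ** (K - k) * ((Theta - Theta_star) ** 2).sum()
    return loss
```

In training, this per-pair loss would simply be averaged over a minibatch of (Σ̂, Θ*) pairs and minimized with a standard optimizer such as Adam.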
The average in the objective function is over different input covariance and precision matrix pairs, so that the learned architecture is able to perform well over a family of problem instances. Furthermore, each layer of our deep architecture outputs an intermediate prediction of the sparse precision matrix. The objective function takes into account all these intermediate outputs, weights the loss according to the layer of the deep architecture, and tries to progressively bring these intermediate layer outputs closer and closer to the target ground truth.

3.4 A NOTE ON GLAD ARCHITECTURE'S EXPRESSIVE ABILITY

We note that the designed architecture is more flexible than just learning the regularization parameters. The components in the GLAD architecture corresponding to the regularization parameters are entry-wise and are also adaptive to the input covariance matrix and the intermediate outputs. The GLAD architecture can adaptively choose a matrix of regularization parameters. This task would be very challenging if the matrix of regularization parameters were tuned manually using cross-validation. A recent theoretical work, Sun et al. (2018), also validates the choice of GLAD's design.

4 THEORETICAL ANALYSIS

Since the GLAD architecture is obtained by augmenting an unrolled optimization algorithm with learnable components, the question is what kind of guarantees can be provided for such a learned algorithm, and whether learning can bring benefits to the recovery of the precision matrix. In this section, we will first analyze the statistical guarantee of running the AM algorithm in Eq. (7) and Eq. (8) for k steps with a fixed quadratic penalty parameter λ, and then interpret its implication for the learned algorithm. First, we need some standard assumptions about the true model from the literature (Rothman et al., 2008):

Assumption 1. Let the set $S = \{(i, j) : \Theta^*_{ij} \neq 0, i \neq j\}$. Then $\mathrm{card}(S) \leq s$.

Assumption 2. $\Lambda_{\min}(\Sigma^*) \geq \epsilon_1 > 0$ (or equivalently $\Lambda_{\max}(\Theta^*) \leq 1/\epsilon_1$), $\Lambda_{\max}(\Sigma^*) \leq \epsilon_2$, and there is an upper bound $\|\widehat{\Sigma}\|_2 \leq c_{\widehat{\Sigma}}$.

Assumption 2 guarantees that Θ* exists. Assumption 1 just upper bounds the sparsity of Θ* and does not stipulate anything in particular about s. These assumptions characterize the fundamental limitation of the sparse graph recovery problem, beyond which recovery is not possible. Under these assumptions, we prove the linear convergence of the AM algorithm (proof is in Appendix B).

Theorem 1. Under Assumptions 1 & 2, if $\rho \asymp \sqrt{(\log d)/m}$, where ρ is the ℓ1 penalty, d is the dimension of the problem and m is the number of samples, the Alternating Minimization algorithm has a linear convergence rate for the optimization objective defined in (6). The k-th iteration of the AM algorithm satisfies

$$\|\Theta^{AM}_k - \Theta^*\|_F \leq C_\lambda \|\Theta^{AM}_{k-1} - \widehat{\Theta}_\lambda\|_F + O_P\!\left(\frac{\sqrt{(\log d)/m}}{\min\!\big(\tfrac{1}{\sqrt{d+s}},\, \tfrac{\sqrt{\lambda}}{d}\big)}\right),$$

where $0 < C_\lambda < 1$ is a constant depending on λ.

From the theorem, one can see that by optimizing the quadratic penalty parameter λ, one can adjust the $C_\lambda$ in the bound. We observe that at each stage k, an optimal penalty parameter λ_k can be chosen depending on the most updated value of $C_\lambda$. An adaptive sequence of penalty parameters (λ_1, . . . , λ_K) should achieve a better error bound compared to a fixed λ. Since $C_\lambda$ is a very complicated function of λ, the optimal λ_k is hard to choose manually. Besides, the linear convergence guarantee in this theorem is based on the sparsity regularization parameter $\rho \asymp \sqrt{(\log d)/m}$. However, choosing a good ρ value in practice is a tedious task, as shown in our experiments. In summary, the implications of this theorem are:
An adaptive sequence (λ_1, . . . , λ_K) should lead to an algorithm with better convergence than a fixed λ, but the sequence may not be easy to choose manually. Both ρ and the optimal λ_k depend on the corresponding error $\|\Theta^{AM} - \widehat{\Theta}_\lambda\|_F$, which makes these parameters hard to prescribe manually. Since the AM algorithm has a fast linear convergence rate, we can run it for a fixed number of iterations K and still converge with a reasonable error margin. Our learning augmented deep architecture, GLAD, can tune this sequence of λ_k and the ρ parameters jointly using gradient descent.

Moreover, we refer to a recent work by Sun et al. (2018), where they considered minimizing the graphical lasso objective with a general nonconvex penalty. They showed that by iteratively solving a sequence of adaptive convex programs one can achieve even better error margins (refer to their Algorithm 1 & Theorem 3.5). In every iteration they chose an adaptive regularization matrix based on the most recent solution and the choice of nonconvex penalty. We thus hypothesize that we can further improve our error margin if we make the penalty parameter ρ a nonconvex, problem dependent function. We choose ρ as a function depending on the most up-to-date solution (Θ_k, Σ̂, Z_k), and allow different regularizations for different entries of the precision matrix. Such flexibility potentially improves the ability of the GLAD model to recover the sparse graph.

5 EXPERIMENTS

In this section, we report several experiments to compare GLAD with traditional algorithms and other data-driven algorithms. The results validate the list of desiderata mentioned previously. Especially, they show the potential of pushing the boundary of traditional graph recovery algorithms by utilizing data. A Python implementation (tested on a P100 GPU) is available¹. Exact experimental setting details are covered in Appendix C.

Evaluation metric. We use normalized mean square error (NMSE) and probability of success (PS) to evaluate algorithm performance. NMSE is $10 \log_{10}\big(\mathbb{E}\|\Theta_p - \Theta^*\|_F^2 \,/\, \mathbb{E}\|\Theta^*\|_F^2\big)$ and PS is the probability of correct signed edge-set recovery, i.e., $\mathbb{P}\big[\mathrm{sign}(\Theta_{p,ij}) = \mathrm{sign}(\Theta^*_{ij}),\ \forall (i, j) \in E(\Theta^*)\big]$, where $E(\Theta^*)$ is the true edge set.

Notation. In all reported results, D stands for the dimension d of the random variable, M stands for the sample size and N stands for the number of graphs (precision matrices) used for training.

5.1 BENEFIT OF DATA-DRIVEN GRADIENT-BASED ALGORITHM

Figure 2: Convergence of ADMM in terms of NMSE and optimization objective, for D=100, M=100 and ρ ∈ {0.01, 0.03, 0.07, 0.10}. (Refer to Appendix C.2.)

Inconsistent optimization objective. Traditional algorithms are typically designed to optimize the ℓ1-penalized log likelihood. Since it is a convex optimization, convergence to the optimal solution is usually guaranteed. However, this optimization objective is different from the true error. Taking ADMM as an example, Figure 2 reveals that, although the optimization objective always converges, the errors of recovering the true precision matrices measured by NMSE behave very differently for different regularization parameters ρ, which indicates the necessity of directly optimizing NMSE and of hyperparameter tuning.
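The two evaluation metrics defined above can be sketched in a few lines of NumPy as follows; the function names are illustrative and the estimates are assumed to be provided as a list of matrices.

```python
import numpy as np

def nmse_db(Theta_pred, Theta_star):
    """10 * log10( E||Theta_p - Theta*||_F^2 / E||Theta*||_F^2 ) over a batch of graphs."""
    num = np.mean([np.sum((P - T) ** 2) for P, T in zip(Theta_pred, Theta_star)])
    den = np.mean([np.sum(T ** 2) for T in Theta_star])
    return 10.0 * np.log10(num / den)

def prob_success(Theta_pred, Theta_star):
    """Fraction of graphs for which every true edge is recovered with the correct sign."""
    hits = 0
    for P, T in zip(Theta_pred, Theta_star):
        edges = (T != 0) & ~np.eye(T.shape[0], dtype=bool)   # true edge set E(Theta*)
        hits += int(np.all(np.sign(P[edges]) == np.sign(T[edges])))
    return hits / len(Theta_star)
```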
Table 1: NMSE results for ADMM (rows: ℓ1 penalty ρ; columns: quadratic penalty λ).

ρ \ λ      5        1        0.5      0.1      0.01
0.01     -2.51    -2.25    -2.06    -2.06    -2.69
0.03     -5.59    -9.05    -9.48    -9.61    -9.41
0.07     -9.53    -7.58    -7.42    -7.38    -7.46
0.1      -9.38    -6.51    -6.43    -6.41    -6.50
0.2      -6.76    -4.68    -4.55    -4.47    -4.80

Expensive hyperparameter tuning. Although the hyperparameters of traditional algorithms can be tuned if the true precision matrices are provided as a validation dataset, we want to emphasize that hyperparameter tuning by grid search is a tedious and hard task. Table 1 shows that the NMSE values are very sensitive to both ρ and the quadratic penalty λ of the ADMM method. For instance, the optimal NMSE in this table is −9.61 when λ = 0.1 and ρ = 0.03. However, it will increase by a large amount to −2.06 if ρ is only changed slightly to 0.01. There are many other similar observations in this table, where slight changes in parameters can lead to significant NMSE differences, which in turn makes grid search very expensive. G-ISTA and BCD follow similar trends. For a fair comparison against GLAD, which is data-driven, in all following experiments all hyperparameters of the traditional algorithms are fine-tuned using validation datasets, for which we spent extensive effort (see more details in Appendix C.3, C.6). In contrast, the gradient-based training of GLAD turns out to be much easier.

5.2 CONVERGENCE

Figure 3: Minimalist neural network architectures designed for the GLAD experiments in Sections 5.2, 5.3, 5.4, 5.5. Refer to Appendix C.5 for further details about the architecture parameters.

We follow the experimental settings in (Rolfs et al., 2012; Mazumder & Agarwal, 2011; Lu, 2010) to generate data and perform synthetic experiments on multivariate Gaussians. Each off-diagonal entry of the precision matrix is drawn from a uniform distribution, i.e., Θ*_ij ~ U(−1, 1), and then set to zero with probability p = 1 − s, where s is the sparsity level. Finally, an appropriate multiple of the identity matrix is added to the current matrix, so that the resulting matrix has its smallest eigenvalue equal to 1 (refer to Appendix C.1). We use 30 unrolled steps for GLAD (Figure 3) and compare it to G-ISTA, ADMM and BCD. All algorithms are trained/finetuned using 10 randomly generated graphs and tested over 100 graphs.

Table 2: Runtime (ms per iteration).

Time/itr    D=25     D=100
ADMM        1.45     16.45
G-ISTA      37.51    41.47
GLAD        2.81     20.23

Convergence results and the average runtime of different algorithms on Nvidia's P100 GPUs are shown in Figure 4 and Table 2, respectively. GLAD consistently converges faster and gives lower NMSE. Although the fine-tuned G-ISTA also has decent performance, its computation time per iteration is much longer than GLAD's because it requires line search steps. Besides, we can also see a progressive improvement of GLAD across its iterations.

Figure 4: GLAD vs traditional methods (NMSE vs iterations for D = 100 and M = 20, 100, 500). Left 3 plots: sparsity level is fixed at s = 0.1. Right 3 plots: sparsity level of each graph is randomly sampled as s ~ U(0.05, 0.15). Results are averaged over 100 test graphs where each graph is estimated 10 times using 10 different sample batches of size M. Standard error is plotted but not visible. Intermediate steps of BCD are not evaluated because we use the sklearn package (Pedregosa et al., 2011) and can only access the final output. Appendix C.4, C.5 explain the experiment setup and the GLAD architecture.

¹Code: https://drive.google.com/open?id=16POE4TMp7UUieLcLqRzSTqzkVHm2stlM
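A minimal sketch of the synthetic generation procedure described above (and detailed in Appendix C.1) is given below: random off-diagonal entries, Erdos-Renyi style sparsification, an identity shift so the smallest eigenvalue is 1, and then m Gaussian samples. It is an illustrative re-implementation under these stated assumptions, not the authors' data generator.

```python
import numpy as np

def sample_precision_matrix(d, sparsity, rng):
    """Sparse SPD precision matrix with smallest eigenvalue equal to 1."""
    Theta = rng.uniform(-1.0, 1.0, size=(d, d))
    mask = rng.uniform(size=(d, d)) < sparsity           # keep an off-diagonal entry with prob. s
    Theta = np.triu(Theta * mask, k=1)
    Theta = Theta + Theta.T                              # symmetric, zero diagonal
    shift = 1.0 - np.linalg.eigvalsh(Theta).min()        # identity multiple so lambda_min = 1
    return Theta + shift * np.eye(d)

def sample_covariance(Theta, m, rng):
    """Empirical covariance from m i.i.d. samples of N(0, Theta^{-1})."""
    x = rng.multivariate_normal(np.zeros(Theta.shape[0]), np.linalg.inv(Theta), size=m)
    return (x.T @ x) / m

rng = np.random.default_rng(0)
Theta_star = sample_precision_matrix(d=100, sparsity=0.1, rng=rng)
Sigma_hat = sample_covariance(Theta_star, m=100, rng=rng)
```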
5.3 RECOVERY PROBABILITY

Figure 5: Sample complexity for model selection consistency (probability of success vs. number of samples on a log scale, for D = 36 and D = 100, on grid and random graphs).

As analyzed by Ravikumar et al. (2011), the recovery guarantee (such as in terms of Frobenius norm) of the ℓ1-regularized log-determinant optimization significantly depends on the sample size and other conditions. Our GLAD directly optimizes the recovery objective based on data, and it has the potential of pushing the sample complexity limit. We experimented with this and found the results positive. We follow Ravikumar et al. (2011) to conduct experiments on GRID graphs, which satisfy the conditions required in (Ravikumar et al., 2011). Furthermore, we conduct a more challenging task of recovering restricted but randomly constructed graphs (see Appendix C.7 for more details). The probability of success (PS) is non-zero only if the algorithm recovers all the edges with correct signs; it is plotted in Figure 5. GLAD consistently outperforms traditional methods in terms of sample complexity, as it recovers the true edges with considerably fewer samples.

5.4 DATA EFFICIENCY

Having a good inductive bias makes GLAD's architecture quite data-efficient compared to other deep learning models. For instance, the state-of-the-art DeepGraph by Belilovsky et al. (2017) is based on CNNs. It contains orders of magnitude more parameters than GLAD. Furthermore, it takes roughly 100,000 samples, and several hours, to train their DG-39 model. In contrast, GLAD learns well with less than 25 parameters, within 100 training samples, and notably less training time. Table 3 also shows that GLAD significantly outperforms the DG-39 model in terms of AUC (area under the ROC curve) by just using 100 training graphs, typically the case for real world settings. Fully connected DL models are unable to learn from such small data and hence are skipped in the comparison.

Table 3: AUC on 100 test graphs for D=39. For experiment settings, refer to Table 1 of Belilovsky et al. (2017). Gaussian random graphs with sparsity p = 0.05 were chosen and edge values sampled from U(−1, 1). (Refer to Appendix C.8.)

Methods    M=15             M=35             M=100
BCD        0.578 ± 0.006    0.639 ± 0.007    0.704 ± 0.006
DG-39      0.664 ± 0.008    0.738 ± 0.006    0.759 ± 0.006
DG-39+P    0.672 ± 0.008    0.740 ± 0.007    0.771 ± 0.006
GLAD       0.788 ± 0.003    0.811 ± 0.003    0.878 ± 0.003

5.5 GENE REGULATION DATA

SynTReN (Van den Bulcke et al., 2006) is a synthetic gene expression data generator specifically designed for analyzing sparse graph recovery algorithms. It models different types of biological interactions and produces biologically plausible synthetic gene expression data. Figure 6 shows that GLAD performs favourably for structure recovery in terms of NMSE on the gene expression data. As the governing equations of the underlying distribution of SynTReN are unknown, these experiments also emphasize the ability of GLAD to handle non-Gaussian data.
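For completeness, the edge-recovery metrics reported in Table 3 and in the following gene-network results (AUC over off-diagonal entries, plus FDR/TPR/FPR for a thresholded estimate) can be computed as sketched below; this assumes scikit-learn's `roc_auc_score` and uses |Θ̂_ij| as the edge score, which is one reasonable convention rather than the authors' exact evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def edge_metrics(Theta_pred, Theta_star, eps=1e-8):
    off = ~np.eye(Theta_star.shape[0], dtype=bool)
    y_true = (np.abs(Theta_star[off]) > eps).astype(int)   # true edge indicators
    y_score = np.abs(Theta_pred[off])                      # edge scores from the estimate
    y_hat = (y_score > eps).astype(int)                    # thresholded edge prediction
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return {
        "auc": roc_auc_score(y_true, y_score),
        "fdr": fp / max(fp + tp, 1),
        "tpr": tp / max(tp + fn, 1),
        "fpr": fp / max(fp + tn, 1),
    }
```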
Figure 7 visualizes the edge-recovery performance of GLAD models trained on a sub-network of true Ecoli bacteria data. We denote, TPR: True Positive Rate, FPR: False Positive Rate, FDR: False Discovery Rate. The number of simulated training/validation graphs were set to 20/20. One batch of M samples were taken per graph (details in Appendix C.9). Although, GLAD was trained on graphs with D = 25, it was able to robustly recover a higher dimensional graph D = 43 structure. Appendix C.12 contains details of the experiments done on real E.Coli data. The GLAD model was trained using the Syn TRe N simulator. Appendix C.13 explains our proposed approach to scale for larger problem sizes. Figure 6: Performance on the Syn TRe N generated gene expression data with graph as Erdos-renyi having sparsity p = 0.05. Refer appendix(C.9) for experiment details. 0 10 20 30 40 50 Iterations D = 25 and M = 10 ADMM G-ISTA GLAD 0 10 20 30 40 50 Iterations D = 25 and M = 25 ADMM G-ISTA GLAD 0 10 20 30 40 50 Iterations D = 25 and M = 100 ADMM G-ISTA GLAD Figure 7: Recovered graph structures for a sub-network of the E. coli consisting of 43 genes and 30 interactions with increasing samples. Increasing the samples reduces the fdr by discovering more true edges. nlp D_rpo S rpi R_als BAC ecf I fnr rob tna LAB fdh F rpo E_rse ABC gln ALG nar GHJI (a) True graph nlp D_rpo S rpi R_als BAC ecf I fnr rob tna LAB fdh F rpo E_rse ABC gln ALG nar GHJI (b) M=10, fdr=0.613, tpr=0.913, fpr=0.114 nlp D_rpo S rpi R_als BAC ecf I fnr rob tna LAB fdh F rpo E_rse ABC gln ALG nar GHJI (c) M=100, fdr=0.236, tpr=0.986, fpr=0.024 6 CONCLUSION & FUTURE WORK We presented a novel neural network, GLAD, for the sparse graph recovery problem based on an unrolled Alternating Minimization algorithm. We theoretically prove the linear convergence of AM algorithm as well as empirically show that learning can further improve the sparse graph recovery. The learned GLAD model is able to push the sample complexity limits thereby highlighting the potential of using algorithms as inductive biases for deep learning architectures. Further development of theory is needed to fully understand and realize the potential of this new direction. ACKNOWLEDGEMENT We thank our colleague Haoran Sun for his helpful comments. This research was supported in part through research cyberinfrastructure resources and services provided by the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology, Atlanta, Georgia, USA (PACE, 2017). This research was also partly supported by XSEDE Campus Champion Grant GEO150002. Published as a conference paper at ICLR 2020 Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981 3989, 2016. Onureena Banerjee, Laurent El Ghaoui, and Alexandre d Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine learning research, 9(Mar):485 516, 2008. Eugene Belilovsky, Kyle Kastner, Gaël Varoquaux, and Matthew B Blaschko. Learning to discover sparse graphical models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 440 448. JMLR. org, 2017. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. 
Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends R in Machine learning, 3(1):1 122, 2011. Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pp. 9061 9071, 2018. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008. Karol Gregor and Yann Le Cun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 399 406. Omnipress, 2010. Ravindran Kannan and Santosh Vempala. Randomized algorithms in numerical linear algebra. Acta Numerica, 26:95 135, 2017. Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348 6358, 2017. Ke Li and Jitendra Malik. Learning to optimize. ar Xiv preprint ar Xiv:1606.01885, 2016. Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Alista: Analytic weights are as good as learned weights in lista. 2018. Zhaosong Lu. Adaptive first-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications, 31(4):2000 2016, 2010. Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Andrej Aderhold, Richard Bonneau, Yukun Chen, et al. Wisdom of crowds for robust gene network inference. Nature methods, 9(8):796, 2012. Rahul Mazumder and Deepak K Agarwal. A flexible, scalable and efficient algorithmic framework for primal graphical lasso. ar Xiv preprint ar Xiv:1110.5508, 2011. PACE. Partnership for an Advanced Computing Environment (PACE), 2017. URL http://www. pace.gatech.edu. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, Bin Yu, et al. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935 980, 2011. Published as a conference paper at ICLR 2020 Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, and Arian Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pp. 1574 1582, 2012. Adam J Rothman, Peter J Bickel, Elizaveta Levina, Ji Zhu, et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494 515, 2008. Jian Sun, Huibin Li, Zongben Xu, et al. Deep admm-net for compressive sensing mri. In Advances in neural information processing systems, pp. 10 18, 2016. Qiang Sun, Kean Ming Tan, Han Liu, and Tong Zhang. Graphical nonconvex optimization via an adaptive convex relaxation. In International Conference on Machine Learning, pp. 4817 4824, 2018. Tim Van den Bulcke, Koenraad Van Leemput, Bart Naudts, Piet van Remortel, Hongwu Ma, Alain Verschoren, Bart De Moor, and Kathleen Marchal. Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. 
BMC bioinformatics, 7 (1):43, 2006. Published as a conference paper at ICLR 2020 A DERIVATION OF ALTERNATING MINIMIZATION STEPS Given the optimization problem bΘλ, b Zλ := arg minΘ,Z Sd ++ log(det Θ) + tr(bΣΘ) + ρ Z 1 + 1 2λ Z Θ 2 F , (12) Alternating Minimization is performing ΘAM k+1 arg min Θ Sd ++ log(det Θ) + tr(bΣΘ) + 1 2λ ZAM k Θ 2 F (13) ZAM k+1 arg min Z Sd ++ tr(bΣΘAM k+1) + ρ Z 1 + 1 2λ Z ΘAM k+1 2 F . (14) Taking the gradient of the objective function with respect to Θ to be zero, we have Θ 1 + bΣ + λ(Θ Z) = 0. (15) Taking the gradient of the objective function with respect to Z to be zero, we have ρ ℓ1(Z) + λ(Z Θ) = 0, (16) where 1 Zij > 0, 1 Zij < 0, [ 1, 1] Zij = 0. (17) Solving the above two equations, we obtain: λI = arg min Θ Sd ++ log(det Θ) + tr(bΣΘ) + 1 2λ Z Θ 2 F , (18) where Y = 1 λ bΣ Z, (19) ηρ/λ(Θ) = arg min Z Sd ++ tr(bΣΘ) + ρ Z 1 + 1 2λ Z Θ 2 F . (20) B LINEAR CONVERGENCE RATE ANALYSIS Proof of Theorem Theorem 1. Under the assumptions 1 & 2, if ρ q m , where ρ is the l1 penalty, d is the dimension of problem and m is the number of samples, the Alternate Minimization algorithm has linear convergence rate for optimization objective defined in (6). The kth iteration of the AM algorithm satisfies, ΘAM k Θ F Cλ ΘAM k 1 bΘλ F + OP (log d)/m min( 1 (d+s), λ where 0 < Cλ < 1 is a constant depending on λ. Published as a conference paper at ICLR 2020 We will reuse the following notations in the appendix: bΣm : sample covariance matrix based on m samples, (21) G(Θ; ρ) := log(det Θ) + tr(ˆΣmΘ) + ρ Θ 1,off , (22) bΘG := arg minΘ Sd ++ G(Θ; ρ), (23) f(Θ, Z; ρ, λ) := log(det Θ) + tr(bΣΘ) + ρ Z 1 + 1 2λ Z Θ 2 F , (24) bΘλ, b Zλ := arg min θ,Z Sd ++ f(Θ, Z; ρ, λ), (25) f (ρ, λ) := min θ,Z Sd ++ f(Θ, Z; ρ, λ) = f(bΘλ, b Zλ; ρ, λ), (26) ηρ/λ(θ) :=sign(θ) max(|θ| ρ/λ, 0). (27) The update rules for Alternating Minimization are: λI , where Y = 1 λ bΣ ZAM k ; (28) ZAM k+1 ηρ/λ(ΘAM k+1), (29) Assumptions: With reference to the theory developed in Rothman et al. (2008), we make the following assumptions about the true model. (OP( ) is used to denote bounded in probability.) Assumption 1. Let the set S = {(i, j) : Θ ij = 0, i = j}. Then card(S) s. Assumption 2. Λmin(Σ ) ϵ1 > 0 (or equivalently Λmax(Θ ) 1/ϵ1), Λmax(Σ ) ϵ2 and an upper bound on bΣ 2 cbΣ. We now proceed towards the proof: Lemma 2. For any x, y, k R, k > 0, x = y, (x y)2 1 1 q k + 1) . (30) (x y)2 = (x2 + k) + (y2 + k) 2 y2 + k (x y)2 (31) = (x y)2 2( y2 + k (xy + k)) (x y)2 (32) y2 + k (xy + k))( y2 + k + (xy + k)) y2 + k + (xy + k)) (33) = 1 2 k(x y)2 y2 + k + (xy + k)) (34) y2 + k + (xy + k) (35) k + 1) (36) Lemma 3. For any X, Y Sd, λ > 0, A(Y ) = q λI satisfies the following inequality, A(X) A(Y ) F αλ X Y F , (37) where 0 < αλ = 1 1 4 Λmax(X)2 + 1) 1/2( λ 4 Λmax(Y )2 + 1) 1/2 < 1, Λmax(X) is the largest eigenvalue of X in absolute value. Published as a conference paper at ICLR 2020 Proof. First we factorize X using eigen decomposition, X = Q XDXQX, where QX and DX are orthogonal matrix and diagonal matrix, respectively. Then we have, Q XD2 XQX + 4 Q X(D2 X + 4 λI)QX = Q X Similarly, the above equation holds for Y . Therefore, A(X) A(Y ) F = where we define Q := QY Q X. Similarly, we have, X Y F = Q XDXQX Q Y DY QY F (43) = DX QXQ Y DY QY Q X F (44) = DX Q DY Q F . (45) Then the i-th entry on the diagonal of Q DY Q is Pd j=1 DY jj Q2 ji. 
Using the fact that DX and DY are diagonal, we have, X Y 2 F = DX Q DY Q 2 j=1 DY jj Q2 ji)2 + j=1 DY jj Q2 ji)2 (47) i=1 DXii(DXii 2 j=1 DY jj Q2 ji) (48) i=1 (D2 Xii + D2 Y ii) 2 j=1 DXii DY jj Q2 ji (49) j=1 Q2 ji(DXii DY jj)2. (50) The last step makes use of Pd i=1 Q2 ji = 1 and Pd j=1 Q2 ji = 1. Similarly, using (42), we have, A(X) A(Y ) 2 F = D2 Y jj + 4 Assuming X Y F > 0 (otherwise (37) trivially holds), using (52) and (50), we have, A(X) A(Y ) 2 F X Y 2 F = Pd i=1 Pd j=1 Q2 ji( q D2 Y jj + 4 Pd i=1 Pd j=1 Q2 ji(DXii DY jj)2 (53) max i,j=1, ,d, DXii =DY jj D2 Y jj + 4 (DXii DY jj)2 (54) Published as a conference paper at ICLR 2020 Using lemma (2), we have, A(X) A(Y ) 2 F X Y 2 F max i,j=1, ,d, DXii =DY jj D2 Y jj + 4 (DXii DY jj)2 (55) max i,j=1, ,d, DXii =DY jj 1 1 r 4 λ + 1)( D2 Y jj 4 λ + 1) (56) 4 maxi D2 Xii + 1)( λ 4 maxj D2 Y jj + 1) (57) 4 Λmax(X)2 + 1) 1/2(λ 4 Λmax(Y )2 + 1) 1/2. (58) A(X) A(Y ) F 4 Λmax(X)2 + 1) 1/2(λ 4 Λmax(Y )2 + 1) 1/2 (59) 4 Λmax(X)2 + 1) 1/2(λ 4 Λmax(Y )2 + 1) 1/2. (60) Lemma 4. Under assumption (2), the output of the k-th and (k + 1)-th AM step ZAM k , ZAM k+1 and ΘAM k+1 satisfy the following inequality, ZAM k+1 b Zλ F ΘAM k+1 bΘλ F Cλ ZAM k b Zλ F , (61) where 0 < Cλ < 1 is a constant depending on λ. Proof. The first part is easy to show, if we observe that in the second update step of AM (8), ηρ/λ is a contraction under metric d(X, Y ) = X Y F . Therefore we have, ZAM k+1 b Zλ F = ηρ/λ(ΘAM k+1) ηρ/λ(bΘλ) F ΘAM k+1 bΘλ F . (62) Next we will prove the second part. To simplify notation, we let A(X) = q λI. Using the first update step of AM (7), we have, ΘAM k+1 bΘλ F = Y k+1Yk+1 + 4 (Yk+1 Yλ) + ( Y k+1Yk+1 + 4 2 (Yk+1 Yλ) + (A(Yk+1) A(Yλ)) F (65) 2 Yk+1 Yλ F + 1 2 A(Yk+1) A(Yλ) F , (66) where Yk+1 = 1 λ bΣ ZAM k and Yλ = 1 λ bΣ b Zλ. The last derivation step makes use of the triangle inequality. Using lemma (3), we have, ΘAM k+1 bΘλ F 1 2 Yk+1 Yλ F + 1 2αλ Yk+1 Yλ F . (67) Therefore ΘAM k+1 bΘλ F Cλ Yk+1 Yλ F = Cλ ZAM k b Zλ F , (68) 4 Λmax(Yk+1)2 + 1) 1/2(λ 4 Λmax(Yλ)2 + 1) 1/2 (69) = 1 (λΛmax(Yk+1)2 + 4) 1/2(λΛmax(Yλ)2 + 4) 1/2 1, (70) Published as a conference paper at ICLR 2020 Λmax(X) is the largest eigenvalue of X in absolute value. The rest is to show that both Λmax(Yλ) and Λmax(Yk+1) are bounded using assumption (2). For Λmax(Yk+1), we have, Λmax(Yk+1) = Yk+1 2 = 1 λ bΣ ZAM k 1 λ bΣ 2 + ZAM k 2 (72) λcbΣ + ZAM k b Zλ F + b Zλ F . (73) Combining (62) and (68), we have, ZAM k+1 b Zλ F ΘAM k+1 bΘλ F Cλ ZAM k b Zλ F . (74) Therefore, ZAM k b Zλ F ZAM k 1 b Zλ F ZAM 0 b Zλ F . (75) Continuing with (73), we have, Λmax(Yk+1) 1 λcbΣ + ZAM k b Zλ F + b Zλ F (76) λcbΣ + ZAM 0 b Zλ F + b Zλ F (77) λcbΣ + ZAM 0 F + 2 b Zλ F . (78) Since b Zλ is the minimizer of a strongly convex function, its norm is bounded. And we also have ZAM 0 F bounded in Algorithm (1), so Λmax(Yk+1) is bounded above whenever λ < . For Λmax(Yλ), we have, Λmax(Yλ) = 1 λ bΣ b Zλ 1 λ bΣ 2 + b Zλ 2 (80) λcbΣ + b Zλ 2 . (81) Therefore both Λmax(Yλ) and Λmax(Yk+1) are bounded in (70), i.e. 0 < Cλ < 1 is a constant only depending on λ. Theorem 1. Under the assumptions 1 & 2, if ρ q m , where ρ is the l1 penalty, d is the dimension of problem and m is the number of samples, the Alternate Minimization algorithm has linear convergence rate for optimization objective defined in (6). The kth iteration of the AM algorithm satisfies, ΘAM k Θ F Cλ ΘAM k 1 bΘλ F + OP (log d)/m min( 1 (d+s), λ where 0 < Cλ < 1 is a constant depending on λ. Proof. 
(1) Error between bΘλ and bΘG Combining the following two equations: f(bΘλ, b Zλ; ρ, λ) = min Θ,Z f(Θ, Z; ρ, λ) min Θ f(Θ, Z = Θ; ρ, λ) = min Θ G(Θ; ρ) = G(bΘG; ρ), f(Θ, Z; ρ, λ) = G(Θ; ρ) + ρ( Z 1 Θ 1) + 1 2λ Z Θ 2 F , 0 G(bΘλ; ρ) G(bΘG; ρ) ρ( bΘλ 1 b Zλ 1). Published as a conference paper at ICLR 2020 Note that by the optimality condition, zf(bΘλ, b Zλ, ρ, λ) = 0, we have the fixed point equation b Zλ = ηρ/λ(bΘλ). Therefore, bΘλ 1 b Zλ 1 ρd2 λ and we have: 0 G(bΘλ; ρ) G(bΘG; ρ) ρ2d2 Since G is σG-strongly convex, where σG is independent of the sample covariance matrix c Σ as the hessian of G is independent of bΣ . σG F + G(bΘG; ρ), bΘλ bΘG G(bΘλ; ρ) G(bΘG; ρ). (83) Therefore, bΘλ bΘG F Proof. (2) Error between bΘG and Θ Corollary 5 (Theorem 1. of Rothman et al. (2008)). Let bΘG be the minimizer for the optimization objective G(Θ; ρ). Under Assumptions 1 & 2, if ρ q bΘG Θ F = OP (d + s) log d (3) Error between ΘAM k and Θ Under the conditions in Corollary 5, we use triangle inequality to combine the above results with Corollary 5 and Lemma 4. ΘAM k Θ F ΘAM k bΘλ F + bΘλ bΘG F + bΘG Θ F (86) Cλ ΘAM k 1 bΘλ F + O (d + s) log d Cλ ΘAM k 1 bΘλ F + OP (d + s) log d m. min(1, (d+s)λ Cλ ΘAM k 1 bΘλ F + OP (log d)/m min( 1 (d+s), λ C EXPERIMENTAL DETAILS This section contains the detailed settings used in the experimental evaluation section. C.1 SYNTHETIC DATASET GENERATION For sections 5.1 and 5.2, the synthetic data was generated based on the procedure described in Rolfs et al. (2012). A d dimensional precision matrix Θ was generated by initializing a d d matrix with its off-diagonal entries sampled i.i.d. from a uniform distribution Θij U( 1, 1). These entries were then set to zero based on the sparsity pattern of the corresponding Erdos-Renyi random graph with a certain probability p. Finally, an appropriate multiple of the identity matrix was added to the current matrix, so that the resulting matrix had the smallest eigenvalue as 1. In this way, Θ was ensured to be Published as a conference paper at ICLR 2020 0 20 40 Iterations D = 100 and M = 20 conv_loss_L10 conv_loss_L20 conv_loss_L30 conv_loss_L50 0 20 40 Iterations D = 100 and M = 100 conv_loss_L10 conv_loss_L20 conv_loss_L30 conv_loss_L50 0 20 40 Iterations D = 100 and M = 500 conv_loss_L10 conv_loss_L20 conv_loss_L30 conv_loss_L50 Figure 8: Varying the number of unrolled iterations. The results are averaged over 1000 test graphs. The L variable is the number of unrolled iterations. We observe that the higher number of unrolled iterations better is the performance. a well-conditioned, sparse and positive definite matrix. This matrix was then used in the multivariate Gaussian distribution N(0, Θ 1), to obtain M i.i.d samples. C.2 EXPERIMENT DETAILS: BENEFIT OF DATA-DRIVEN GRADIENT-BASED ALGORITHM Figure(2): The plots are for the ADMM method on the Erdos-Renyi graphs (fixed sparsity p = 0.1) with dimension D = 100 and number of samples M = 100. The results are averaged over 100 test graphs with 10 sample batches per graph. The std-err = σ/ 1000 is shown. Refer appendix(C.1) for more details on data generation process. C.3 EXPERIMENT DETAILS: EXPENSIVE HYPERPARAMETERS TUNING Table(1) shows the final NMSE values for the ADMM method on the random graph (fixed sparsity p = 0.1) with dimension D = 100 and number of samples M = 100. We fixed the initialization parameter of Θ0 as t = 0.1 and chose appropriate update rate α for λ . It is important to note that the NMSE values are very sensitive to the choice of t as well. 
These parameter values changed substantially for a new problem setting. Refer appendix(C.1) for more details on data generation process. C.4 EXPERIMENT DETAILS: CONVERGENCE ON SYNTHETIC DATASETS Figure(4) experiment details: Figure(4) shows the NMSE comparison plots for fixed sparsity and mixed sparsity synthetic Erdos-renyi graphs. The dimension was fixed to D = 100 and the number of samples vary as M = [20, 100, 500]. The top row has the sparsity probability p = 0.5 for the Erdos-Renyi random graph, whereas for the bottom row plots, the sparsity probabilities are uniformly sampled from U(0.05, 0.15). For finetuning the traditional algorithms, a validation dataset of 10 graphs was used. For the GLAD algorithm, 10 training graphs were randomly chosen and the same validation set was used. C.5 GLAD: ARCHITECTURE DETAILS FOR SECTION(5.2) GLAD parameter settings: ρnn was a 4 layer neural network and Λnn was a 2 layer neural network. Both used 3 hidden units in each layer. The non-linearity used for hidden layers was tanh, while the final layer had sigmoid (σ) as the non-linearity for both, ρnn and Λnn (refer Figure 3). The learnable offset parameter of initial Θ0 was set to t = 1. It was unrolled for L = 30 iterations. The learning rates were chosen to be around [0.01, 0.1] and multi-step LR scheduler was used. The optimizer used was adam . The best nmse model was selected based on the validation data performance. Figure(8) explores the performance of GLAD on using varying number of unrolled iterations L. C.6 ADDITIONAL NOTE OF HYPER-PARAMETER FINETUNING FOR TRADITIONAL METHODS Figure(1) shows the average NMSE values over 100 test graphs obtained by the ADMM algorithm on the synthetic data for dimension D = 100 and M = 100 samples as we vary the values of penalty parameter ρ and lagrangian parameter λ. The offset parameter for Θ0 was set to t = 0.1. The NMSE Published as a conference paper at ICLR 2020 Number of samples (log scale) Probability of success Vary rho; Graph = random gista_rho=0.02 gista_rho=0.03 gista_rho=0.04 Number of samples (log scale) Probability of success Vary rho; Graph = random block_rho=0.015 block_rho=0.018 block_rho=0.02 block_rho=0.03 block_rho=0.04 block_rho=0.05 Number of samples (log scale) Probability of success Vary rho; Graph = random admm_rho=0.01 admm_rho=0.02 admm_rho=0.025 admm_rho=0.03 admm_rho=0.04 Figure 9: We attempt to illustrate how the traditional methods are very sensitive to the hyperparameters and it is a tedious exercise to finetune them. The problem setting is same as described in section(5.3). For all the 3 methods shown above, we have already tuned the algorithm specific parameters to a reasonable setting. Now, we vary the L1 penalty term ρ and can observe that how sensitive the probability of success is with even slight change of ρ values. values are very sensitive to the choice of t as well. These parameter values changes substantially for a new problem setting. G-ISTA and BCD follow similar trends. Additional plots highlighting the hyperparameter sensitivity of the traditional methods for model selection consistency experiments. Refer figure(9). C.7 TOLERANCE OF NOISE: EXPERIMENT DETAILS Details for experiments in figure(5). Two different graph types were chosen for this experiment which were inspired from Ravikumar et al. (2011). In the grid graph setting, the edge weight for different precision matrices were uniformly sampled from w U(0.12, 0.25). The edges within a graph carried equal weights. 
The other setting was more general, where the graph was a random Erdos-Renyi graph with probability of an edge was p = 0.05. The off-diagonal entries of the precision matrix were sampled uniformly from U[0.1, 0.4]. The parameter settings for GLAD were the same as described in Appendix C.5. The model with the best PS performance on the validation dataset was selected. train/valid/test=10/10/100 graphs were used with 10 sample batches per graph. C.8 GLAD: COMPARISON WITH OTHER DEEP LEARNING BASED METHODS Table(3) shows AUC (with std-err) comparisons with the Deep Graph model. For experiment settings, refer Table 1 of Belilovsky et al. (2017). Gaussian Random graphs with sparsity p = 0.05 were chosen and edge values sampled from U( 1, 1). GLAD was trained on only 10 graphs with 5 sample batches per graph. The dimension of the problem is D = 39. The architecture parameter choices of GLAD were the same as described in Appendix C.5 and it performs consistently better along all the settings by a significant AUC margin. C.9 SYNTREN GENE EXPRESSION SIMULATOR DETAILS The Syn TRe N Van den Bulcke et al. (2006) is a synthetic gene expression data generator specifically designed for analyzing the structure learning algorithms. The topological characteristics of the synthetically generated networks closely resemble the characteristics of real transcriptional networks. The generator models different types of biological interactions and produces biologically plausible synthetic gene expression data enabling the development of data-driven approaches to recover the underlying network. The Syn TRe N simulator details for section(5.5). For performance evaluation, a connected Erdos Renyi graph was generated with probability as p = 0.05. The precision matrix entries were sampled from Θij U(0.1, 0.2) and the minimum eigenvalue was adjusted to 1 by adding an appropriate multiple of identity matrix. The Syn TRe N simulator then generated samples from these graphs by incorporating biological noises, correlation noises and other input noises. All these noise levels were sampled uniformly from U(0.01, 0.1). The figure(6) shows the NMSE comparisons for a fixed dimension D = 25 and varying number of samples M = [10, 25, 100]. The number of training/validation graphs were set to 20/20 and the results are reported on 100 test graphs. In these experiments, only 1 batch of M samples were taken per graph to better mimic the real world setting. Published as a conference paper at ICLR 2020 Figure(7) visualizes the edge-recovery performance of the above trained GLAD models on a subnetwork of true Ecoli bacteria data.which contains 30 edges and D = 43 nodes. The Ecoli subnetwork graph was fed to the Syn TRe N simulator and M samples were obtained. Syn TRe N s noise levels were set to 0.05 and the precision matrix edge values were set to w = 0.15. For the GLAD models, the training was done on the same settings as the gene-data NMSE plots with D = 25 and on corresponding number of samples M. C.10 COMPARISON WITH ADMM OPTIMIZATION BASED UNROLLED ALGORITHM In order to find the best unrolled architecture for sparse graph recovery, we considered many different optimization techniques and came up with their equivalent unrolled neural network based deep model. In this section, we compare with the closest unrolled deep model based on ADMM optimization, (ADMMu), and analyze how it compares to GLAD. Appendix C.11 lists down further such techniques for future exploration. 
C.10 COMPARISON WITH ADMM OPTIMIZATION BASED UNROLLED ALGORITHM

In order to find the best unrolled architecture for sparse graph recovery, we considered many different optimization techniques and derived their equivalent unrolled neural-network-based deep models. In this section, we compare with the closest unrolled deep model, which is based on ADMM optimization (ADMMu), and analyze how it compares to GLAD. Appendix C.11 lists further such techniques for future exploration.

Unrolled model for ADMM: Algorithm 2 describes the updates of the unrolled model ADMMu. ρnn was a 4-layer neural network and Λnn was a 2-layer neural network. Both used 3 hidden units in each layer. The non-linearity used for the hidden layers was tanh, while the final layer had a sigmoid (σ) non-linearity for both ρnn and Λnn. The learnable offset parameter of the initial Θ0 was set to t = 1. The model was unrolled for L = 30 iterations. The learning rates were chosen to be around [0.01, 0.1] and a multi-step LR scheduler was used. The optimizer used was Adam.

Algorithm 2: ADMMu
Function ADMMu-cell(Σ̂, Θ, Z, U, λ):
  λ ← Λnn(‖Z − Θ‖²_F, λ)
  Y ← λ⁻¹Σ̂ − Z + U
  Θ ← ½(−Y + (YᵀY + 4λ⁻¹I)^{1/2})
  For all i, j do
    ρij ← ρnn(Θij, Σ̂ij, Zij, λ)
    Zij ← η_ρij(Θij + Uij)
  U ← U + Θ − Z
  return Θ, Z, U, λ
Function ADMMu(Σ̂):
  Θ0 ← (Σ̂ + tI)⁻¹, λ0 ← 1
  For k = 0 to K − 1 do
    Θk+1, Zk+1, Uk+1, λk+1 ← ADMMu-cell(Σ̂, Θk, Zk, Uk, λk)
  return ΘK, ZK

Figure 10 compares GLAD with ADMMu in terms of convergence on synthetically generated data. The settings were kept the same as described in Figure 4. As evident from the plots, GLAD consistently performs better than ADMMu. We had similar observations for other sets of experiments as well. Hence, we chose the AM-based unrolled algorithm over the ADMM-based one, as it works better empirically and has fewer parameters.

Figure 10 (three panels, D = 100 with M = 20, 100 and 500 samples; x-axis: iterations): Convergence on Erdos-Renyi random graphs with fixed sparsity (p = 0.1). All the settings are the same as the fixed-sparsity case described in Figure 4. We see that the AM-based parameterization GLAD consistently performs better than the ADMM-based unrolled architecture ADMMu.

Although we are not entirely confident, we hypothesize the reason for the above observations as follows. In the ADMM update equations (4 & 5), both the Lagrangian term and the penalty term intuitively work together as a function to update the entries Θij, Zij. Observe that Uk can be absorbed into Zk and/or Θk, and we expect our neural networks to capture this relation. We thus expect GLAD to work at least as well as ADMMu. In our formulation of the unrolled ADMMu (Algorithm 2), the update step of U is not controlled by neural networks (as the number of parameters needed would be substantially larger), which might be the reason it does not perform as well as GLAD. Our empirical evaluations corroborate this logic: just by using the penalty term we can maintain all the desired properties and learn the problem-dependent functions with a small neural network.
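For concreteness, below is a rough PyTorch sketch of a single ADMMu cell written under our reading of Algorithm 2. The names (small_mlp, rho_net, lambda_net, sqrtm_psd, soft_threshold, admmu_cell) are illustrative, and the exact form of the Θ update (in particular the 4λ⁻¹I term) follows the closed-form solution of the quadratic subproblem rather than any released code.

import torch
import torch.nn as nn

def small_mlp(in_dim, hidden=3, depth=2):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers += [nn.Linear(d, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)

rho_net = small_mlp(4, depth=4)      # rho_ij from (Theta_ij, Sigma_ij, Z_ij, lambda)
lambda_net = small_mlp(2, depth=2)   # lambda from (||Z - Theta||_F^2, lambda)

def sqrtm_psd(s):
    # symmetric PSD matrix square root via eigendecomposition
    w, v = torch.linalg.eigh(s)
    return v @ torch.diag(torch.sqrt(torch.clamp(w, min=0.0))) @ v.T

def soft_threshold(x, rho):
    return torch.sign(x) * torch.clamp(torch.abs(x) - rho, min=0.0)

def admmu_cell(sigma_hat, theta, z, u, lam):
    # lam is a 0-dim tensor, e.g. torch.tensor(1.0) at the first iteration
    d = sigma_hat.shape[0]
    lam = lambda_net(torch.stack([((z - theta) ** 2).sum(), lam]).unsqueeze(0)).squeeze()
    y = sigma_hat / lam - z + u
    theta = 0.5 * (-y + sqrtm_psd(y.T @ y + (4.0 / lam) * torch.eye(d)))
    feats = torch.stack([theta, sigma_hat, z, lam * torch.ones(d, d)], dim=-1)
    rho = rho_net(feats).squeeze(-1)     # one learned threshold per entry (i, j)
    z = soft_threshold(theta + u, rho)
    u = u + theta - z
    return theta, z, u, lam

Unrolling K such cells starting from Θ0 = (Σ̂ + tI)⁻¹ and λ0 = 1 (with Z and U initialized, say, to Θ0 and 0) gives the ADMMu model; only ρnn and Λnn carry learnable parameters, which keeps the model compact.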
C.11 DIFFERENT DESIGNS TRIED FOR DATA-DRIVEN ALGORITHM

We tried multiple unrolled parameterizations of the optimization techniques used for solving the graphical lasso problem, which worked with varying levels of success. We list a few here, in the interest of helping researchers further pursue this recent approach to data-driven algorithm design.

1. ADMM + ALISTA parameterization: The threshold update for Z^AM_{k+1} can be replaced by the ALISTA network (Liu et al., 2018). Stage I of ALISTA determines W, which is trivial in our case since D = I, so we get W = I. Thus, combining the ALISTA updates with the AM updates, we get an interesting unrolled algorithm for our optimization problem.

2. G-ISTA parameterization: We parameterized the line search hyperparameter c and replaced the next-step-size determination step of Algorithm 1 in Rolfs et al. (2012) with a problem-dependent neural network. The main challenge with this parameterization is to maintain the PSD property of the intermediate matrices. Learning an appropriate parameterization of the line search hyperparameter such that the PSD condition is maintained remains an interesting aspect to investigate.

3. Mirror Descent Net: We get a similar set of update equations for the graphical lasso optimization. We identify some learnable parameters, use neural networks to make them problem dependent, and train them end-to-end.

4. For all these methods, we also tried unrolling the neural network itself. In our experience, the performance does not improve much but the convergence becomes unstable.

C.12 RESULTS ON REAL DATA

We use the real data from the DREAM 5 Network Inference challenge (Marbach et al., 2012). This dataset contains 3 compendia that were obtained from microorganisms, some of which are pathogens of clinical relevance. Each compendium consists of hundreds of microarray experiments, which include a wide range of genetic, drug, and environmental perturbations. We test our method on recovering the true E. coli network from the gene expression values recorded in actual microarray experiments. The E. coli dataset contains 4511 genes and 805 associated microarray experiments. The true underlying network has 2066 discovered edges, and 150214 pairs of nodes do not have an edge between them. There is no data about the remaining edges. For our experiments, we only consider the discovered edges as the ground truth, following the challenge data settings. After removing genes with zero degree, we are left with a subset of 1081 genes. For our predictions, we ignore the direction of the edges and only consider retrieving the connections between genes.

We train the GLAD model using the SynTReN simulator with settings similar to those described in Appendix C.9. Briefly, the GLAD model was trained on D = 50 node graphs sampled from Erdos-Renyi graphs with sparsity probability sampled from U(0.01, 0.1), SynTReN noise levels sampled from U(0.01, 0.1) and Θij ~ U(0.1, 0.2). The model was unrolled for 15 iterations. This experiment also evaluates GLAD's ability to generalize to a distribution different from training, as well as its ability to scale to a larger number of nodes. We report the AUC scores for the E. coli network in Table 4. We can see that GLAD improves over the other competing methods in terms of Area Under the ROC Curve (AUC). We understand that it is challenging to model real datasets due to the presence of many unknown latent extrinsic factors, but we do observe an advantage of using data-driven parameterized algorithm approaches.

Table 4: GLAD vs other methods on the DREAM network inference challenge real E. coli data.
Method:  BCD    G-ISTA  GLAD
AUC:     0.548  0.541   0.572
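The AUC values above are edge-recovery scores. The text does not spell out the scoring rule, so the following short sketch shows one standard way (our assumption, with illustrative names) to compute such an AUC, using the off-diagonal magnitudes of the estimated precision matrix as edge scores.

import numpy as np
from sklearn.metrics import roc_auc_score

def edge_auc(theta_hat, adjacency_true):
    # score each candidate undirected edge by |Theta_hat_ij| and compare
    # against the ground-truth adjacency on the strict upper triangle
    iu = np.triu_indices(theta_hat.shape[0], k=1)
    scores = np.abs(theta_hat)[iu]
    labels = (adjacency_true[iu] != 0).astype(int)
    return roc_auc_score(labels, scores)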
C.13 SCALING FOR LARGE MATRICES

We have shown in our experiments that we can train GLAD on a small number of nodes and still get reasonable results for recovering graph structures with a considerably larger number of nodes (Appendix C.12). Thus, in this section, we focus on scaling up the inference/test part. With the current GPU implementation, we can handle around 10,000 nodes for inference. For problem sizes with more than 100,000 nodes, we propose to use the randomized algorithm techniques given in Kannan & Vempala (2017). Kindly note that scaling up GLAD is ongoing work and we present here just one of the directions we are exploring. The approach presented below is meant to give a rough idea and may contain loose ends.

Randomized algorithm techniques are explained elaborately in Kannan & Vempala (2017). Specifically, we will use some of their key results:
P1 (Theorem 2.1): We will use the length-squared sampling technique to obtain low-rank approximations.
P2 (Theorem 2.5): Any large matrix A ∈ R^{m×n} can be approximated as A ≈ CUR, where C ∈ R^{m×s} consists of sampled columns, R ∈ R^{r×n} consists of sampled rows, and U ∈ R^{s×r}.
P3 (Section 2.3): For any large matrix A ∈ R^{m×n}, we can get its approximate SVD by using the property E(RᵀR) = AᵀA, where R is a matrix obtained by length-squared sampling of the rows of A.

The steps for approximating the AM updates in equations (7) and (8) are as follows. Using property P3, we can approximate YᵀY ≈ RᵀR, so that (YᵀY + 4λ⁻¹I)^{1/2} ≈ V(Σ² + 4λ⁻¹I)^{1/2}Vᵀ, where Σ contains the singular values and V the right singular vectors of R. We can then combine this approximation with the sketch-matrix approximation Y ≈ CUR to calculate the update in equation (7). Equation (8) is just a thresholding operation and can be done efficiently with a careful implementation. We are looking into both the experimental and the theoretical aspects of this approach.

We are also exploring an efficient distributed algorithm for GLAD. We are investigating parallel MPI-based algorithms for this task (https://stanford.edu/~boyd/admm.html is a good reference point). We leverage the fact that the learned neural networks are very small, so that we can duplicate them across all the processors. This is also an interesting future research direction.
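To make the proposed approximation a bit more tangible, here is a rough NumPy sketch of the length-squared sampling step (P3) and the resulting approximation of (YᵀY + 4λ⁻¹I)^{1/2}. The function names, the sample size s, and the reading of the 4λ⁻¹ constant are our own assumptions; this is only a sketch of the direction described above, not a vetted implementation.

import numpy as np

def length_squared_rows(y, s, seed=0):
    # sample s rows of y with probability proportional to their squared norms;
    # the rescaling makes E[R^T R] = Y^T Y (property P3)
    rng = np.random.default_rng(seed)
    row_sq = (y ** 2).sum(axis=1)
    probs = row_sq / row_sq.sum()
    idx = rng.choice(y.shape[0], size=s, replace=True, p=probs)
    return y[idx] / np.sqrt(s * probs[idx])[:, None]

def approx_sqrt_term(y, lam, s=200, seed=0):
    d = y.shape[1]
    r = length_squared_rows(y, s, seed)
    _, sing, vt = np.linalg.svd(r, full_matrices=False)   # R = U diag(sing) V^T
    c = 4.0 / lam
    # R^T R has eigenvalues sing^2 on span(V) and 0 on its orthogonal complement, so
    # (R^T R + c I)^(1/2) = V diag(sqrt(sing^2 + c) - sqrt(c)) V^T + sqrt(c) I
    core = np.sqrt(sing ** 2 + c) - np.sqrt(c)
    return vt.T @ (core[:, None] * vt) + np.sqrt(c) * np.eye(d)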