# Learning Latent Graph Structures and their Uncertainty

Alessandro Manenti¹, Daniele Zambon¹, Cesare Alippi¹ ²

Graph neural networks use relational information as an inductive bias to enhance prediction performance. Not rarely, task-relevant relations are unknown, and graph structure learning approaches have been proposed to learn them from data. Given their latent nature, no graph observations are available to provide a direct training signal to the learnable relations. Therefore, graph topologies are typically learned on the prediction task alongside the other graph neural network parameters. In this paper, we demonstrate that minimizing point-prediction losses does not guarantee proper learning of the latent relational information and its associated uncertainty. Conversely, we prove that suitable loss functions on the stochastic model outputs simultaneously grant solving two tasks: (i) learning the unknown distribution of the latent graph and (ii) achieving optimal predictions of the target variable. Finally, we propose a sampling-based method that solves this joint learning task.¹ Empirical results validate our theoretical claims and demonstrate the effectiveness of the proposed approach.

## 1. Introduction

Relational information processing has provided breakthroughs in the analysis of rich and complex data coming from, e.g., social networks, natural language, and biology. This side information takes various forms, from structuring the data into clusters to defining causal relations and hierarchies, and enables machine learning models to condition their predictions on dependency-related observations. In this context, predictive models take the form $y = f_\psi(x, A)$, where the input-output relation $x \mapsto y$ modeled by $f_\psi$ and its parameters $\psi$ is conditioned on the relational information encoded in variable $A$.

¹Università della Svizzera italiana, IDSIA, Lugano, Switzerland. ²Politecnico di Milano, Milan, Italy. Correspondence to: Alessandro Manenti. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). ¹Code available at https://github.com/allemanenti/Learning-Calibrated-Structures

Graph Neural Networks (GNNs) (Scarselli et al., 2008) are one example of models of this kind that rely on a graph structure represented as an adjacency matrix $A$, and they have proven successful in a plethora of applications (Fout et al., 2017; Shlomi et al., 2020). Indeed, relational information is needed to implement such a relational inductive bias and, in some cases, it is provided at the application design phase. More frequently, however, such topological information is not rich enough to address the problem at hand, and not seldom it is completely unavailable. Therefore, Graph Structure Learning (GSL) emerges as an approach to learn the graph topology (Kipf et al., 2018; Franceschi et al., 2019; Yu et al., 2021; Fatemi et al., 2021; Zhu et al., 2021; Cini et al., 2023) alongside the predictive model $f_\psi$. This entails formulating a joint learning process that learns the adjacency matrix $A$, or a parameterization of it, along with the predictor's parameters $\psi$. This is usually achieved by optimizing a loss function on the model output $y$, e.g., a point-prediction measure based on the square or the absolute prediction error.
Different sources of uncertainty affect the graph structure learning process, including epistemic uncertainty in the data and variability inherent in the data-generating process. Learning appropriate models of the data-generating process can provide valuable insights into the modeled environment, with uncertainty quantification enhancing explainability and interpretability, ultimately enabling more informed decision-making. Examples of applications are found in the study of infection and information spreading, as well as biological systems (Gomez Rodriguez et al., 2013; Lokhov, 2016; Deleu et al., 2022). It follows that a probabilistic framework is appropriate to accurately capture the uncertainty in the learned relations whenever randomness affects the graph topology. Probabilistic approaches have been devised in recent years. For instance, Franceschi et al. (2019); Zhang et al. (2019); Elinas et al. (2020); Cini et al. (2023) propose methods that learn a parametric distribution $P^\theta_A$ over the latent graph structure $A$. However, none of them have studied whether these approaches are able to learn a calibrated latent distribution $P^\theta_A$ for $A$, properly reflecting the uncertainty associated with the learned topology.

In this paper, we fill this gap by addressing the joint problem of learning a predictive model yielding optimal point-prediction performance on the output $y$ and, contextually, a calibrated distribution for the latent adjacency matrix $A$. In particular, the novel contributions can be summarized as:

1. We demonstrate that models trained to achieve optimal point predictions do not guarantee calibration of the adjacency matrix distribution [Section 4].
2. We provide theoretical conditions on the predictive model and loss function that guarantee simultaneous calibration of the latent variable and optimal point predictions [Section 5].
3. We propose a theoretically grounded sampling-based learning method to address the joint learning problem [Section 5].
4. We empirically validate the theoretical developments and claims presented in this paper and show that the proposed approach outperforms existing methods in solving the joint learning task [Section 6].

Finally, we emphasize the significance of our contribution. The inherent latent nature of $A$ presents substantial learning challenges. Real-world applications rarely provide direct observations of the (latent) graph structure, making it impossible to use such data as learning signals for training the graph distribution $P^\theta_A$. This lack of real-world observations not only hampers model training but also complicates empirical evaluation of the learned latent distribution. Consequently, flawed decisions may be derived from learned models. This work addresses these limitations by (i) establishing theoretical guarantees for more robust learning of the latent variable to mitigate the need for evaluation on real data, and (ii) conducting a rigorous empirical analysis on synthetic datasets that provide the otherwise missing ground-truth knowledge required for an accurate validation of our claims.

## 2. Related Work

**Graph Structure Learning** GSL is often employed end-to-end with a predictive model to better solve a downstream task.
Examples include applications within graph deep learning methods for static (Jiang et al., 2019; Yu et al., 2021; Kazi et al., 2022) and temporal data (Wu et al., 2019; 2020; Cini et al., 2023; De Felice et al., 2024); a recent review is provided by Zhu et al. (2021). Some approaches from the literature model the latent graph structure as stochastic (Kipf et al., 2018; Franceschi et al., 2019; Elinas et al., 2020; Shang et al., 2021; Cini et al., 2023), mainly as a way to enforce sparsity of the adjacency matrix. To operate on discrete latent random variables, Franceschi et al. (2019) utilize straight-through gradient estimations, Cini et al. (2023) rely on score-based gradient estimators, while Niepert et al. (2021) design an implicit maximum likelihood estimation strategy. A different line of research is rooted in graph signal processing, where the graph is estimated from a constrained optimization problem and the smoothness assumption of the signals (Kalofolias, 2016; Dong et al., 2016; Mateos et al., 2019; Coutino et al., 2020; Pu et al., 2021).

A few works from the Bayesian literature have tackled the task of estimating uncertainties associated with graph edges. The model-based approaches by Lokhov (2016) and Gray et al. (2020) are two examples tackling relevant applications benefiting from uncertainty quantification. Within the deep learning literature, Zhang et al. (2019) propose a Bayesian Neural Network (BNN) modeling the random graph realizations. Differently, Wasserman & Mateos (2024) develop an interpretable BNN designed over graph signal processing principles using unrolled dual proximal gradient iterations. While some results on output calibration are exhibited, to the best of our knowledge, no guarantee or evidence of calibration of the latent variable is provided, which we study in this paper instead.

**Calibration of the model's output** Research on model calibration has primarily focused on obtaining accurate and consistent predictions of the statistical properties of the target (random) variables $y$, from which uncertainty estimates on the model's predictions are derived. For discrete outputs, such as in classification tasks, Guo et al. (2017) investigated the calibration of modern deep learning models and proposed temperature scaling as a solution. Other techniques in the same context include Histogram Binning (Zadrozny & Elkan, 2001), Cross Entropy loss with label smoothing (Müller et al., 2019), and Focal Loss (Mukhoti et al., 2020). For continuous output distributions, Laves et al. (2020) proposed σ scaling, while Kuleshov et al. (2018) developed a technique inspired by Platt scaling. More recently, conformal prediction techniques (Shafer & Vovk, 2008) have gained popularity for providing confidence intervals in predictions. We stress that within this paper, we are mainly concerned with latent variable calibration, rather than output calibration, although the two are related to each other.

**Deep latent variable models** Latent variables are extensively used in deep generative modeling (Kingma & Welling, 2013; Rezende et al., 2014), both with continuous and discrete latent variables (Van Den Oord et al., 2017; Bartler et al., 2019). In deep models, latent random variables often lack direct physical meaning, with only the outputs being collected for training. Therefore, studies have mainly focused on maximizing the likelihood of the observed outputs in the training set, rather than calibrating the latent distribution.
A few works have proposed regularization of the latent space to improve stability and accuracy (Xu & Durrett, 2018; Joo et al., 2020), to facilitate smoother transitions in the output when the latent variable is slightly modified (Hadjeres et al., 2017), and to apply other techniques aimed at enhancing data generation or improving model performance in general (Connor et al., 2021). To the best of our knowledge, no prior work has studied the joint learning problem of calibrating the latent graph distribution while achieving optimal point predictions.

## 3. Problem Formulation

Consider a set of $N$ interacting entities and the data-generating process

$$\begin{cases} A \sim P^*_A \\ y = f^*(x, A) \end{cases} \tag{1}$$

where $y \in \mathcal{Y}$ is the system output obtained from input $x \in \mathcal{X}$ through function $f^*$ and conditioned on a realization of the latent adjacency matrix $A \in \mathcal{A} \equiv \{0, 1\}^{N \times N}$ drawn from distribution $P^*_A$; input $x$ is assumed to be drawn from any distribution $P_x$, and superscript $*$ refers to unknown entities. Each entry of the adjacency matrix $A$ is a binary value encoding the existence of a pairwise relation between two nodes. In the sequel, $x \in \mathcal{X} \subseteq \mathbb{R}^{N \times d_{in}}$ and $y \in \mathcal{Y} \subseteq \mathbb{R}^{N \times d_{out}}$ are stacks of $N$ node-level feature vectors of dimension $d_{in}$ and $d_{out}$, respectively, representing continuous inputs and outputs.

Given a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ of $n$ input-output observations from (1), we aim at learning a probabilistic predictive model

$$\begin{cases} A \sim P^\theta_A \\ \hat{y} = f_\psi(x, A) \end{cases} \tag{2}$$

from $\mathcal{D}$, while learning at the same time a distribution $P^\theta_A$ approximating $P^*_A$. The two parameter vectors $\theta$ and $\psi$ are trained to approximate distinct entities in (1), namely the distribution $P^*_A$ and function $f^*$, respectively. We assume:

**Assumption 3.1.** The family $\{P^\theta_A\}$ of probability distributions $P^\theta_A$ parametrized by $\theta$ and the family of predictive functions $\{f_\psi\}$ are expressive enough to contain the true latent distribution $P^*_A$ and function $f^*$, respectively.

Assumption 3.1 implies that $f^* \in \{f_\psi\}$ and $P^*_A \in \{P^\theta_A\}$, but does not request uniqueness of the parameter vectors $\psi^*$ and $\theta^*$ such that $f_{\psi^*} = f^*$ and $P^{\theta^*}_A = P^*_A$. Under such an assumption, the minimum function approximation error is null and we can focus on the theoretical conditions requested to guarantee successful learning, i.e., achieving both optimal point predictions and latent distribution calibration. In Section 6.2, we empirically show that the theoretical results can extend beyond this assumption in practice.

**Optimal point predictions** Outputs $y$ and $\hat{y}$ of probabilistic models (1) and (2) are random variables following push-forward distributions² $P^*_{y|x}$ and $P^{\theta,\psi}_{y|x}$, respectively. A single point prediction $y^{PP} \in \mathcal{Y}$ can be obtained through an appropriate functional $T[\cdot]$ as

$$y^{PP} = y^{PP}(x, \theta, \psi) \equiv T\big[P^{\theta,\psi}_{y|x}\big]. \tag{3}$$

For example, $T$ can be the expected value or the value at a specific quantile. We then define an optimal predictor as one whose parameters $\theta$ and $\psi$ minimize the expected point-prediction loss

$$\mathcal{L}_{point}(\theta, \psi) = \mathbb{E}_{x \sim P_x}\Big[\mathbb{E}_{y \sim P^*_{y|x}}\big[\ell\big(y, y^{PP}(x, \theta, \psi)\big)\big]\Big] \tag{4}$$

between the system output $y$ and the point prediction $y^{PP}$, as measured by a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$. Statistical functional $T$ is coupled with the loss $\ell$, as the optimal functional $T$ to employ given a specific loss $\ell$ is often known (Berger, 1990; Gneiting, 2011) when $P^{\theta,\psi}_{y|x}$ approximates $P^*_{y|x}$ well. For instance, if $\ell$ is the Mean Absolute Error (MAE), the associated functional $T$ is the median; if $\ell$ is the Mean Squared Error (MSE), the associated functional is the expected value.
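As an illustration of model (2) and of the point-prediction functional $T$ in (3), the following is a minimal PyTorch sketch, assuming independent Bernoulli edges parameterized by logits and a generic GNN `f_psi`; class and variable names are illustrative and are not taken from the released code.

```python
import torch

class LatentGraphPredictor(torch.nn.Module):
    """Probabilistic model (2): sample A ~ P_A^theta, then predict y_hat = f_psi(x, A)."""

    def __init__(self, num_nodes, f_psi):
        super().__init__()
        # Edge logits parameterize independent Bernoulli edges (one theta_ij per node pair).
        self.edge_logits = torch.nn.Parameter(torch.zeros(num_nodes, num_nodes))
        self.f_psi = f_psi  # any GNN mapping (x, A) -> y_hat

    def sample_adjacency(self, n_samples):
        probs = torch.sigmoid(self.edge_logits)               # theta_ij in (0, 1)
        return torch.bernoulli(probs.expand(n_samples, -1, -1))

    def forward(self, x, n_samples=16):
        adjs = self.sample_adjacency(n_samples)               # (n_samples, N, N)
        return torch.stack([self.f_psi(x, A) for A in adjs])  # samples of P_{y|x}^{theta,psi}

    def point_prediction(self, x, n_samples=64):
        # Functional T = expectation, the optimal choice when ell is the MSE (Eqs. 3-4).
        return self.forward(x, n_samples).mean(dim=0)
```

Note that sampling $A$ is not differentiable in the edge logits; this is one reason why Section 5.2 resorts to a score-function gradient estimator for $\theta$.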
**Latent distribution calibration** Calibration of a parametrized distribution $P^\theta_A$ requires learning parameters $\theta$ so that $P^\theta_A$ aligns with the true distribution $P^*_A$. Quantitatively, a dissimilarity measure $d_{cal}: \mathcal{P}_\mathcal{A} \times \mathcal{P}_\mathcal{A} \to \mathbb{R}^+$, defined over a set $\mathcal{P}_\mathcal{A}$ of distributions on $\mathcal{A}$, assesses how close two distributions are. The family of f-divergences (Rényi, 1961), such as the Kullback-Leibler divergence, and the integral probability metrics (Müller, 1997), such as the maximum mean discrepancy (Gretton et al., 2012), are examples of such dissimilarity measures. In this paper, we are interested in those discrepancies for which $d_{cal}(P_1, P_2) = 0 \iff P_1 = P_2$ holds. It follows that the latent distribution $P^\theta_A$ is calibrated on $P^*_A$ if it minimizes the latent distribution loss

$$\mathcal{L}_{cal} = \mathbb{E}_{x \sim P_x}\big[d_{cal}\big(P^*_A, P^\theta_A\big)\big], \tag{5}$$

or simply $\mathcal{L}_{cal} = d_{cal}\big(P^*_A, P^\theta_A\big)$ when $A$ and $x$ are independent.

The problem of designing a predictive model (2) that both yields optimal point predictions (i.e., minimizes $\mathcal{L}_{point}$ in (4)) and calibrates the latent distribution (i.e., minimizes $\mathcal{L}_{cal}$ in (5)) is non-trivial for two main reasons. First, as the latent distribution $P^*_A$ is unknown (and no samples from it are available), we cannot directly estimate $\mathcal{L}_{cal}$. Second, as shown in Section 4, multiple sets of $\theta$ parameters may minimize $\mathcal{L}_{point}$ without minimizing $\mathcal{L}_{cal}$.

²The distribution of $y = f^*(x, A)$ originated from $P^*_A$ and of $\hat{y} = f_\psi(x, A)$ originated from $P^\theta_A$.

## 4. Limitations of Point-Prediction Optimization

In this section, we demonstrate that the optimization of a point-prediction loss (Equation (4)) does not generally grant calibration of the latent random variable $A$.

**Proposition 4.1.** Consider Assumption 3.1. Loss function $\mathcal{L}_{point}(\theta, \psi)$ in (4) is minimized by all $\theta$ and $\psi$ s.t. $T\big[P^{\theta,\psi}_{y|x}\big] = T\big[P^*_{y|x}\big]$ almost surely on $x$ and, in particular,

$$\mathcal{L}_{point}(\theta, \psi) \text{ is minimal} \;\;\not\Longrightarrow\;\; P^{\theta,\psi}_{y|x} = P^*_{y|x}.$$

The proof of the proposition is given in Appendix A.1; we provide a counterexample for which calibration is not granted even when the processing function $f_\psi$ is equal to $f^*$ in Appendix A.2. The limitation of point-prediction losses is also empirically demonstrated in Section 6.3, Table 2, where it is shown that optimizing point-prediction losses does not grant calibration.

Given the provided negative result and the impossibility of assessing loss $\mathcal{L}_{cal}$ in (5), in the next section we propose another optimization objective that, as we will prove, allows us to both calibrate the latent random variable and obtain optimal point predictions.

## 5. Predictive Distribution Optimization: Two Birds with One Stone

In this section, we show that we can achieve an optimal point predictor (2) and a calibrated latent distribution $P^\theta_A$ by comparing the push-forward distributions $P^*_{y|x}$ and $P^{\theta,\psi}_{y|x}$ of the outputs conditioned on input $x$. In particular, Theorem 5.2 below proves that, under appropriate conditions, minimization of the output distribution loss

$$\mathcal{L}_{dist}(\theta, \psi) = \mathbb{E}_{x \sim P_x}\big[d\big(P^*_{y|x}, P^{\theta,\psi}_{y|x}\big)\big] \tag{6}$$

provides a calibrated $P^\theta_A$, even when $P^*_A$ is not available; $d: \mathcal{P}_\mathcal{Y} \times \mathcal{P}_\mathcal{Y} \to \mathbb{R}^+$ is a dissimilarity measure between distributions over space $\mathcal{Y}$. We assume the following on dissimilarity measure $d$.

**Assumption 5.1.** $d(P_1, P_2) \ge 0$ for all distributions $P_1$ and $P_2$ in $\mathcal{P}_\mathcal{Y}$, and $d(P_1, P_2) = 0$ if and only if $P_1 = P_2$.

Several choices of $d$ meet Assumption 5.1, e.g., f-divergences and some integral probability metrics (Müller, 1997); the dissimilarity measure employed in this paper is discussed in Section 5.1.

**Theorem 5.2.**
*Let $I = \{x : A \mapsto f^*(x, A) \text{ is injective}\} \subseteq \mathcal{X}$ be the set of points $x \in \mathcal{X}$ such that the map $A \mapsto f^*(x, A)$ is injective. Under Assumptions 3.1 and 5.1, if $\Pr_{x \sim P_x}(I) > 0$ and $\psi^*$ is such that $f_{\psi^*} = f^*$, then*

$$\mathcal{L}_{dist}(\theta, \psi^*) = 0 \;\Longrightarrow\; \begin{cases} \mathcal{L}_{point}(\theta, \psi^*) \text{ is minimal} \\ \mathcal{L}_{cal}(\theta) = 0. \end{cases}$$

Theorem 5.2 is proven in Appendix A.3. Under the theorem's hypotheses, a predictor that minimizes $\mathcal{L}_{dist}$ is both calibrated on the latent random distribution and provides optimal point predictions. This overcomes the limits of Proposition 4.1, where optimization of $\mathcal{L}_{point}(\theta, \psi^*)$ does not grant $\mathcal{L}_{cal}(\theta) = 0$.

The hypotheses under which Theorem 5.2 holds are rather mild. In fact, condition $\Pr_{x \sim P_x}(I) > 0$ pertains to the data-generating process and intuitively ensures that, for some $x$, different latent random variables produce different outputs. A sufficient condition for $\Pr_{x \sim P_x}(I) > 0$ to hold is the existence of a point $\bar{x}$ in the support of $P_x$ such that $A \mapsto f^*(\bar{x}, A)$ is injective, with $f^*$ continuous w.r.t. $x$; see Corollary A.1 in Appendix A.3. Although only a single point $\bar{x}$ is required, having more points that satisfy the condition simplifies the training of the parameters. Corollary A.1 holds for arbitrarily complex processing functions $f^*$. More specifically, when considering simple GNN layers and discrete latent matrices $A$, we can prove that the condition $\Pr_{x \sim P_x}(I) > 0$ is, except for pathological cases, always satisfied (see Proposition A.2 in Appendix A.3). Instead, condition $f_{\psi^*} = f^*$ is set to avoid scenarios of different, yet equivalent,³ representations of the latent distribution. An empirical analysis of the theorem's assumptions is provided in Section 6.2, demonstrating that the theoretical results hold in practice even when those assumptions do not strictly apply.

Assumptions 3.1 and 5.1 can be met with an appropriate choice of model (2) and measure $d$; as such, they are controllable by the designer. Assumption 5.1 prevents obtaining mismatched output distributions when $\mathcal{L}_{dist}(\theta, \psi) = 0$ and can be easily satisfied. As mentioned above, popular measures, e.g., the Kullback-Leibler divergence, meet the theorem's assumptions and therefore can be adopted as $d$. However, as f-divergences rely on the explicit evaluation of the likelihood of $y$, they are not always practical to compute (Mohamed & Lakshminarayanan, 2016). For this reason, we propose considering the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as a versatile alternative that allows Monte Carlo computation without requiring evaluations of the likelihood w.r.t. the output distributions $P^*_{y|x}$ and $P^{\theta,\psi}_{y|x}$. Energy distances (Székely & Rizzo, 2013) provide an alternative feasible choice.

³E.g., $f_\psi(A, x) = f^*(1 - A, x)$ with $P^\theta_A$ encoding the absence of edges instead of their presence as in $P^*_A$.

### 5.1. Maximum Mean Discrepancy

Given two distributions $P_1, P_2 \in \mathcal{P}_\mathcal{Y}$, the MMD can be defined as

$$\mathrm{MMD}_{\mathcal{G}}[P_1, P_2] = \sup_{g \in \mathcal{G}} \Big( \mathbb{E}_{y \sim P_1}[g(y)] - \mathbb{E}_{y \sim P_2}[g(y)] \Big), \tag{7}$$

i.e., the supremum, taken over a set $\mathcal{G}$ of functions $\mathcal{Y} \to \mathbb{R}$, of the difference between expected values w.r.t. $P_1$ and $P_2$. An equivalent form is derived for a generic kernel function $\kappa(\cdot, \cdot): \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$:

$$\mathrm{MMD}^2_{\mathcal{G}_\kappa}[P_1, P_2] = \mathbb{E}_{y_1, y_1' \sim P_1}\big[\kappa(y_1, y_1')\big] - 2\, \mathbb{E}_{\substack{y_1 \sim P_1 \\ y_2 \sim P_2}}\big[\kappa(y_1, y_2)\big] + \mathbb{E}_{y_2, y_2' \sim P_2}\big[\kappa(y_2, y_2')\big], \tag{8}$$

which is associated with the unit ball $\mathcal{G}_\kappa$ of functions in the reproducing kernel Hilbert space of $\kappa$; note that (8) is the square of (7). Moreover, when universal kernels are considered (e.g., the Gaussian one), (8) fulfills Assumption 5.1 (see Theorem 5 in Gretton et al. (2012)).
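For concreteness, the kernel form (8) admits a straightforward Monte Carlo estimate from two sets of samples. Below is a minimal sketch assuming a Gaussian kernel (one admissible universal kernel); the function names are illustrative.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    # a: (n, d), b: (m, d) -> pairwise kernel matrix of shape (n, m)
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(samples_p, samples_q, sigma=1.0):
    """Monte Carlo estimate of MMD^2 between P and Q from samples, following Eq. (8)."""
    k_pp = gaussian_kernel(samples_p, samples_p, sigma).mean()
    k_pq = gaussian_kernel(samples_p, samples_q, sigma).mean()
    k_qq = gaussian_kernel(samples_q, samples_q, sigma).mean()
    return k_pp - 2 * k_pq + k_qq
```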
The dissimilarity in (8) can be conveniently estimated via Monte Carlo (MC) and employed within a deep learning framework. Accordingly, we set $d = \mathrm{MMD}^2_{\mathcal{G}_\kappa}$ and learn parameter vectors $\psi$ and $\theta$ by minimizing $\mathcal{L}_{dist}(\theta, \psi)$ via gradient-descent methods.

### 5.2. Finite-Sample Computation of the Loss

To compute the gradient of $\mathcal{L}_{dist}(\theta, \psi) = \mathbb{E}_{x \sim P_x}\big[\mathrm{MMD}^2_{\mathcal{G}_\kappa}\big[P^{\theta,\psi}_{y|x}, P^*_{y|x}\big]\big]$ w.r.t. parameter vectors $\psi$ and $\theta$, we rely on MC sampling to estimate in (6) the expectations over input $x \sim P_x$, target output $y \sim P^*_{y|x}$, and model output $\hat{y} \sim P^{\theta,\psi}_{y|x}$. This amounts to substituting $\mathrm{MMD}^2_{\mathcal{G}_\kappa}$ with

$$\widehat{\mathrm{MMD}}^2_{\theta,\psi}(x, y) = \frac{1}{N_{adj}^2} \sum_{i,j=1}^{N_{adj}} \kappa(\hat{y}_i, \hat{y}_j) \; - \; \frac{2}{N_{adj}} \sum_{j=1}^{N_{adj}} \kappa(y, \hat{y}_j), \tag{9}$$

where $N_{adj} \ge 1$ is the number of adjacency matrices sampled from $P^\theta_A$ to obtain output samples $\hat{y}_i = f_\psi(x, A_i) \sim P^{\theta,\psi}_{y|x}$, whereas $(x, y)$ is a pair from the training set $\mathcal{D}$. We remark that in (9) the third term of (8), i.e., the one associated with the double expectation with respect to $P^*_{y|x}$, is neglected as it does not depend on $\psi$ and $\theta$.

Gradient $\nabla_\psi \mathcal{L}_{dist}(\theta, \psi)$ is computed via automatic differentiation by averaging $\nabla_\psi \widehat{\mathrm{MMD}}^2_{\theta,\psi}(x_i, y_i)$ within a mini-batch of observed data pairs $(x_i, y_i) \in \mathcal{D}$. For $\nabla_\theta \mathcal{L}_{dist}(\theta, \psi)$, the same approach is not feasible. This limitation arises because the gradient is computed with respect to the same parameter vector $\theta$ that defines the integrated distribution. Here, we rely on a score-function gradient estimator (SFE) (Williams, 1992; Mohamed et al., 2020), which uses the log-derivative trick to rewrite the gradient of an expected loss $L$ as $\nabla_\theta \mathbb{E}_{A \sim P^\theta}[L(A)] = \mathbb{E}_{A \sim P^\theta}[L(A)\, \nabla_\theta \log P^\theta(A)]$, with $P^\theta(A)$ denoting the likelihood of $A \sim P^\theta$. Applying the SFE to our problem, the gradient w.r.t. $\theta$ reads

$$\nabla_\theta \mathcal{L}_{dist} = \mathbb{E}_{x,y}\Big[ \mathbb{E}\big[\kappa(\hat{y}_1, \hat{y}_2)\, \nabla_\theta \log\big(P^{\theta,\psi}_{y|x}(\hat{y}_1)\, P^{\theta,\psi}_{y|x}(\hat{y}_2)\big)\big] - 2\, \mathbb{E}\big[\kappa(y, \hat{y})\, \nabla_\theta \log P^{\theta,\psi}_{y|x}(\hat{y})\big] \Big], \tag{10}$$

where $\hat{y}_1, \hat{y}_2, \hat{y} \sim P^{\theta,\psi}_{y|x}$. An apparent setback of SFEs is their high variance (Mohamed et al., 2020), which we address in Section 5.3 by deriving a variance-reduction technique based on control variates that requires negligible computational overhead.

### 5.3. Variance-Reduced Loss for SFE

Two natural approaches to reduce the variance of MC estimates of (10) involve (i) increasing the number $B$ of training data points in the mini-batch used for each gradient estimate and (ii) increasing the number $N_{adj}$ of adjacency matrices sampled for each data point in (9). These techniques act on two different sources of noise. Increasing $B$ decreases the variance coming from the data-generating process, whereas increasing $N_{adj}$ improves the approximation of the predictive distribution $P^{\theta,\psi}_{y|x}$. Nonetheless, having fixed $B$ and $N_{adj}$, it is possible to further reduce the latter source of variance by employing the control variates method (Mohamed et al., 2020) which, in our case, requires only a negligible computational overhead but sensibly improves the training speed (see Section 6).

Consider the expectation $\mathbb{E}_{A \sim P^\theta}[L(A)\, \nabla_\theta \log P^\theta(A)]$ of the SFE; both terms in (10) can be cast into that form. With the control variates method, a function with null expectation is subtracted from $L(A)\, \nabla_\theta \log P^\theta(A)$:

$$G(A) = L(A)\, \nabla_\theta \log P^\theta(A) - \beta\big( h(A) - \mathbb{E}_{A \sim P^\theta}[h(A)] \big), \tag{11}$$

which leads to a reduced variance in the MC estimator of the gradient while keeping it unbiased. In this paper, we set function $h(A)$ to $\nabla_\theta \log P^\theta(A)$ and show how to compute a near-optimal choice for the scalar value $\beta$, often called baseline in the literature. As the expected value of $\nabla_\theta \log P^\theta(A)$ is zero, gradient (10) rewrites as

$$\nabla_\theta \mathcal{L}_{dist} = \mathbb{E}_{x,y}\Big[ \mathbb{E}_{A_1, A_2}\big[ \big(\kappa(\hat{y}_1, \hat{y}_2) - \beta_1\big)\, \nabla_\theta \log\big(P^\theta_A(A_1)\, P^\theta_A(A_2)\big) \big] - 2\, \mathbb{E}_{A}\big[ \big(\kappa(y, \hat{y}) - \beta_2\big)\, \nabla_\theta \log P^\theta_A(A) \big] \Big]. \tag{12}$$
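In automatic-differentiation frameworks, the $\theta$-gradient (12) can be realized as a surrogate loss whose gradient coincides with the score-function estimate. The following is a minimal PyTorch sketch under that reading, with the baselines corresponding to $\beta_1$ and $\beta_2$ derived in (13) below; all names are illustrative and do not reproduce the released implementation.

```python
import torch

def sfe_surrogate_loss(kernel_cross, kernel_pairs, log_p_adj, beta1, beta2):
    """Surrogate loss whose gradient w.r.t. theta matches the score-function estimate (12).

    kernel_cross: (N_adj,) values kappa(y, y_hat_j) against the observed target y
    kernel_pairs: (N_adj, N_adj) values kappa(y_hat_i, y_hat_j)
    log_p_adj:    (N_adj,) log-likelihoods log P_A^theta(A_j), differentiable in theta
    beta1, beta2: scalar baselines (Eq. 13), treated as constants
    """
    # Pair term: kernel weights are detached so gradients flow only through log P_A^theta.
    log_p_pairs = log_p_adj[:, None] + log_p_adj[None, :]          # log P(A_i) + log P(A_j)
    pair_term = ((kernel_pairs.detach() - beta1) * log_p_pairs).mean()
    # Cross term with the observed target y.
    cross_term = ((kernel_cross.detach() - beta2) * log_p_adj).mean()
    return pair_term - 2.0 * cross_term
```

In practice this surrogate would be summed with the finite-sample MMD loss (9), so that $\psi$ receives its gradient by plain backpropagation while $\theta$ receives the score-function gradient.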
In Appendix B, we show that in our setup the best values of $\beta_1$ and $\beta_2$ are approximated by

$$\beta_1 = \mathbb{E}_{x}\Big[\mathbb{E}_{A_1, A_2 \sim P^\theta_A}\big[\kappa\big(f_\psi(x, A_1), f_\psi(x, A_2)\big)\big]\Big], \qquad \beta_2 = \mathbb{E}_{x, y}\Big[\mathbb{E}_{A \sim P^\theta_A}\big[\kappa\big(y, f_\psi(x, A)\big)\big]\Big], \tag{13}$$

which can be efficiently computed via MC, reusing the kernel values already computed for (12).

### 5.4. Computational Complexity

Focusing on the most significant terms, for every data pair $(x, y)$ in the training set, computing the loss $\mathcal{L}_{dist}$ requires $O(N_{adj}^2)$ kernel evaluations $\kappa(\hat{y}_i, \hat{y}_j)$ in (9), $O(N_{adj})$ forward passes through the GNN $\hat{y}_i = f_\psi(x, A_i)$ in (9), and $O(N_{adj})$ likelihood computations $P^\theta_A(A_i)$ in (12). The computation of baselines $\beta_1$ and $\beta_2$ in (13) requires virtually no overhead, as commented in Section 5.3. Similarly, computing the loss gradients requires $O(N_{adj}^2)$ derivatives for what concerns the kernels, and $O(N_{adj})$ gradients $\nabla_\psi \hat{y}_i$ and $\nabla_\theta \log P^\theta_A(A_i)$. We empirically observed that for $N_{adj} \geq 16$, both the latent distribution loss $\mathcal{L}_{cal}$ and the point-prediction loss $\mathcal{L}_{point}$ of final models are equivalent for the considered problem. This suggests that $N_{adj}$ is not a critical hyperparameter. Since we can employ sparse representations of adjacency matrices, the GNN processing costs scale linearly in the number of nodes $N$ for bounded-degree graphs. From our experience, the GNN processing is the most demanding operation, and the cost of quadratic components, such as the parameterization of $\theta_{ij}$, does not pose significant overhead.

## 6. Experiments

This section empirically validates the proposed technique and the main claims of the paper. While point predictions can be evaluated on observed input-output pairs $(x, y)$ provided as a test set, assessing latent-variable calibration performance (the discrepancy between $P^*_A$ and the learned $P^\theta_A$) requires knowledge of the ground-truth latent distribution itself or of observations thereof. Such ground-truth knowledge, however, is not available in real-world datasets, as the latent distribution is indeed unknown. Therefore, to validate the theoretical results, we designed a synthetic dataset that allows us to evaluate different performance metrics on both $y$ and $A$. We remark that the latent distribution is used only to assess performance and does not drive the model training in any way.

Section 6.1 demonstrates that the proposed approach can successfully solve the joint learning problem across different graph sizes and highlights the effectiveness of the proposed variance reduction technique. Section 6.2 empirically investigates the generality of the theoretical results we develop, demonstrating appropriate calibration of the latent distribution even in scenarios where the assumptions of Theorem 5.2 do not hold. Section 6.3 demonstrates that the proposed approach is more effective than existing methods in solving the joint learning task. As a last experiment, in Appendix C.5 we test our approach and show that sensible graph distributions can be learned in real-world settings.

**Dataset and models** Consider data-generating process (1) with latent distribution $P^*_A = P^{\theta^*}_A$ producing $N$-node adjacency matrices. Random graph $A \sim P^*_A$ is given as a set of independent edges $(i, j)$, for $i, j = 1, \dots, N$, each of which is sampled with probability $\theta^*_{i,j}$. Function $f^* = f_{\psi^*}$ is a generic GNN with node-level readout, i.e., $f_{\psi^*}(\cdot, A): \mathbb{R}^{N \times d_{in}} \to \mathbb{R}^{N \times d_{out}}$. The components of $\theta^*$ are set to either 0 or 3/4 according to the pattern depicted in Figure 1; additional specifics are detailed in Appendix C.
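The latent graph model just described, independent Bernoulli edges with success probabilities $\theta^*_{ij}$, together with the log-likelihood $\log P^\theta_A(A)$ needed by the score-function estimator, can be sketched as follows. The specific $\theta^*$ pattern below is illustrative and only mimics the community structure of Figure 1.

```python
import torch

def sample_adjacency(theta, n_samples):
    """Sample n_samples adjacency matrices with independent edges A_ij ~ Bernoulli(theta_ij)."""
    return torch.bernoulli(theta.expand(n_samples, -1, -1))

def log_likelihood(adj, theta, eps=1e-8):
    """log P_A^theta(A) = sum_ij [A_ij log theta_ij + (1 - A_ij) log(1 - theta_ij)]."""
    theta = theta.clamp(eps, 1 - eps)
    return (adj * theta.log() + (1 - adj) * (1 - theta).log()).sum(dim=(-2, -1))

# Illustrative ground-truth parameters: entries are either 0 or 3/4, no self-loops.
N = 6
theta_star = torch.zeros(N, N)
theta_star[:3, :3] = 0.75
theta_star.fill_diagonal_(0.0)
adjs = sample_adjacency(theta_star, n_samples=4)
```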
This results in a dataset of 35k input-output pairs $(x, y)$, 80% of which are used as training set, 10% as validation set, and the remaining 10% as test set. As predictive model family (2), we follow the same architecture of $f_{\psi^*}$ and $P^{\theta^*}_A$, ensuring that Assumption 3.1 is fulfilled throughout the experiments; similar models have been used in the literature (Franceschi et al., 2019; Elinas et al., 2020; Kazi et al., 2022; Cini et al., 2023). In Section 6.2 we test the method's validity beyond this assumption. The model parameters are trained by optimizing the expected squared MMD in (9) with the rational quadratic kernel (Bińkowski et al., 2018).

### 6.1. Graph Structure Learning & Optimal Point Predictions

To test our method's ability to both calibrate the latent distribution and make optimal predictions, we train the model minimizing $\mathcal{L}_{dist}$ as described in Section 5.2. Figure 2 reports the validation losses during training: the MMD loss $\mathcal{L}_{dist}$ as in (6), the MAE between the learned parameters $\theta$ and the ground truth $\theta^*$ as $\mathcal{L}_{cal}$ (5), and the point-prediction loss $\mathcal{L}_{point}$ as in (4) with $\ell$ being the MSE. The results are averaged over 8 different model initializations and error bars report ±1 standard deviation from the mean. Results are reported with and without applying the variance reduction (Section 5.3), by training only parameters $\theta$ while freezing $\psi$ to $\psi^*$ (the same setting of Theorem 5.2), and by jointly training both $\psi$ and $\theta$.

**Solving the joint learning problem** Figure 2a shows that the training succeeded and the MMD loss $\mathcal{L}_{dist}$ converged to its minimum value.⁴ Having minimized $\mathcal{L}_{dist}$, from Figure 2b we see that the calibration of the latent distribution $P^\theta_A$ was also successful; in particular, the figure shows that the MAE on the $\theta$ parameters (i.e., $N^{-2}\|\theta - \theta^*\|_1$) approaches zero as training proceeds (MAE < 0.01). Regarding the point predictions, Figure 2c confirms that $\mathcal{L}_{point}$ reached its minimum value; recall that the optimal prediction MSE is not 0, as the target variable $y$ is random, and note that a learning rate reduction is applied at epoch 5. The optimality of the point prediction is also supported by the performance on separate test data and with respect to the MSE as point-prediction loss $\ell$. Moreover, we observe that calibration is achieved regardless of the variance reduction and of whether parameters $\psi$ are trained. Lastly, Figure 5 in Appendix C.2 shows the learned parameters $\theta$ of the latent distribution and the corresponding absolute discrepancy resulting from a (randomly chosen) training run.

⁴Numerical estimation shows that the minimum value of $\mathcal{L}_{dist}$ for the given kernel is approximately -0.088; note that although $\mathrm{MMD}^2 \ge 0$, the third term in (8) is dropped from (9).

Figure 1. Adjacency matrices sampled from $P^*_A = P^{\theta^*}_A$ for the experiment of Section 6 are subgraphs of the top graph; in this picture, 3 communities of an arbitrarily large graph are shown. Each edge in orange is independently sampled with probability $\theta^*_{ij}$; parameters $\theta^*_{ij}$ defining the edge probabilities are represented at the bottom for a two-community graph.

**Variance reduction effectiveness** Figures 2a, 2b and 2c demonstrate that the proposed variance reduction method (Section 5.3) yields a notable advantage in training speed (roughly 50% faster). For this reason, the next experiments rely on variance reduction.

**Larger graphs** The theoretical results developed hold for any number of nodes $N$.
However, the number of possible edges scales quadratically in the number of nodes, a potential issue inherent to the GSL problem, not to our approach. Therefore, for extremely large graphs, the ratio between the number of free parameters in $\theta$ and the size of the training set can become prohibitive. Nonetheless, in Figure 7 we show that all 15K parameters of the considered $P^\theta_A$ can be effectively learned even for relatively large graphs; the final MAE on the $\theta$ parameters is 0.003.

Table 1. Calibration of $P^\theta_A$ under varying levels of misconfiguration for predictive function $f_\psi$. Results are the mean ± 1 standard deviation assessed over 8 independent runs.

| Max pert. Ψ | MAE on θ | Max AE on θ |
|---|---|---|
| 0 | 0.009 ± 0.001 | 0.06 ± 0.01 |
| 0.1 | 0.010 ± 0.001 | 0.07 ± 0.01 |
| 0.2 | 0.012 ± 0.004 | 0.08 ± 0.02 |
| 0.5 | 0.028 ± 0.011 | 0.16 ± 0.06 |
| 0.8 | 0.047 ± 0.009 | 0.28 ± 0.06 |

### 6.2. Beyond Assumption 3.1

In this section, we empirically study whether Assumption 3.1 is restrictive in practical applications. Specifically, we consider different degrees of model mismatch between the system model in (1) and the approximating model in (2). Unless otherwise specified, we use the same dataset and experimental setup as described in Appendix C.1. Additional details and results are deferred to Appendix C.3.

**Perturbed $f_\psi$** As a first experiment, we train $P^\theta_A$ while keeping the parameters of the predictive function $f_\psi$ fixed to a random perturbation of the data-generating model $f^* = f_{\psi^*}$. A perturbed version of $f_{\psi^*}$ is built by uniformly drawing independent perturbation scalar values $\delta_i \sim \mathcal{U}[-\Psi, \Psi]$, one for each parameter $\psi^*_i$ of $f_{\psi^*}$. Then, each parameter of GNN $f_\psi$ is given as $\psi_i = (1 + \delta_i)\psi^*_i$. Table 1 shows that the learned latent distribution remains reasonably calibrated. Finally, Figures 8-11 show the learned parameter vectors $\theta$ for randomly extracted runs and highlight that the maximum AE of Table 1 is observed only sporadically.

**Generic GNN as $f_\psi$** In this second experiment, we set $f_\psi$ to be a generic multilayer GNN which we jointly train with graph distribution $P^\theta_A$. The model family $\{f_\psi\}$ employed does not include $f^*$, as $f^*$ uses L-hop adjacency matrices generated from the sampled adjacency matrix $A$, while $f_\psi$ relies on multiple nonlinear 1-hop layers; additional details are reported in Appendix C.3. Upon convergence, models achieved $\mathcal{L}_{point} < 0.19$ using the MSE as loss function $\ell$ in (4); the performance is in line with the results in Figure 2c. Finally, note that as the GNN used adds self-loops, the diagonal elements of the adjacency matrix are learned as zero, resulting in a larger MAE on $\theta$ (see Figure 12). However, this does not impair learning the off-diagonal $\theta_{ij}$ parameters (i.e., for $i \neq j$). Notably, in the worst-performing model, these off-diagonal parameters achieve a MAE of less than 0.03, effectively calibrating the latent distribution.

Figure 2. Validation losses $\mathcal{L}_{dist}$, $\mathcal{L}_{cal}$ and $\mathcal{L}_{point}$ during training. At epoch 5, the learning rate is decreased to ensure convergence. $\mathcal{L}_{dist}$ in Subfigure 2a is negative as the third term in (8) is constant and not considered.

Table 2. Calibration and point-prediction performance of models trained by minimizing different loss functions. Losses $\mathcal{L}_{dist}$ follow the approach proposed in this paper. Bold numbers indicate the best-performing models (p-value of Welch's t-test < 0.01).
| Train loss | MAE on θ | MAE on y | MSE on y |
|---|---|---|---|
| $\mathcal{L}^{literature}_{1}$, ℓ: MAE | .087 ± .001 | .270 ± .003 | .180 ± .003 |
| $\mathcal{L}^{literature}_{1}$, ℓ: MSE | .087 ± .001 | .293 ± .001 | .161 ± .002 |
| $\mathcal{L}^{literature}_{2}$, ℓ: MAE | .086 ± .001 | .270 ± .002 | .176 ± .002 |
| $\mathcal{L}^{literature}_{2}$, ℓ: MSE | .085 ± .001 | .295 ± .001 | .161 ± .002 |
| $\mathcal{L}^{literature}_{elbo}$ | .082 ± .001 | .310 ± .010 | .191 ± .020 |
| $\mathcal{L}_{point}$, ℓ: MSE | .025 ± .001 | .271 ± .003 | .161 ± .002 |
| $\mathcal{L}_{dist}$, d: CRPS | .010 ± .002 | .269 ± .001 | .159 ± .001 |
| $\mathcal{L}_{dist}$, d: MMD | .009 ± .001 | .269 ± .001 | .159 ± .001 |

**Misconfigured $P^\theta_A$** Finally, we violate Assumption 3.1 by fixing $f_\psi = f^*$ and constraining some components of $\theta$ to incorrect values. Specifically, we force parameters $\theta_{i,j}$ for all edges $(i, j)$ associated with nodes with id 2 and 3 in Figure 1 to 0.25, instead of the correct value of $\theta^*_{i,j} = 0.75$ as in $P^*_A$. Results indicate that the free parameters in $\theta$ are learned appropriately. Notably, increased uncertainty is observed for spurious edges linking to nodes in the first node community (see Figure 1). This is expected given that nearly 60% of the edges in the community were significantly downsampled. Figures 13 and 14 in Appendix C.3 show the learned parameters from randomly selected runs.

### 6.3. Comparison of Loss Functions

As a final experiment, we empirically demonstrate that the proposed choice of loss functions (6) is more effective at calibrating the latent graph distribution, while maintaining, or sometimes improving, point-prediction performance compared to other commonly used loss functions.

**Considered loss functions** Following our approach, we consider two distributional losses, one based on the MMD ($\mathcal{L}_{dist}$, $d$: MMD) and one based on the energy distance⁵ ($\mathcal{L}_{dist}$, $d$: CRPS). As point-prediction loss, we use $\mathcal{L}_{point}$ with $\ell$: MSE, defined in (4) based on the mean squared error. Additionally, we consider three families of losses used in the GSL literature. The first one, defined as

$$\mathcal{L}^{literature}_{1,\ell} = \mathbb{E}_{x,y}\Big[\mathbb{E}_{A \sim P^\theta_A}\big[\ell\big(f_\psi(x, A), y\big)\big]\Big], \tag{14}$$

has been employed, e.g., in Franceschi et al. (2019) and Cini et al. (2023). Note that, differently from $\mathcal{L}_{point}$ with $\ell$: MSE, the expectation over $A$ is taken outside function $\ell$. The second family, denoted as $\mathcal{L}^{literature}_{2,\ell}$, is inspired by Kazi et al. (2022); it refines $\mathcal{L}^{literature}_{1,\ell}$ by focusing its optimization on node-level predictions; further details follow in Appendix C.4. For $\mathcal{L}^{literature}_{1,\ell}$ and $\mathcal{L}^{literature}_{2,\ell}$, we use both MAE and MSE as $\ell$. Finally, we adapt the loss function used in Elinas et al. (2020) to the synthetic regression task:

$$\mathcal{L}^{literature}_{elbo} = \mathbb{E}_{x,y}\Big[\mathbb{E}_{A \sim P^\theta_A}\big[-\log P^\psi_{y|x,A}(y)\big]\Big] + \mathrm{KL}\big(P^\theta_A(A) \,\|\, P_A(A)\big), \tag{15}$$

where $P_A(A)$ is a prior distribution and $P^\psi_{y|x,A}(y)$ is a Gaussian distribution whose mean vector is determined for each node by the GNN output and whose standard deviation is set as a hyperparameter. We explored different standard deviations and choices of the prior; details can be found in Appendix C.4.

⁵By following Section 5.2, the energy distance reduces to the well-known Continuous Ranked Probability Score (CRPS) (Gneiting & Raftery, 2007).

⁶Numerical estimates suggest that the ground-truth optimal MAE and MSE achievable by a predictor are approximately 0.267 and 0.158, respectively.

**Results on point prediction** Table 2 shows that models trained with $\mathcal{L}^{literature}_{1,\ell}$, $\mathcal{L}^{literature}_{2,\ell}$ and $\mathcal{L}_{point}$ achieve near-optimal⁶ point predictions according to their respective performance metric (MAE or MSE).
Namely, optimizing $\mathcal{L}^{literature}_{1}$ and $\mathcal{L}^{literature}_{2}$ with $\ell$: MAE leads to minimal MAE, but not to minimal MSE; similarly, optimizing $\mathcal{L}^{literature}_{1}$ and $\mathcal{L}^{literature}_{2}$ with $\ell$: MSE results in minimal MSE. Conversely, predictors trained with either $\mathcal{L}_{dist}$ ($d$: MMD) or $\mathcal{L}_{dist}$ ($d$: CRPS) achieve optimal prediction performance for both metrics. Interestingly, $\mathcal{L}_{point}$ ($\ell$: MSE) also leads to near-optimal predictions in terms of both MAE and MSE. We attribute the superiority of $\mathcal{L}_{point}$ over $\mathcal{L}^{literature}_{1,\ell}$ and $\mathcal{L}^{literature}_{2,\ell}$ to the use of functional $T$ in (3), which enables a more accurate probabilistic modeling of the data-generating process. A similar observation holds for the calibration error, discussed in the next paragraph.

**Results on calibration** Optimizing the proposed losses ($\mathcal{L}_{dist}$ with $d$: MMD or $d$: CRPS) yields the smallest calibration errors, as measured by the MAE of the latent distribution parameters $\theta$ in $P^\theta_A$. In contrast, loss functions commonly used in the literature result in statistically worse calibration performance. Notably, while the point-prediction loss $\mathcal{L}_{point}$ ($\ell$: MSE) outperforms $\mathcal{L}^{literature}_{1}$ and $\mathcal{L}^{literature}_{2}$ in terms of calibration error, it remains statistically inferior to the proposed distributional loss $\mathcal{L}_{dist}$. We conclude that the proposed approach was the only one capable of effectively solving the joint learning problem.

## 7. Conclusions

Graph structure learning has emerged as a research field focused on learning graph topologies in support of solving downstream predictive tasks. Assuming stochastic latent graph structures, we are led to a joint optimization objective: (i) accurately learning the distribution of the latent topology while (ii) achieving optimal prediction performance on the downstream task. In this paper, at first, we prove both positive and negative theoretical results to demonstrate that appropriate loss functions must be chosen to solve this joint learning problem. Second, we propose a sampling-based learning method that does not require the computation of the predictive likelihood. Our empirical results demonstrate that this approach achieves optimal point predictions on the considered downstream task while also yielding calibrated latent graph distributions.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgments

This work was supported by the Swiss National Science Foundation project FNS 204061: HORD GNN: Higher-Order Relations and Dynamics in Graph Neural Networks, and partly supported by the International Partnership Program of the Chinese Academy of Sciences under Grant 104GJHZ2022013GC.

## References

Bartler, A., Wiewel, F., Mauch, L., and Yang, B. Training variational autoencoders with discrete latent variables using importance sampling. In 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1-5. IEEE, 2019. Berger, J. O. Statistical decision theory. In Time Series and Statistics, pp. 277-284. Springer, 1990. Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. Cini, A., Zambon, D., and Alippi, C. Sparse graph learning from spatiotemporal time series. Journal of Machine Learning Research, 24:1-36, 2023. Connor, M., Canal, G., and Rozell, C. Variational autoencoder with learned latent structure. In International Conference on Artificial Intelligence and Statistics, pp. 2359-2367. PMLR, 2021.
Coutino, M., Isufi, E., Maehara, T., and Leus, G. State-space network topology identification from partial observations. IEEE Transactions on Signal and Information Processing over Networks, 6:211 225, 2020. De Felice, G., Cini, A., Zambon, D., Gusev, V., and Alippi, C. Graph-based Virtual Sensing from Sparse and Partial Multivariate Observations. In The Twelfth International Conference on Learning Representations, 2024. Deleu, T., G ois, A., Emezue, C., Rankawat, M., Lacoste Julien, S., Bauer, S., and Bengio, Y. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pp. 518 528. PMLR, 2022. Dong, X., Thanou, D., Frossard, P., and Vandergheynst, P. Learning laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):6160 6173, 2016. Elinas, P., Bonilla, E. V., and Tiao, L. Variational inference for graph convolutional networks in the absence of graph data and adversarial settings. Advances in Neural Information Processing Systems, 33:18648 18660, 2020. Learning Latent Graph Structures and their Uncertainty Fatemi, B., El Asri, L., and Kazemi, S. M. Slaps: Selfsupervision improves structure learning for graph neural networks. Advances in Neural Information Processing Systems, 34:22667 22681, 2021. Fey, M. and Lenssen, J. E. Fast graph representation learning with pytorch geometric. ar Xiv preprint ar Xiv:1903.02428, 2019. Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. Protein interface prediction using graph convolutional networks. Advances in neural information processing systems, 30, 2017. Franceschi, L., Niepert, M., Pontil, M., and He, X. Learning discrete structures for graph neural networks. In International conference on machine learning, pp. 1972 1982. PMLR, 2019. Gneiting, T. Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 106 (494):746 762, June 2011. ISSN 0162-1459. doi: 10.1198/jasa.2011.r10138. Gneiting, T. and Raftery, A. E. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477):359 378, 2007. ISSN 0162-1459. Gomez Rodriguez, M., Leskovec, J., and Sch olkopf, B. Structure and dynamics of information pathways in online media. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 23 32, 2013. Gray, C., Mitchell, L., and Roughan, M. Bayesian inference of network structure from information cascades. IEEE Transactions on Signal and Information Processing over Networks, 6:371 381, 2020. Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723 773, 2012. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International conference on machine learning, pp. 1321 1330. PMLR, 2017. Hadjeres, G., Nielsen, F., and Pachet, F. Glsr-vae: Geodesic latent space regularization for variational autoencoder architectures. In 2017 IEEE symposium series on computational intelligence (SSCI), pp. 1 7. IEEE, 2017. Harris, C. R., Millman, K. J., Van Der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., et al. Array programming with numpy. Nature, 585(7825):357 362, 2020. Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(03):90 95, 2007. Jiang, B., Zhang, Z., Lin, D., Tang, J., and Luo, B. 
Semisupervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11313 11320, 2019. Joo, W., Lee, W., Park, S., and Moon, I.-C. Dirichlet variational autoencoder. Pattern Recognition, 107:107514, 2020. Kalofolias, V. How to learn a graph from smooth signals. In Artificial intelligence and statistics, pp. 920 929. PMLR, 2016. Kazi, A., Cosmo, L., Ahmadi, S.-A., Navab, N., and Bronstein, M. M. Differentiable graph module (dgm) for graph convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1606 1617, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Kipf, T., Fetaya, E., Wang, K.-C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International conference on machine learning, pp. 2688 2697. PMLR, 2018. Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In International conference on machine learning, pp. 2796 2804. PMLR, 2018. Laves, M.-H., Ihler, S., Fast, J. F., Kahrs, L. A., and Ortmaier, T. Well-calibrated regression uncertainty in medical imaging with deep learning. In Medical imaging with deep learning, pp. 393 412. PMLR, 2020. Lokhov, A. Reconstructing parameters of spreading models from partial observations. Advances in Neural Information Processing Systems, 29, 2016. Mateos, G., Segarra, S., Marques, A. G., and Ribeiro, A. Connecting the dots: Identifying network structure via graph signal processing. IEEE Signal Processing Magazine, 36(3):16 43, 2019. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928 1937. PMLR, 2016. Learning Latent Graph Structures and their Uncertainty Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. ar Xiv preprint ar Xiv:1610.03483, 2016. Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte carlo gradient estimation in machine learning. The Journal of Machine Learning Research, 21(1):5183 5244, 2020. Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 4602 4609, 2019. Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P., and Dokania, P. Calibrating deep neural networks using focal loss. Advances in Neural Information Processing Systems, 33:15288 15299, 2020. M uller, A. Integral probability metrics and their generating classes of functions. Advances in applied probability, 29 (2):429 443, 1997. M uller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? Advances in neural information processing systems, 32, 2019. Niepert, M., Minervini, P., and Franceschi, L. Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions. In Advances in Neural Information Processing Systems, volume 34, pp. 14567 14579. Curran Associates, Inc., 2021. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. 
Advances in neural information processing systems, 32, 2019. Pu, X., Cao, T., Zhang, X., Dong, X., and Chen, S. Learning to learn graph topologies. Advances in Neural Information Processing Systems, 34:4249 4262, 2021. R enyi, A. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics, volume 4, pp. 547 562. University of California Press, 1961. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278 1286. PMLR, 2014. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE transactions on neural networks, 20(1):61 80, 2008. Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008. Shang, C., Chen, J., and Bi, J. Discrete graph structure learning for forecasting multiple time series. In International Conference on Learning Representations, 2021. Shlomi, J., Battaglia, P., and Vlimant, J.-R. Graph neural networks in particle physics. Machine Learning: Science and Technology, 2(2):021001, 2020. Sutton, R. S., Mc Allester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999. Sz ekely, G. J. and Rizzo, M. L. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249 1272, 2013. Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. Wasserman, M. and Mateos, G. Graph structure learning with interpretable bayesian neural networks. Transactions on machine learning research, 2024. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229 256, 1992. Wu, Z., Pan, S., Long, G., Jiang, J., and Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 1907 1913, 2019. Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., and Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 753 763, 2020. Xu, J. and Durrett, G. Spherical latent spaces for stable variational autoencoders. ar Xiv preprint ar Xiv:1808.10805, 2018. Yu, D., Zhang, R., Jiang, Z., Wu, Y., and Yang, Y. Graphrevised convolutional network. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14 18, 2020, Proceedings, Part III, pp. 378 393. Springer, 2021. Zadrozny, B. and Elkan, C. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 609 616, 2001. Learning Latent Graph Structures and their Uncertainty Zhang, Y., Pal, S., Coates, M., and Ustebay, D. Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 5829 5836, 2019. Zheng, Y., Liu, F., and Hsieh, H.-P. U-air: When urban air quality inference meets big data. 
In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1436-1444, 2013. Zhu, Y., Xu, W., Zhang, J., Liu, Q., Wu, S., and Wang, L. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036, 14:1-1, 2021.

## A. Proofs of the Theoretical Results

### A.1. Minimizing $\mathcal{L}_{point}$ does not guarantee calibration

In this section, we prove Proposition 4.1.

*Proof of Proposition 4.1.* Recall the definition of $\mathcal{L}_{point}$ in (4) using (3):

$$\mathcal{L}_{point}(\psi, \theta) = \mathbb{E}_x\Big[\mathbb{E}_{y \sim P^*_{y|x}}\big[\ell\big(y, T\big[P^{\theta,\psi}_{y|x}\big]\big)\big]\Big].$$

Given loss function $\ell$, $T$ is, by definition (Berger, 1990; Gneiting, 2011), the functional that minimizes $\mathbb{E}_{y \sim P^*_{y|x}}\big[\ell\big(y, T\big[P^*_{y|x}\big]\big)\big]$. Therefore, if $P^{\theta,\psi}_{y|x} = P^*_{y|x}$, then $\mathcal{L}_{point}$ is minimal. If another distribution over $y$, namely $P^{\theta',\psi'}_{y|x}$ parametrized by $\theta'$ and $\psi'$, satisfies $T\big[P^{\theta',\psi'}_{y|x}\big] = T\big[P^*_{y|x}\big]$ almost surely on $x$, then

$$\mathcal{L}_{point}(\theta', \psi') = \mathbb{E}_x\Big[\mathbb{E}_{y \sim P^*_{y|x}}\big[\ell\big(y, T\big[P^{\theta',\psi'}_{y|x}\big]\big)\big]\Big] = \mathbb{E}_x\Big[\mathbb{E}_{y \sim P^*_{y|x}}\big[\ell\big(y, T\big[P^*_{y|x}\big]\big)\big]\Big].$$

Thus, $P^{\theta',\psi'}_{y|x}$ also minimizes $\mathcal{L}_{point}$. Appendix A.2 discusses graph distributions as counterexamples where $T\big[P^{\theta',\psi'}_{y|x}\big] = T\big[P^*_{y|x}\big]$ but $P^{\theta',\psi'}_{y|x} \neq P^*_{y|x}$. By this, we conclude that reaching the minimum of $\mathcal{L}_{point}(\psi, \theta)$ does not always imply $P^{\theta,\psi}_{y|x} = P^*_{y|x}$.

### A.2. Minimizing $\mathcal{L}_{point}$ does not guarantee calibration: an example with MAE

This section shows that $\mathcal{L}_{point}$ equipped with the MAE as $\ell$ admits multiple global minima for different parameters $\theta$, even for simple models and $f_\psi = f^*$. Consider a single Bernoulli of parameter $\theta^* > 1/2$ as latent variable $A$, and a scalar function $f^*(x, A)$ such that $f^*(x, 1) > f^*(x, 0)$ for all $x$. Given input $x$, the value of functional $T\big[P^*_{y|x}\big]$ that minimizes

$$\mathbb{E}_{y \sim P^*_{y|x}}\Big[\big|y - T\big[P^*_{y|x}\big]\big|\Big] = \theta^*\,\big|f^*(x, 1) - T\big[P^*_{y|x}\big]\big| + (1 - \theta^*)\,\big|f^*(x, 0) - T\big[P^*_{y|x}\big]\big|$$

is $T\big[P^*_{y|x}\big] = f^*(x, 1)$; this derives from the fact that the range of $f^*$ is $\{f^*(x, 0), f^*(x, 1)\}$ and the likelihood of $f^*(x, 1)$ is larger than that of $f^*(x, 0)$. Note that $T\big[P^*_{y|x}\big] = f^*(x, 1)$ for all $x$, therefore $\mathcal{L}_{point}$ is also minimized by such $T$. Moreover, $T\big[P^{\theta,\psi^*}_{y|x}\big]$ is a function of $\theta$ and equal to $f^*(x, 1)$ for all $\theta > 1/2$. We conclude that for any $\theta \neq \theta^*$ with $\theta > 1/2$, distributions $P^{\theta,\psi^*}_{y|x}$ and $P^*_{y|x}$ are different, yet both of them minimize $\mathcal{L}_{point}$. A similar reasoning applies for $\theta^* < 1/2$.

### A.3. Minimizing $\mathcal{L}_{dist}$ guarantees calibration and optimal point predictions

This section proves Theorem 5.2 and a corollary of it.

*Proof of Theorem 5.2.* Recall from Equation (6) that

$$\mathcal{L}_{dist}(\theta, \psi) = \mathbb{E}_x\Big[d\big(P^*_{y|x}, P^{\theta,\psi}_{y|x}\big)\Big].$$

We start by proving that $\mathcal{L}_{dist}(\theta, \psi) = 0 \Rightarrow \mathcal{L}_{point}(\theta, \psi)$ is minimal. Note that $\mathcal{L}_{dist}(\theta, \psi) = 0$ implies that $d\big(P^*_{y|x}, P^{\theta,\psi}_{y|x}\big) = 0$ almost surely in $x$. Then, by Assumption 5.1, $P^*_{y|x} = P^{\theta,\psi}_{y|x}$ almost surely on $x$ and, in particular, $T\big[P^*_{y|x}\big] = T\big[P^{\theta,\psi}_{y|x}\big]$, which leads to $\mathcal{L}_{point}(\psi, \theta)$ being minimal (Proposition 4.1).

We now prove that $\mathcal{L}_{dist}(\theta, \psi^*) = 0 \Rightarrow \mathcal{L}_{cal}(\theta) = 0$. From the previous step, we have that $\mathcal{L}_{dist}(\theta, \psi^*) = 0$ implies $P^*_{y|x} = P^{\theta,\psi^*}_{y|x}$ almost surely for $x \in I$. Under the assumption that $f_{\psi^*} = f^*$ and given the injectivity of $f^*$ at such $x \in I$, for any output $y$ a single $A$ exists such that $f^*(x, A) = y$. Therefore, the probability mass function of $y$ equals that of $A$. Accordingly, $P^*_{y|x} = P^{\theta,\psi^*}_{y|x}$ implies $P^*_A = P^\theta_A$.

We also prove a corollary of Theorem 5.2.

**Corollary A.1.** Under Assumptions 3.1 and 5.1, if

1. $\exists\, \bar{x} \in \mathrm{Supp}(P_x) \subseteq \mathcal{X}$ such that $f^*(\bar{x}, \cdot)$ is injective,
2. $f^*(x, A)$ is continuous in $x$ for all $A \in \mathcal{A}$,

then

$$\mathcal{L}_{dist}(\theta, \psi^*) = 0 \;\Longrightarrow\; \begin{cases} \mathcal{L}_{point}(\theta, \psi^*) \text{ is minimal} \\ \mathcal{L}_{cal}(\theta) = 0. \end{cases}$$
The corollary shows that it is sufficient that $f^*$ is continuous in $x$ and that there exists one point $\bar{x}$ where $f^*(\bar{x}, \cdot)$ is injective to meet the theorem's hypothesis $\Pr_{x \sim P_x}(I) > 0$; we observe that, as $\mathcal{A}$ is discrete, the injectivity assumption is not as restrictive as it would be if the domain were continuous.

*Proof.* As $\mathcal{A}$ is a finite set, the minimum $\bar\epsilon = \min_{A \neq A' \in \mathcal{A}} \|f^*(\bar{x}, A) - f^*(\bar{x}, A')\|$ exists and, by the injectivity assumption, is strictly positive. By continuity of $f^*(\cdot, A)$, for every $\epsilon < \frac{1}{2}\bar\epsilon$ there exists $\delta$ such that for all $x \in B(\bar{x}, \delta)$ we have $\|f^*(\bar{x}, A) - f^*(x, A)\| < \epsilon$. It follows that, for all $x \in B(\bar{x}, \delta)$,

$$\|f^*(x, A) - f^*(x, A')\| \;\ge\; \|f^*(\bar{x}, A) - f^*(\bar{x}, A')\| - \|f^*(\bar{x}, A) - f^*(x, A)\| - \|f^*(\bar{x}, A') - f^*(x, A')\| \;\ge\; \|f^*(\bar{x}, A) - f^*(\bar{x}, A')\| - 2\epsilon \;\ge\; \bar\epsilon - 2\epsilon > 0,$$

where the second inequality holds by the continuity of $x \mapsto f^*(x, A)$ for every $A$. Finally, as $\bar{x} \in \mathrm{Supp}(P_x)$ and $B(\bar{x}, \delta) \subseteq I$, we conclude that $\Pr_{x \sim P_x}(I) \ge \Pr_{x \sim P_x}(B(\bar{x}, \delta)) > 0$; therefore, we are in the hypotheses of Theorem 5.2 and can conclude that

$$\mathcal{L}_{dist}(\theta, \psi^*) = 0 \;\Longrightarrow\; \begin{cases} \mathcal{L}_{point}(\theta, \psi^*) \text{ is minimal} \\ \mathcal{L}_{cal}(\theta) = 0. \end{cases}$$

### A.4. Injectivity hypothesis for graph neural networks

Now, we show that hypothesis $\Pr_{x \sim P_x}(I) > 0$ of Theorem 5.2 is always met for certain families of graph neural networks.

**Proposition A.2.** Consider a 1-layer GNN of the form $f^*(x, A): \sigma(Ax) = y$, with $x, y \in \mathbb{R}^N$ and nonlinear bijective activation function $\sigma$. If the support $\mathrm{Supp}(P_x)$ of $x$ contains any ball $B$ in $\mathbb{R}^N$, then $\Pr_{x \sim P_x}(I) > 0$.

To prove Proposition A.2, we rely on the following lemma.

**Lemma A.3.** Given $g(x, a) = ax$ with $a \in \{0, 1\}^{1 \times N}$ and $x \in \mathbb{R}^{N \times 1}$, let $I_g = \{x : a \mapsto g(x, a) \text{ is injective}\} \subseteq \mathcal{X}$ be the set of points $x \in \mathcal{X}$ such that the map $a \mapsto g(x, a)$ is injective. The following implication holds:

$$x \notin I_g \iff \exists\, \delta \neq 0 \in \{-1, 0, 1\}^{1 \times N} \text{ s.t. } \delta \perp x. \tag{16}$$

*Proof.* We prove the two implications separately.

($\Longrightarrow$) If $x \notin I_g$, then there exist $a', a'' \in \{0, 1\}^{1 \times N}$ with $a' \neq a''$ such that $a'x = a''x$. This implies that $(a' - a'')x = 0$. Defining $\delta$ as $(a' - a'')$, we have proven that there exists $\delta \neq 0 \in \{-1, 0, 1\}^{1 \times N}$ such that $\delta x = 0$, i.e., $\delta \perp x$.

($\Longleftarrow$) Assume that there exists $\delta \neq 0 \in \{-1, 0, 1\}^{1 \times N}$ such that $\delta \perp x$. Each component $\delta_i$ of $\delta$ can be written as the difference between two values $a'_i, a''_i \in \{0, 1\}$. As $\delta \neq 0$, there exists at least one index $j \in \{1, \dots, N\}$ such that $a'_j \neq a''_j$. This implies that there exist $a', a'' \in \{0, 1\}^{1 \times N}$ with $a' \neq a''$ s.t. $(a' - a'')x = 0$, which implies that $x \notin I_g$.

*Proof of Proposition A.2.* We begin by considering the projection $g(x, a) = ax$ with $a \in \{0, 1\}^{1 \times N}$ and $x \in \mathbb{R}^N$. Then we extend to $A \in \{0, 1\}^{N \times N}$ and to nonlinear functions. Let $I^C_g = \mathbb{R}^N \setminus I_g$ be the complement in $\mathbb{R}^N$ of $I_g$. Recalling Lemma A.3 and its notation, we have $3^N - 1$ possible $\delta$, defining a collection of $(3^N - 1)/2$ hyperplanes of vectors $x$ perpendicular to at least one $\delta$; the set $I^C_g$ is the union of such a finite collection of hyperplanes. By hypothesis, $\mathrm{Supp}(P_x)$ contains a ball $B \subset \mathbb{R}^N$, therefore $\mathrm{Supp}(P_x) \not\subseteq I^C_g$ and $\Pr_{x \sim P_x}(I^C_g) < 1$. We conclude that $\Pr_{x \sim P_x}(I_g) = 1 - \Pr_{x \sim P_x}(I^C_g) > 0$. A similar result is proven for $G(x, A) = Ax$ with $A \in \{0, 1\}^{N \times N}$. In fact, $G$ is a stack of $N$ functions $g$ as above and $I_G = I_g$. Finally, composing injective function $G$ with injective function $\sigma$ leads to the function $f^*(x, A) = \sigma(G(x, A))$ being injective in $A$ for the same points $x$ for which $G$ is injective, thus proving the proposition.
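A quick numerical illustration of Proposition A.2 (not part of the paper): for a randomly drawn $x$, the map $A \mapsto \sigma(Ax)$ with $\sigma = \tanh$ is injective over all binary adjacency matrices of a small graph, so such an $x$ lies in $I$ almost surely.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 3
x = rng.normal(size=N)  # a generic x; the hyperplanes delta @ x = 0 have measure zero

outputs = set()
for bits in itertools.product([0, 1], repeat=N * N):
    A = np.array(bits, dtype=float).reshape(N, N)
    y = np.tanh(A @ x)                    # 1-layer GNN with bijective activation sigma = tanh
    outputs.add(tuple(np.round(y, 12)))

# If A -> sigma(Ax) is injective at x, all 2^(N*N) outputs are distinct.
print(len(outputs) == 2 ** (N * N))       # expected: True for almost every x
```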
B. Estimation of Optimal $\beta_1$ and $\beta_2$

Here we show that, when reducing the variance of the SFE via control variates in (12), the best $\beta_1$ and $\beta_2$ can be approximated by
$$\beta_1 = \mathbb{E}_{\substack{x \sim P_x \\ A_1, A_2 \sim P^\theta_A}}\!\Big[ \kappa\big(f_\psi(x, A_1), f_\psi(x, A_2)\big) \Big], \qquad \beta_2 = \mathbb{E}_{\substack{(x, y^\star) \sim P_{x,y^\star} \\ A \sim P^\theta_A}}\!\Big[ \kappa\big(y^\star, f_\psi(x, A)\big) \Big]. \tag{17}$$

Consider a generic function $L(A)$ depending on a sample $A$ from a parametric distribution $P^\theta_A(A)$, and the surrogate gradient associated with (11), i.e.,
$$G(A) = L(A)\, \nabla_\theta \log P^\theta(A) - \beta \Big( h(A) - \mathbb{E}_{A \sim P^\theta}[h(A)] \Big). \tag{18}$$
Following the existing literature (Sutton et al., 1999; Mnih et al., 2016), where $\beta$ is often referred to as a baseline, we set $h(A) = \nabla_\theta \log P^\theta(A)$; note that $\mathbb{E}_{A \sim P^\theta}[\nabla_\theta \log P^\theta(A)] = 0$, so the control variate leaves the estimator unbiased. The 1-sample Monte Carlo approximation of the gradient becomes
$$\nabla_\theta\, \mathbb{E}_{A \sim P^\theta}[L(A)] \approx G(A') = \big(L(A') - \beta\big)\, \nabla_\theta \log P^\theta(A'), \tag{19}$$
with $A'$ sampled from $P^\theta_A$. The variance of the estimator is
$$\mathbb{V}_{A \sim P^\theta}\Big[\big(L(A) - \beta\big) \nabla_\theta \log P^\theta(A)\Big] = \mathbb{V}_{A \sim P^\theta}\Big[L(A)\, \nabla_\theta \log P^\theta(A)\Big] + \beta^2\, \mathbb{E}_{A \sim P^\theta}\Big[\big(\nabla_\theta \log P^\theta(A)\big)^2\Big] - 2\beta\, \mathbb{E}_{A \sim P^\theta}\Big[L(A) \big(\nabla_\theta \log P^\theta(A)\big)^2\Big], \tag{20}$$
and the value of $\beta$ that minimizes it is
$$\beta = \frac{\mathbb{E}_{A \sim P^\theta}\Big[L(A) \big(\nabla_\theta \log P^\theta(A)\big)^2\Big]}{\mathbb{E}_{A \sim P^\theta}\Big[\big(\nabla_\theta \log P^\theta(A)\big)^2\Big]}. \tag{21}$$
If we approximate the numerator with $\mathbb{E}[L(A)]\, \mathbb{E}\big[\big(\nabla_\theta \log P^\theta(A)\big)^2\big]$, we obtain $\beta \approx \mathbb{E}[L(A)]$. By substituting $L(A)$ with the two terms of (10), we obtain the values of $\beta_1$ and $\beta_2$ in (17). We experimentally validate the effectiveness of this choice of $\beta$ in Section 6.

C. Further Experimental Details

C.1. Dataset description and models

In this section, we describe the considered synthetic dataset, generated from the system model (1). The latent graph distribution $P^\star_A$ is a multivariate Bernoulli distribution of parameters $\theta^\star_{ij}$:
$$A \sim P_{\theta^\star}(A) = \prod_{ij} \big(\theta^\star_{ij}\big)^{A_{ij}} \big(1 - \theta^\star_{ij}\big)^{1 - A_{ij}}.$$
The components of $\theta^\star$ are all null, except for the edges of the graph depicted in Figure 3, which are set to 3/4. A heatmap of the adjacency matrix can be found in Figure 4.

Figure 3. The adjacency matrices used in this paper are sampled from this graph. Each edge in orange is independently sampled with probability $\theta^\star$. In the picture, 3 communities of an arbitrarily large graph are shown.

Figure 4. $\theta^\star_{ij}$ parameters for each edge of the latent adjacency matrix. Each square corresponds to an edge, and the number inside is the probability of sampling that edge for each prediction.

Table 3. Parameters used to generate the synthetic dataset.
Parameter        Values
$\psi^\star_1$   [0.3, 0.2, 0.1, 0.2]
$\psi^\star_2$   [-0.3, 0.1, 0.2, 0.1]

Regarding the GNN function $f^\star$, we use the following system model:
$$y = f_{\psi^\star}(A, x) = \tanh\!\left( \sum_{l} \mathbb{1}\big[A^l \neq 0\big]\, x\, \psi^\star_l \right),$$
where $\mathbb{1}[\cdot]$ is the element-wise indicator function ($\mathbb{1}[a] = 1$ if and only if $a$ is true), $x \in \mathbb{R}^{N \times d_{in}}$ are randomly generated inputs with $x \sim \mathcal{N}(0, \sigma_x^2 I)$, and $\psi^\star_l \in \mathbb{R}^{d_{out} \times d_{in}}$ are part of the system model parameters. We summarize the parameters considered in our experiment in Table 3.

The approximating model family (2) used in the experiment is the same as the data-generating process, with all components of the parameter vectors $\theta$ and $\psi$ being trainable. The squared MMD discrepancy is defined over the Rational Quadratic kernel (Bińkowski et al., 2018)
$$\kappa(y, y') = \left(1 + \frac{\| y - y' \|_2^2}{2 \alpha \sigma^2}\right)^{-\alpha}$$
with hyper-parameters $\sigma = 0.04$ and $\alpha = 0.5$ tuned on the validation set. The model is trained using the Adam optimizer (Kingma & Ba, 2014) with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.99$. Where not specified, the learning rate is set to 0.05 and decreased to 0.01 after 5 epochs. We grouped data points into batches of size 128. The initial values of $\theta$ are independently sampled from the uniform distribution $\mathcal{U}(0.0, 0.1)$.
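For illustration, the following NumPy sketch shows one possible implementation of this data-generating process. It is not the released code: the community graph of Figure 3 is replaced by a placeholder random edge-probability matrix, $N$ and $\sigma_x$ are arbitrary choices, and we assume that $A^l$ denotes the $l$-th matrix power of $A$ and that the sum runs over $l \in \{1, 2\}$, matching the two parameter vectors of Table 3.

```python
# Sketch of the synthetic data-generating process of Appendix C.1 (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)
N, d_in = 30, 4
# psi*_1 and psi*_2 from Table 3; sigma_x and N are illustrative choices
psi_star = np.array([[0.3, 0.2, 0.1, 0.2],
                     [-0.3, 0.1, 0.2, 0.1]])
sigma_x = 1.0

# Placeholder for theta*: a sparse set of edges with probability 3/4
theta_star = np.zeros((N, N))
edge_idx = rng.choice(N * N, size=3 * N, replace=False)
theta_star.flat[edge_idx] = 0.75

def sample_pair():
    """Draw one (x, y*) pair from the assumed system model."""
    A = (rng.random((N, N)) < theta_star).astype(float)   # A ~ multivariate Bernoulli(theta*)
    x = sigma_x * rng.standard_normal((N, d_in))           # x ~ N(0, sigma_x^2 I)
    out = np.zeros(N)
    A_power = np.eye(N)
    for l in range(psi_star.shape[0]):
        A_power = A_power @ A                               # A^(l+1)
        mask = (A_power != 0).astype(float)                 # element-wise 1[A^l != 0]
        out += mask @ (x @ psi_star[l])                     # 1[A^l != 0] x psi*_l
    return x, np.tanh(out)

x, y = sample_pair()
print(x.shape, y.shape)  # (30, 4) (30,)
```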
C.2. Additional details on the experiments of Section 6.1

We present here additional figures discussed in Section 6.1. Figure 5 reports the values of the learned parameters $\theta$, while Figure 6 reports the absolute discrepancy from $\theta^\star$. Figure 7 reports the values of the learned parameters $\theta$ when considering a graph of 120 nodes.

Figure 5. The learned parameters of the latent distribution corresponding to the stochastic adjacency matrix.

Figure 6. Absolute error made on the parameters of the latent distribution.

Figure 7. Learned $\theta$ parameters for a graph with 15K possible edges.

C.3. Additional details on the experiments of Section 6.2

We present here additional details discussed in Section 6.2.

Fixed perturbed $f_\psi$. The figures in this paragraph correspond to the experiment where the processing function $f_\psi$ is fixed to a perturbed version of $f^\star$. Figures 8–11 correspond to runs with an increasing perturbation factor $\Psi$.

Figure 8. Learned $\theta_{ij}$ parameters (a) and Absolute Error (b) for a maximum perturbation factor $\Psi$ of 10%.

Figure 9. Learned $\theta_{ij}$ parameters (a) and Absolute Error (b) for a maximum perturbation factor $\Psi$ of 20%.

Figure 10. Learned $\theta_{ij}$ parameters (a) and Absolute Error (b) for a maximum perturbation factor $\Psi$ of 50%.

Figure 11. Learned $\theta_{ij}$ parameters (a) and Absolute Error (b) for a maximum perturbation factor $\Psi$ of 80%.

Table 4. Network configurations and corresponding convergence results.
Layer dimensions     Convergence
[4, 1]               ✗
[4, 1, 1]            ✗
[4, 2, 1]            ✓
[4, 8, 1]            ✓
[4, 8, 2, 1]         ✓
[4, 16, 8, 1]        ✓
[4, 32, 8, 1]        ✓
[4, 64, 8, 1]        ✓
[4, 64, 16, 1]       ✓
[4, 64, 32, 1]       ✓
[4, 8, 8, 4, 1]      ✓

Generic GNN as $f_\psi$. To evaluate our approach in a more realistic setting, we use a generic GNN as $f_\psi$. Specifically, we implement GNNs from Morris et al. (2019) with varying numbers of layers and layer sizes (a minimal example configuration is sketched at the end of this subsection). It is important to note that the GNN implementation includes self-loops, which prevents the diagonal elements from being correctly learned. However, this does not impede our method from learning the remaining edges accurately. Table 4 presents the network configurations and whether they successfully converged to the ground-truth distribution. Since the diagonal elements artificially inflate the MAE for $\theta$, we consider a model to have converged if the final MAE on $\theta$ is less than 0.11. Most of the models successfully converged, except those with high bias. This demonstrates that our method is effective even beyond Assumption 3.1. In Figure 12 we show the learned parameters of $P^\theta_A$ for a randomly extracted run.

Misconfigured $P^\theta_A$. Figures 13 and 14 correspond to the experiment where some $\theta_{ij}$ values of $P^\theta_A$ are fixed at incorrect values, while the processing function $f_\psi$ is fixed to the true one. In the community affected by the perturbation, the free $\theta_{ij}$ values tend to be sampled more frequently to compensate for the downsampling imposed by the perturbation. Interestingly, all the edges with at least one endpoint in the second community (75% of the edges) appear unaffected by the perturbation.

Figure 12. (a) Learned $\theta_{ij}$ parameters when the parametric processing function $f_\psi$ is a generic GNN as presented in Morris et al. (2019) and (b) Absolute Error made with respect to the true parameters $\theta^\star_{ij}$. As self-loops are deterministically added by the network, the diagonal elements should not be considered.

Figure 13. Learned $\theta_{ij}$ parameters (a) and Absolute Error (b) for misconfigured $P^\theta_A$.

Figure 14. Learned $\theta_{ij}$ parameters (a) and Absolute Error (b) for misconfigured $P^\theta_A$.
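Referring to the generic-GNN experiments above, the following PyTorch Geometric sketch shows one way a configuration from Table 4 (layer dimensions [4, 8, 1]) could be instantiated with the GraphConv operator of Morris et al. (2019). The activation function, module structure, and graph sizes are our illustrative assumptions, not the exact architecture used in the experiments.

```python
# A possible instantiation of a generic GNN f_psi with layer dimensions [4, 8, 1]
# (Table 4), sketched with PyTorch Geometric's GraphConv (Morris et al., 2019).
import torch
from torch_geometric.nn import GraphConv

class GenericGNN(torch.nn.Module):
    def __init__(self, dims=(4, 8, 1)):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            [GraphConv(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )

    def forward(self, x, edge_index):
        # x: [N, d_in] node features; edge_index: [2, E] edges of a sampled graph A
        for i, conv in enumerate(self.convs):
            x = conv(x, edge_index)
            if i < len(self.convs) - 1:
                x = torch.relu(x)
        return x

# Example usage on a random graph (placeholder sizes)
model = GenericGNN()
x = torch.randn(30, 4)
edge_index = torch.randint(0, 30, (2, 60))
y_hat = model(x, edge_index)   # shape [30, 1]
```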
C.4. Additional details on the experiments of Section 6.3

All models have been trained using the SFE. $\mathcal{L}_{\mathrm{point}}$ relies on the following gradient with respect to $\theta$:
$$\nabla_\theta \mathcal{L}_{\mathrm{point}} = \mathbb{E}_{x, y^\star}\Big[ 2\big(\mathbb{E}_A[\hat{y}] - y^\star\big)\, \mathbb{E}_A\big[\hat{y}\, \nabla_\theta \log P^\theta_A(A)\big] \Big]$$
(a Monte Carlo sketch of this estimator is reported at the end of the appendix). For $\mathcal{L}^{\mathrm{literature}}_1$, the gradient is rewritten as
$$\nabla_\theta \mathcal{L}^{\mathrm{literature}}_{1,\ell} = \mathbb{E}_{x, y^\star, A}\Big[ \big(\ell(\hat{y}, y^\star) - b\big)\, \nabla_\theta \log P^\theta_A(A) \Big],$$
with $b$ estimating the expected value $\mathbb{E}_{x, y^\star, A}[\ell(\hat{y}, y^\star)]$. The second family of loss functions, $\mathcal{L}^{\mathrm{literature}}_2$, focuses on node-level predictions, rewriting $\ell(\hat{y}, y^\star)$ as the mean $\frac{1}{N}\sum_{i=1}^{N} \ell(\hat{y}_i, y^\star_i)$ of the prediction errors at each node $i$. The gradient with respect to $\theta$ is then written as
$$\nabla_\theta \mathcal{L}^{\mathrm{literature}}_{2,\ell} = \mathbb{E}_{x, y^\star}\, \mathbb{E}_{A \sim P^\theta_A}\left[ \frac{1}{N} \sum_{i=1}^{N} \big(\ell(\hat{y}_i, y^\star_i) - b_i\big)\, \nabla_\theta \log P^\theta_A(A_{i,:}) \right],$$
where the $b_i$ are computed as moving averages of $\ell(\hat{y}_i, y^\star_i)$.

The last family of loss functions (i.e., $\mathcal{L}^{\mathrm{literature}}_{\mathrm{elbo}}$) requires setting (i) a prior distribution $P_A(A)$ and (ii) a standard deviation for $P^\psi_{y|x,A}(y)$. We consider the following priors and standard deviations, selecting the combination with the lowest validation loss. (i) For the prior distributions, we assume that each edge is sampled independently according to a Bernoulli distribution and consider three different specifications: the first prior is a Bernoulli distribution with parameter $p = 0.01$ for all edges; the second prior is a Bernoulli distribution with parameter $p = 0.5$ for all edges; the third prior is defined based on the ground-truth graph structure: for edges sometimes present in the ground-truth structure (i.e., $\theta^\star_{ij} \neq 0$), the Bernoulli parameter is $p = 0.75$, while for edges never present in the ground-truth structure (i.e., $\theta^\star_{ij} = 0$), the parameter is $p = 0.05$. (ii) The standard deviations considered are $\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5\}$.

C.5. Real-world experiment

To demonstrate that our method learns meaningful graph distributions in real-world settings, we train a neural network on air-quality data from Beijing (Zheng et al., 2013). The dataset consists of pollutant measurements collected by sensors in Chinese urban areas across several months. We do not use this dataset as a benchmark because the graph structure provided with the data is based on physical distance, and there is no guarantee that it represents the true underlying structure. The neural network we use consists of a GRU unit for processing each time series, followed by a GNN with a learnable graph structure. Figure 15 shows the graph structure learned by our approach, demonstrating its capability to learn meaningful distributions in real-world settings.

Figure 15. Graph structure learned by our approach on Beijing air-quality data. The nodes correspond to sensor locations. The thickness of the edges is proportional to the corresponding probability. Map data from OpenStreetMap.

C.6. Compute resources and open-source software

The paper's experiments were run on a workstation with AMD EPYC 7513 processors and NVIDIA RTX A5000 GPUs; on average, a single model training terminates in a few minutes with a memory usage of about 1 GB. The developed code relies on PyTorch (Paszke et al., 2019) and the following additional open-source libraries: PyTorch Geometric (Fey & Lenssen, 2019), NumPy (Harris et al., 2020), and Matplotlib (Hunter, 2007).
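As anticipated in Appendix C.4, the following NumPy sketch illustrates a $K$-sample Monte Carlo version of the score-function estimator of $\nabla_\theta \mathcal{L}_{\mathrm{point}}$ for a single $(x, y^\star)$ pair, assuming independent Bernoulli edges; the predictor $f_\psi$, the problem sizes, and the parameter values are placeholders and not those used in our experiments.

```python
# Monte Carlo sketch of the score-function gradient of L_point (Appendix C.4),
# for a single (x, y*) pair and independent Bernoulli edges. All values are placeholders.
import numpy as np

rng = np.random.default_rng(0)
N, K = 10, 64                       # number of nodes, Monte Carlo samples
theta = np.full((N, N), 0.3)        # current edge parameters of P^theta_A
x = rng.standard_normal(N)
y_star = rng.standard_normal(N)     # placeholder target

def f_psi(x, A):                    # placeholder predictor, e.g. a 1-layer GNN
    return np.tanh(A @ x)

A_samples = (rng.random((K, N, N)) < theta).astype(float)     # A_k ~ P^theta_A
y_hats = np.stack([f_psi(x, A) for A in A_samples])           # [K, N]

# grad_theta log P^theta_A(A) for independent Bernoulli edges
score = (A_samples - theta) / (theta * (1.0 - theta))         # [K, N, N]

y_mean = y_hats.mean(axis=0)                                  # E_A[y_hat], shape [N]
# E_A[ y_hat * grad_theta log P ], averaged over the K samples
weighted = np.einsum('kn,kij->nij', y_hats, score) / K        # [N, N, N]
# Contract the output dimension with 2 * (E_A[y_hat] - y*)
grad_theta = np.einsum('n,nij->ij', 2.0 * (y_mean - y_star), weighted)
print(grad_theta.shape)             # (N, N)
```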