# Training Normalizing Flows from Dependent Data

Matthias Kirchler¹ ², Christoph Lippert¹ ³, Marius Kloft²

¹Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany. ²University of Kaiserslautern-Landau, Germany. ³Hasso Plattner Institute for Digital Health at the Icahn School of Medicine at Mount Sinai, New York. Correspondence to: Matthias Kirchler.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

**Abstract.** Normalizing flows are powerful non-parametric statistical models that function as a hybrid between density estimators and generative models. Current learning algorithms for normalizing flows assume that data points are sampled independently, an assumption that is frequently violated in practice and that may lead to erroneous density estimation and data generation. We propose a likelihood objective of normalizing flows incorporating dependencies between the data points, for which we derive a flexible and efficient learning algorithm suitable for different dependency structures. We show that respecting dependencies between observations can improve empirical results on both synthetic and real-world data, and leads to higher statistical power in a downstream application to genome-wide association studies.

## 1. Introduction

Density estimation and generative modeling of complex distributions are fundamental problems in statistics and machine learning, with significance across many application domains. Remarkably, normalizing flows (Rezende & Mohamed, 2015; Papamakarios et al., 2021) can solve both of these tasks at the same time. Furthermore, their neural architecture allows them to capture even very high-dimensional and complex structured data (such as images and time series). In contrast to other deep generative models such as variational autoencoders (VAEs), which only optimize a lower bound on the likelihood objective, normalizing flows optimize the likelihood directly.

Previous work on both generative modeling and density estimation with deep learning assumes that data points are sampled independently from the underlying distribution. However, this modeling assumption is oftentimes heavily violated in practice. Figure 1 illustrates why this can be problematic. A standard normalizing flow trained on dependent data will misinterpret the sampling distortions in the training data as true signal (Figure 1c). Our proposed method, on the other hand, can correct for the data dependencies and reconstruct the original density more faithfully (Figure 1d).

The problem of correlated data is very common and occurs in many applications. Consider the ubiquitous task of image modeling. The Labeled Faces in the Wild data set (LFW; Huang et al., 2008) consists of facial images of celebrities, but some individuals in the data set are grossly overrepresented. For example, George W. Bush is depicted in 530 images, while around 70% of the individuals in the data set appear only once. A generative model trained naively on these data will put considerably more probability mass on images similar to George W. Bush than on the less represented individuals. Arguably, most downstream tasks, such as image generation and outlier detection, would benefit from a model that is less biased towards these overrepresented individuals.
In the biomedical domain, large cohort studies involve participants that oftentimes are directly related (such as parents and children) or indirectly related (by sharing genetic material due to a shared ancestry), a phenomenon called population stratification (Cardon & Palmer, 2003). These dependencies between individuals play a major role in the traditional analyses of these data and require sophisticated statistical treatment (Lippert et al., 2011), but current deep-learning-based non-parametric models lack the required methodology to do so. This can have a considerable negative impact on downstream tasks, as we will show in our experiments.

In finance, accurate density estimation and modeling of assets (e.g., stock market data) is essential for risk management and modern trading strategies. Data points are often heavily correlated with one another, due to time, sector, or other relations. Traditionally, financial analysts often use copulas for non-parametric modeling; copulas themselves can be interpreted as a simplified version of normalizing flows (Papamakarios et al., 2021). Copulas commonly in use, however, are limited in their expressivity, which has led some authors even to blame the 2007-2008 global financial crisis on the use of inadequate copulas (Salmon, 2009). Many more examples appear in other settings, such as data with geospatial dependencies, as well as in time series and video data.

Figure 1: Example setting on synthetic data sampled with inter-instance dependencies. Training a standard normalizing flow on these data biases the model; adjusting for the dependencies during training with our modified objective recovers the true underlying distribution. (a) True distribution, sampled independently. (b) True distribution, sampled with dependencies. (c) Distribution learned by a standard normalizing flow, trained on dependent data. (d) Distribution learned by a normalizing flow adjusted for dependencies, trained on dependent data.

In certain settings from classical parametric statistics, direct modeling of the dependencies in maximum likelihood models is analytically feasible. For linear and generalized linear models, dependencies are usually addressed either with random effects in linear mixed models (Jiang & Nguyen, 2007) or sometimes only by the inclusion of fixed-effects covariates (Price et al., 2006). Recent work introduced concepts from random-effects linear models into deep learning for prediction tasks such as regression and classification (Simchoni & Rosset, 2021; Xiong et al., 2019; Tran et al., 2020). In federated learning of generative models, researchers usually deal with the violation of the i.i.d. assumption with ad hoc methods and without consideration of the statistical implications (Augenstein et al., 2020; Rasouli et al., 2020). These methods also only consider block-type, repeat-measurement dependencies for multi-source integration. To the best of our knowledge, both deep generative models and deep density estimation so far lack the tools to address violations of the independence assumption in the general setting and in a well-founded statistical framework.

In this work we show that the likelihood objective of normalizing flows naturally allows for the explicit incorporation of data dependencies. We investigate several modes of modeling the dependency between data points, appropriate in different settings. We also propose efficient optimization procedures for this objective.
We then apply our proposed method to three high-impact real-world settings. First, we model a set of complex biomedical phenotypes and show that adjusting for the genetic relatedness of individuals leads to a considerable increase in statistical testing power in a downstream genome-wide association analysis. Next, we consider two image data sets, one with facial images and one from the biomedical domain, and obtain less biased generative image models. In the last application, we use normalizing flows to better model the return correlations between financial assets. In all experiments, we find that adjusting for dependencies can significantly improve the model fit of normalizing flows.

## 2. Method

In this section we describe our methodology for training normalizing flows from dependent data. First, we will derive a general formulation of the likelihood under weak assumptions on the dependencies among observations. Afterwards, we will investigate two common settings in more detail.

### 2.1. Background: Likelihood with Independent Observations

A normalizing flow is an invertible function $t: \mathbb{R}^p \to \mathbb{R}^p$ that maps a $p$-dimensional noise variable $u$ to a $p$-dimensional synthetic data variable $x$. The noise variable $u$ usually follows a simple distribution (such as $\mathcal{N}_p(0, I_p)$) whose density is explicitly known and efficiently computable. Using the change of variables formula, the log-density can be explicitly computed as

$$\log p_x(x) = \log p_u(u) - \log\lvert\det J_t(u)\rvert,$$

where $u := t^{-1}(x)$ and $J_t(u)$ is the Jacobian matrix of $t$ at $u$. Given a data set $x_1, \ldots, x_n$, if the observations are independent and identically distributed, the full log-likelihood readily factorizes into the respective marginal densities:

$$\log p_x(x_1, \ldots, x_n) = \sum_{i=1}^n \log p_x(x_i) = \sum_{i=1}^n \big( \log p_u(u_i) - \log\lvert\det J_t(u_i)\rvert \big).$$

The function $t$ is usually chosen in such a way that both the inverse $t^{-1}$ and the determinant of the Jacobian $J_t$ can be efficiently evaluated, e.g., using coupling layers (Dinh et al., 2017). Therefore, all terms in the likelihood can be explicitly and efficiently computed, and the likelihood serves as the direct objective for optimization.

### 2.2. Likelihood with Dependencies

If the data points are identically distributed but not independent, the joint density no longer factorizes. A model trained on non-independent data under independence assumptions will hence yield biased results for both density estimation and data generation. We can derive the non-independent setting as follows. Let $T: \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ be the normalizing flow applied to all data points together, i.e.,

$$U = T^{-1}(X) = T^{-1}(x_1, \ldots, x_n) = \begin{pmatrix} t^{-1}(x_1) \\ \vdots \\ t^{-1}(x_n) \end{pmatrix}.$$

$X, U \in \mathbb{R}^{n \times p}$ are now matrix-variate random variables. We can still apply the change of variables formula, but to the $\mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ transformation $T$ instead of the $\mathbb{R}^p \to \mathbb{R}^p$ transformation $t$:

$$\log p_X(X) = \log p_U(U) - \log\lvert\det J_T(U)\rvert.$$

If $T$ is understood as a map on $\mathbb{R}^{np}$ instead of $\mathbb{R}^{n \times p}$ (i.e., we simply vectorize $T$), it becomes clear that the Jacobian $J_T$ is a block-diagonal matrix,

$$J_T(U) = \begin{pmatrix} J_t(u_1) & 0 & \cdots & 0 \\ 0 & J_t(u_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & J_t(u_n) \end{pmatrix},$$

for which the determinant is readily available: $\det J_T(U) = \prod_{i=1}^n \det J_t(u_i)$. In other words, the log-abs-det term in the normalizing flow objective remains unchanged even under arbitrary dependence structure. The density $p_U(U)$, however, is challenging and generally not tractable, and we will consider different assumptions on the joint distribution of $U$.
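To make this bookkeeping concrete, the following minimal NumPy sketch (our own illustration, not the paper's code) uses a hypothetical element-wise affine flow in place of a real coupling-layer architecture; it evaluates the i.i.d. objective and the block-diagonal log-abs-det term, which stays the same under dependence:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2

# Hypothetical toy flow t(u) = a * u + b (element-wise affine); a real model
# would use coupling layers, but the likelihood bookkeeping is identical.
a, b = np.array([2.0, 0.5]), np.array([1.0, -1.0])

def t_inverse(x):
    return (x - b) / a

def log_abs_det_jacobian_t(u):
    # J_t(u) = diag(a), so log|det J_t(u)| = sum_j log|a_j|, constant in u here.
    return np.sum(np.log(np.abs(a)))

x = rng.normal(size=(n, p)) * a + b          # synthetic data from the toy flow
u = t_inverse(x)                             # u_i = t^{-1}(x_i), row-wise

# i.i.d. objective: sum_i [ log N(u_i; 0, I_p) - log|det J_t(u_i)| ]
log_pu = -0.5 * np.sum(u ** 2, axis=1) - 0.5 * p * np.log(2 * np.pi)
loglik_iid = np.sum(log_pu - log_abs_det_jacobian_t(u))

# Under dependence, J_T is block-diagonal with blocks J_t(u_i), so the
# log-abs-det term is still a sum over observations; only log p_U(U) changes.
log_abs_det_T = n * log_abs_det_jacobian_t(u)
print(loglik_iid, log_abs_det_T)
```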
In the most general case, we could assume that each $u_i$ is marginally distributed as a $U[0,1]^p$ variable, with arbitrary dependence structure across observations. This is a direct extension of standard copulas to matrix-variate variables. As learning general copulas is extremely challenging even in relatively low-dimensional settings (Jaworski et al., 2010), we focus in this work only on the equivalent of a Gaussian copula:

**Assumption 2.1.** We assume that the dependency within $U$ can be modeled by a matrix normal distribution $\mathcal{MN}$ with independent columns (within observations) but correlated rows (between observations):

$$U \sim \mathcal{MN}_{n,p}(0, C, I_p), \quad \text{i.e.,} \quad \operatorname{vec}(U) \sim \mathcal{N}_{np}(0, I_p \otimes C).$$

Here, $\otimes$ denotes the Kronecker product. We can fix the mean to $0$ and the within-observation (column) covariance to $I_p$, as the normalizing flow $t$ is usually chosen to be expressive enough to transform an $\mathcal{N}_p(0, I_p)$ into the desired data distribution. We note that this assumption means that we cannot model all forms of latent dependencies, so it constitutes a trade-off between expressivity and tractability. Now we can state the full likelihood in the non-i.i.d. setting:

$$\log p_X(X) = -\sum_{i=1}^n \log\lvert\det J_t(u_i)\rvert - \frac{np}{2}\log(2\pi) - \frac{p}{2}\log\det(C) - \frac{1}{2}\operatorname{tr}\big(U^\top C^{-1} U\big). \tag{1}$$

### 2.3. Specific Covariance Structures

We investigate different assumptions on the covariance structure in the latent dependency model. The most general case is a fully unspecified covariance matrix, e.g., parametrized via the lower-triangular Cholesky decomposition of its inverse, $C = L^{-1}L^{-\top}$, with $n(n+1)/2$ parameters. In this case, the determinant can be efficiently computed, as $\det(C) = \det(L^{-1}L^{-\top}) = \det(L^{-1})^2 = \prod_{i=1}^n (L^{-1})_{i,i}^2$. Matrix products with $C^{-1}$ can also be evaluated reasonably fast. However, this parametrization requires optimizing $O(n^2)$ additional parameters, which is unlikely to yield useful estimates and may be prone to overfitting. Instead, we consider two different assumptions on $C$ that are very common in practice and give a reasonable trade-off between expressivity and statistical efficiency.

#### 2.3.1. Known and Fixed Covariance Matrix

In many settings, side information can yield relationship information, given in the form of a fixed relationship matrix $G$. The covariance matrix then becomes $C = \lambda I_n + (1-\lambda)G$ with only a single parameter $\lambda \in [0, 1]$ to be determined. This setting is commonly assumed for confounding correction in genetic association studies, where $G$ is a genetic relationship matrix (whose entries are pairwise genetic relationships computed from allele frequencies (Lippert et al., 2011)) or is based on pedigree information (e.g., a parent-child pair receives a relationship coefficient of 0.5 and a grandparent-grandchild pair of 0.25 (Visscher et al., 2012)). Similarly, for time-related data, we can define relationships via, e.g., a negative exponential kernel, $C_{i,j} = \exp(-\gamma (t_i - t_j)^2)$, where the hyperparameter $\gamma > 0$ is a time-decay factor and $t_i$ and $t_j$ are the measurement time points of observations $i$ and $j$, respectively. More generally, $G$ itself can again be a mixture of multiple relationships, $G = \sum_{r=1}^R G_r$, where the $G_r$ denote multiple sources of relatedness. In this work, we consider $G$ to be fully specified and only estimate $\lambda$.

If the sample size is moderate (say, below 50k), an efficient approach to optimizing $\lambda$ (Lippert et al., 2011) consists of first computing the spectral decomposition $G = Q\Lambda Q^\top$ (with diagonal $\Lambda$ and orthogonal $Q$) and noticing that $\lambda I_n + (1-\lambda)G = Q(\lambda I_n + (1-\lambda)\Lambda)Q^\top$. Then, the log-determinant and the trace are

$$\log\det(C) = \sum_{i=1}^n \log\big(\lambda + (1-\lambda)\Lambda_{i,i}\big), \qquad \operatorname{tr}\big(U^\top C^{-1} U\big) = \operatorname{tr}\big((Q^\top U)^\top (\lambda I_n + (1-\lambda)\Lambda)^{-1} (Q^\top U)\big).$$
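As an illustration of this spectral trick, the following NumPy sketch (our own, not the authors' implementation; `make_relationship` is a hypothetical stand-in for a real relationship matrix $G$) evaluates $\log\det(C)$ and the trace term for several values of $\lambda$ after a single eigendecomposition:

```python
import numpy as np

def make_relationship(n, rng):
    # Hypothetical stand-in for a fixed relationship matrix G (e.g., a genetic
    # relationship matrix); any symmetric positive semi-definite matrix works.
    B = rng.normal(size=(n, n // 2))
    G = B @ B.T
    d = np.sqrt(np.diag(G))
    return G / np.outer(d, d)                # normalize to unit diagonal

rng = np.random.default_rng(0)
n, p = 500, 3
G = make_relationship(n, rng)
U = rng.normal(size=(n, p))                  # latent codes u_i = t^{-1}(x_i)

# One-off cost, independent of lambda: G = Q diag(Lam) Q^T
Lam, Q = np.linalg.eigh(G)
QtU = Q.T @ U                                # rotate the latents once

def gaussian_term(lam):
    # log det(C) and tr(U^T C^{-1} U) for C = lam*I + (1-lam)*G, using
    # C^{-1} = Q diag(1 / (lam + (1-lam)*Lam)) Q^T.
    diag = lam + (1.0 - lam) * Lam
    logdet = np.sum(np.log(diag))
    trace = np.sum((QtU ** 2) / diag[:, None])
    return logdet, trace

# Sweep lambda without ever refactorizing C.
for lam in (0.1, 0.5, 0.9):
    print(lam, gaussian_term(lam))
```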
The rotation matrix $Q$ makes mini-batch estimation of the trace term inefficient, as $Q$ will either mix $U$ across batches or require a full re-evaluation of $Q(\lambda I_n + (1-\lambda)\Lambda)^{-1}Q^\top$ after each update to $\lambda$, i.e., in every mini-batch. Instead, we optimize the parameters of the normalizing flow and $\lambda$ in an alternating two-step procedure; see Section 2.4.2. Note that the main additional cost of this procedure, the spectral decomposition of $G$, is independent of $\lambda$ and only needs to be performed once for a given relationship matrix $G$. For larger sample sizes, there still exist practical algorithms for estimating the variance component (Loh et al., 2015). In practice, $G$ is also often sparse or can be approximated sparsely (e.g., by setting all elements with absolute value below a fixed threshold to 0). This can greatly accelerate parameter estimation and is usually accurate enough in practice (Jiang et al., 2019). More generally, different matrix structures may allow for additional speed-ups, but we defer this investigation to future work.

#### 2.3.2. Block-Diagonal, Equicorrelated Covariance Structure

In the next setting, we consider a block-diagonal covariance matrix $C$ with equicorrelated correlation matrices $C_i \in \mathbb{R}^{n_i \times n_i}$ as blocks:

$$C = \begin{pmatrix} C_1 & 0 & \cdots & 0 \\ 0 & C_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C_N \end{pmatrix}, \qquad C_i = \begin{pmatrix} 1 & \rho_i & \cdots & \rho_i \\ \rho_i & 1 & \cdots & \rho_i \\ \vdots & \vdots & \ddots & \vdots \\ \rho_i & \rho_i & \cdots & 1 \end{pmatrix},$$

with $\rho_i \in (0, 1)$ (we ignore the case of potentially anticorrelated blocks). In other words, there is no dependence between blocks, and there is a constant dependence within blocks. We assume that the block structure is known ahead of time and we only need to find the parameters $\rho_i$. For each block there is either no ($n_i = 1$) or only one ($n_i > 1$) parameter to be learned. The assumption of equicorrelated blocks is reasonable in settings with repeat measurements of identical objects or individuals; e.g., in a facial image data set, certain individuals may have multiple images. This setting is similar to that of high-cardinality categorical features in prediction models (Simchoni & Rosset, 2021). Both the determinant and the inverse of each block can be efficiently computed (Tong, 2012, Prop. 5.2.1 & 5.2.3):

$$\det(C_i) = \big(1 + (n_i - 1)\rho_i\big)(1 - \rho_i)^{n_i - 1}, \qquad (C_i^{-1})_{j,k} = \begin{cases} \dfrac{1 + (n_i - 2)\rho_i}{(1-\rho_i)\big(1 + (n_i - 1)\rho_i\big)} & \text{if } j = k, \\[2ex] \dfrac{-\rho_i}{(1-\rho_i)\big(1 + (n_i - 1)\rho_i\big)} & \text{otherwise.} \end{cases}$$

### 2.4. Optimization

#### 2.4.1. Mini-Batch Estimation

The full likelihood in Equation 1 can be computed explicitly but does not lend itself easily to stochastic optimization with mini-batches. Note that the log-abs-det term decomposes nicely over the observations and that the next two terms are independent of the observations. Only the trace term is problematic for mini-batch estimation, so we propose an unbiased stochastic estimator for it.

**Proposition 2.2.** Given a mini-batch of size $b \geq 2$, let $\xi \in \{0,1\}^n$ indicate batch inclusion (i.e., $x_i$ is in the batch iff $\xi_i = 1$; $\sum_{i=1}^n \xi_i = b$) and let $A := C^{-1}$. Then the stochastic trace estimator

$$\frac{n}{b}\sum_{i=1}^n \xi_i A_{i,i}\, u_i^\top u_i + \frac{2n(n-1)}{b(b-1)} \sum_{i<j} \xi_i \xi_j A_{i,j}\, u_i^\top u_j \tag{2}$$

is an unbiased estimator of the trace term $\operatorname{tr}\big(U^\top C^{-1} U\big)$.

### C.1. Equicorrelated Blocks

**Grid optimization.** Training with $\rho_i > 0$ has almost identical run times as the baseline method with $\rho_i = 0$, as the base distribution likelihood evaluation is not a bottleneck. Since all $\rho_i$ are identical, there is virtually no additional memory requirement. However, as the full network needs to be trained for each of the $M_{\mathrm{grid}}$ grid values tested, the grid evaluation scheme takes roughly $M_{\mathrm{grid}} \cdot t_{\mathrm{baseline}}$, where $t_{\mathrm{baseline}}$ is the training time of the baseline flow.
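As a small aside on the per-block cost, the following NumPy sketch (our own illustration, assuming only the closed-form determinant and inverse from Section 2.3.2) evaluates the per-block Gaussian term without ever forming or factorizing $C_i$; the dense computation at the end is just a sanity check:

```python
import numpy as np

def block_gaussian_term(U_block, rho):
    """log det(C_i) and tr(U_i^T C_i^{-1} U_i) for one equicorrelated block
    C_i = (1 - rho) * I + rho * 11^T, using the closed forms from Sec. 2.3.2."""
    ni = U_block.shape[0]
    logdet = np.log(1.0 + (ni - 1) * rho) + (ni - 1) * np.log(1.0 - rho)
    # C_i^{-1} = (I - rho / (1 + (ni - 1) * rho) * 11^T) / (1 - rho)
    col_sums = U_block.sum(axis=0)                       # 1^T U_i
    quad = (np.sum(U_block ** 2)
            - rho / (1.0 + (ni - 1) * rho) * np.sum(col_sums ** 2)) / (1.0 - rho)
    return logdet, quad

rng = np.random.default_rng(0)
U_block = rng.normal(size=(5, 3))
rho = 0.3

# Sanity check against a dense computation of the same quantities.
ni = U_block.shape[0]
C = (1 - rho) * np.eye(ni) + rho * np.ones((ni, ni))
print(block_gaussian_term(U_block, rho))
print(np.linalg.slogdet(C)[1],
      np.trace(U_block.T @ np.linalg.inv(C) @ U_block))
```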
**Joint optimization.** In this setting, $N$ (the number of blocks) parameters $\rho_i$ need to be additionally estimated and stored in memory, but in all cases considered in this paper this was strongly dominated by the number of parameters in the model (e.g., in LFW, the normalizing flow model had 90M parameters, but only a few thousand extra parameters for the individuals). For very slim models and a very large number of blocks, this relationship may change.

### C.2. Fixed Covariance

For the fixed-covariance case, a full spectral decomposition is necessary prior to training, which is (in practice) an $O(n^3)$ operation. It also requires storing the full spectral decomposition in memory. Standard linear algebra libraries used in PyTorch or NumPy & SciPy only support spectral decompositions up to several tens of thousands of dimensions and oftentimes become unreliable beyond that. Therefore, using fixed covariance schemes is infeasible for larger-scale problems with out-of-the-box software.

**Grid optimization.** For the fixed grid schedule, mini-batch estimation requires quadratic time in the size of the mini-batch, due to the stochastic trace estimator in Equation 2. However, for the batch sizes used in our settings, this was still dominated by the neural architecture shared with the baseline flow. The log-det-Jacobian can be cached, and the remaining parts are identical to the baseline flow, so each individual epoch has very similar time requirements to the baseline model. Analogously to the equicorrelated-blocks grid optimization, we still need to perform $M_{\mathrm{grid}}$ runs, although the same spectral decomposition can be used for all of those runs.

**Alternating optimization.** The main training stage for the flow parameters has identical computational considerations as the grid optimization procedure. However, for optimizing $\hat{\lambda}$ in every other training stage, the full data set first needs to be pushed through the normalizing flow and then rotated with the orthogonal matrix $Q$ from the spectral decomposition. Despite this, the alternating training procedure was dominated by the original spectral decomposition and the main training stage of the flow.

## D. ADNI Images

Figure 6: Random samples of ADNI train images and images generated by the normalizing flow models. (a) Train images; (b) Baseline; (c) Grid Search.