# Implicit Generative Copulas

Tim Janke, Mohamed Ghanmi, Florian Steinke
Energy Information Networks and Systems, TU Darmstadt, Germany
{tim.janke},{florian.steinke}@tu-darmstadt.de

35th Conference on Neural Information Processing Systems (NeurIPS 2021), virtual.

**Abstract** Copulas are a powerful tool for modeling multivariate distributions as they allow the univariate marginal distributions and the joint dependency structure to be estimated separately. However, known parametric copulas offer limited flexibility, especially in high dimensions, while commonly used non-parametric methods suffer from the curse of dimensionality. A popular remedy is to construct a tree-based hierarchy of conditional bivariate copulas. In this paper, we propose a flexible, yet conceptually simple alternative based on implicit generative neural networks. The key challenge is to ensure marginal uniformity of the estimated copula distribution. We achieve this by learning a multivariate latent distribution with unspecified marginals but the desired dependency structure. By applying the probability integral transform, we can then obtain samples from the high-dimensional copula distribution without relying on parametric assumptions or the need to find a suitable tree structure. Experiments on synthetic and real data from finance, physics, and image generation demonstrate the performance of this approach.

## 1 Introduction

Many approaches are available to reliably model univariate distributions of real-valued variables. One can use various parametric distributions or employ flexible, non-parametric methods, e.g., empirical distribution functions, kernel density estimation (KDE), or quantile regression [22]. Modeling distributions in higher dimensions is a much harder task. Parametric approaches using, e.g., Gaussian distributions, are common but inflexible. Many non-parametric approaches become infeasible in higher dimensions due to the curse of dimensionality [18]. Recent successes in modeling multivariate distributions have been achieved with deep neural network-based, likelihood-free implicit generative models [28], such as Generative Adversarial Networks (GANs) [13] and Generative Moment Matching Networks (GMMNs) [10; 24].

Copulas are a tool to decouple the modeling of the univariate marginal distributions from the modeling of the high-dimensional joint dependency structure [19]. This makes it possible to obtain high-quality marginal models with the univariate techniques mentioned above. Moreover, the (conditional) marginals can be tailored to the problem at hand, and many fields have developed sophisticated domain-specific univariate models for this task, e.g., for probabilistic forecasts in finance [4], weather [36], or energy [17]. These models can then be combined with a suitable copula structure to enable the simulation of multivariate quantities, e.g., [34; 29; 41]. In the machine learning literature, copulas have been used to increase the flexibility of Bayesian networks [11], to model multi-agent coordination in reinforcement learning [43], or for image generation via the Vine Copula Autoencoder [40]. Another popular application of copulas is synthetic tabular data generation [33; 27]. Note that in many of these applications the main goal is to sample from a multivariate distribution, and the copula is used as a building block for a generative model.

High-dimensional copula distributions are commonly modeled via parametric structures such as the Gaussian or the Student-t copula [8].
[25] propose to enhance the subclass of Archimedean copulas by learning the generator functions via deep neural networks. However, the family of Archimedean copulas is of limited use in higher dimensions as it makes restrictive symmetry assumptions. A more flexible, semi-parametric state-of-the-art approach is to use vine copulas, which are built from trees of bivariate pair-copulas [1]. These pair-copulas can again be parametric, such as the Gumbel or Clayton copula, or non-parametric, e.g., by using bivariate kernel density estimation [32].

Directly applying implicit generative modeling to the task of estimating copula distributions is not straightforward since copula distributions are required to have uniform marginals. Such an approach has been proposed by [23; 16], but without ensuring the uniformity property, at least not for finite sample sizes when the model cannot be expected to exactly fit the true copula distribution. Ensuring the marginal uniformity of the learned copula distribution is important since deviations from uniformity will result in unwanted alterations of the marginal distributions in the data space.

In this paper we show how to design and train implicit generative copula (IGC) models to match the dependency structure of given data. A key challenge is to ensure the marginal uniformity of the estimated distribution. We achieve this by first learning a latent distribution with unspecified marginals but the same dependency structure as the training data. We then obtain the desired copula model by applying the probability integral transform component-wise to the latent distribution. The IGC model and the data distribution are matched in copula space using the energy distance [39]. During training, the probability integral transform is approximated through a differentiable softrank layer, which allows the use of gradient-based methods.

**Our contributions** We propose the first universal, non-parametric model for estimating high-dimensional copula distributions with guaranteed uniformity of the marginal distributions. We show how flexible likelihood-free implicit generative models based on deep neural networks can be trained for this task through the use of a differentiable softrank layer. Compared with the state-of-the-art semi-parametric vine copula approach, we demonstrate similar or improved performance for different tasks.

We present the proposed copula model in Section 2. Model training is described in Section 3. Section 4 shows experimental results for synthetic and real data from finance, physics, and image generation. We conclude in Section 5.

## 2 Proposed IGC Model

We introduce our IGC model following the schematic shown in Figure 1. Figure 2 provides an exemplary application for a bivariate two-component Gaussian mixture data distribution.

**Copula Basics** The vector-valued continuous random variable $X \in \mathbb{R}^D$ represents the data source of our task. We denote its distribution by $P$ and the cumulative distribution function (cdf) of $X$ by $F_X(x)$. Moreover, let $F_{X_d}(x_d)$ be the cdf of the marginal distribution of $X_d$, the $d$-th component of $X$, for $d = 1, \ldots, D$. Each sample vector $x \in \mathbb{R}^D$ can be mapped to a vector $u \in [0, 1]^D$ by defining as $u_d$ the value of $x_d$ with respect to the corresponding marginal cdf of $X_d$, i.e., $u_d = F_{X_d}(x_d)$ for all $d = 1, \ldots, D$. This operation is called the probability integral transform (PIT). In the copula literature, values obtained by the PIT are called pseudo-observations [19].
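To make the probability integral transform concrete, here is a small NumPy/SciPy sketch (our illustration, not taken from the paper's released code) that maps a data matrix to pseudo-observations via the component-wise empirical cdfs; the example data are arbitrary.

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(x):
    """Map data x of shape (N, D) to unit-space pseudo-observations u_d = F_hat_{X_d}(x_d)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # column-wise ranks divided by the sample size approximate the marginal cdfs
    return rankdata(x, axis=0) / n

# illustrative example: 1000 samples from a correlated bivariate Gaussian
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=1000)
u = pseudo_observations(x)  # values in (0, 1], each column is (approximately) uniform
```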
The random variable $U$ follows a so-called copula distribution $P^C$ by construction, i.e., the distribution is defined on the unit space $[0, 1]^D$ and its marginals are uniform. The joint cdf of $U$ is typically called the copula, which we denote by $F_U(u)$ here. Sklar's theorem states that such a copula function exists for all random variables $X$ and that

$$F_X(x) = F_U\big(F_{X_1}(x_1), \ldots, F_{X_D}(x_D)\big),$$

i.e., any multivariate distribution can be expressed in terms of its marginals and its copula [38].

Figure 1: Illustration of the implicit generative copula (IGC) concept: We use a generative neural network $g_\theta$ with learnable parameters $\theta$ to map samples from a multivariate standard Normal distribution to samples of the latent distribution $Q_\theta$. The marginal cdfs $F_{Y_d}$ of this distribution are then used to generate samples of the copula distribution $Q_\theta^C$. The model is trained to match the copula distribution $P^C$ of the real data using the energy distance.

Figure 2: Contour plots and marginal densities of all data and model distributions mentioned in Figure 1 for the example of a two-dimensional mixture of Gaussians data distribution $P$. Contour plots are derived via KDE from a set of 10000 samples.

**IGC Model** In this work, we aim at defining a flexible, non-parametric model for copula distributions $Q_\theta^C$ in high dimensions. Such a model can either be used to (approximately) represent the copula distribution $P^C$ of the unit space vectors $U$, or the distribution $P$ of the original data vectors $X$, in which case the $U$ vectors have to be transformed component-wise with the inverse cdfs of the components $X_d$, i.e., $x_d = F_{X_d}^{-1}(u_d)$.

A flexible model class for learning high-dimensional probability distributions is the class of implicit generative models [28]. Here, latent random variables with a simple and known distribution are transformed via a parameterizable mapping, typically a deep neural network, into a complex distribution that approximates the distribution of the training data. A straightforward application of this framework to model copula distributions, as done in [16] and [23], is difficult. The output of the generative model can be guaranteed to lie in $[0, 1]^D$ by applying an appropriate output layer, e.g., using a sigmoid function. However, guaranteeing the desired uniformity of the marginal distributions is not straightforward. We first experimented with additional training losses that penalize deviations from marginal uniformity. However, a simpler approach without any hyper-parameters for weighting additional cost terms and with a guarantee of marginal uniformity is the following two-step procedure.

In the first step, we model the distribution of the vector-valued latent random variable $Y \in \mathbb{R}^D$. To this end, we start with random variables $Z \in \mathbb{R}^K$ from a simple base distribution that we choose here as a zero mean, unit variance Gaussian. The tuneable map $g_\theta : \mathbb{R}^K \to \mathbb{R}^D$ with parameters $\theta$ is then used to transform the noise samples $z$ into the latent samples $y$ as

$$y = g_\theta(z). \qquad (1)$$

We denote the resulting probability distribution of $Y$ by $Q_\theta$ and the marginal cdfs by $F_{Y_d}(y_d)$. In a second step, we transform the latent variables $Y$ into unit space vectors $V$. To this end, we define component-wise for $d = 1, \ldots, D$

$$v_d = F_{Y_d}(y_d). \qquad (2)$$

This transformation step is analogous to going from $X$ to $U$.
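As an illustration of this two-step construction, here is a sketch using TensorFlow/Keras (the framework the paper's experiments rely on); the layer sizes, names, and the rank-based stand-in for $F_{Y_d}$ are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
import tensorflow as tf

D, K = 2, 6  # data dimension and noise dimension (Section 4 uses K = 3D)

# g_theta: a small fully connected generator network (architecture is illustrative)
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(K,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(D),
])

def sample_model_copula(generator, m):
    """Draw m samples from the model copula distribution Q_theta^C."""
    z = tf.random.normal((m, K))              # noise Z ~ N(0, I)
    y = generator(z).numpy()                  # latent samples Y ~ Q_theta, Eq. (1)
    # empirical stand-in for Eq. (2): rank each column of y within the sample
    ranks = np.argsort(np.argsort(y, axis=0), axis=0) + 1
    return ranks / m                          # unit-space samples V with uniform marginals

v = sample_model_copula(generator, m=1000)    # untrained model, but marginals are already uniform
```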
This construction guarantees both that the sample values $v_d$ from the model lie in the unit interval $[0, 1]$ and that they are uniformly distributed, regardless of the distribution $Q_\theta$. We refer to the distribution of $V$ as the model copula distribution and denote it by $Q_\theta^C$.

**Interpretation** Note that the distributions $Q_\theta$ and $P$ might be very different. More precisely, the mapping (1) can result in arbitrary marginal distributions, but an optimal model of the distribution $P$ would have

$$P^C = Q_\theta^C. \qquad (3)$$

Given a sufficiently rich function class for $g_\theta$, any distribution of $Y$ can be modeled with small error [26]. This statement naturally extends to universal approximation properties for the dependency structure of $Y$, i.e., the unit space random vectors $V$. The approximation of the data copula distribution $P^C$ by our model copula distribution $Q_\theta^C$ will thus be arbitrarily close if the generator function $g_\theta$ is flexible enough and the optimal $\theta$ is selected.

## 3 Training Procedure

**Problem statement** We aim at empirically estimating an IGC model $Q_\theta^C$ from data, i.e., we want to determine the optimal parameter vector $\theta$ given a set of $N$ training samples $x^1, \ldots, x^N$ from $X$. The underlying target is to optimally match the copula model $Q_\theta^C$ to the (typically unknown) copula distribution $P^C$ from which the samples were generated.

**Parameter estimation** We first transform the given data samples into unit space vectors in $[0, 1]^D$ before starting the actual training procedure. Since we typically do not know the exact marginal cdfs $F_{X_d}$ of the components $X_d$, we estimate them from the given data. To this end, we use the empirical cdfs separately for each component and define for $n = 1, \ldots, N$ and $d = 1, \ldots, D$

$$u_d^n = \frac{1}{N} \sum_{i=1}^{N} 1[x_d^i \le x_d^n]. \qquad (4)$$

Here, $1[A]$ is the indicator function of event $A$, i.e., it is 1 iff $A$ is true. Note that this is an unbiased and consistent estimator for the true cdf [42]. We then aim at matching the copula distribution $Q_\theta^C$ to the unit space vector samples $u^1, \ldots, u^N$.

We base the inference of our likelihood-free model on the energy distance $\mathrm{ED}^2$ [39] between the distributions $P^C$ and $Q_\theta^C$,

$$\mathrm{ED}^2(P^C, Q_\theta^C) = 2\,\mathbb{E}\,\|U - V\| - \mathbb{E}\,\|U - U'\| - \mathbb{E}\,\|V - V'\|. \qquad (5)$$

Here, $U, U'$ and $V, V'$ are independent copies of a random vector with distribution $P^C$ and $Q_\theta^C$, respectively. It holds that $\mathrm{ED}^2(P^C, Q_\theta^C) = 0$ if and only if $P^C = Q_\theta^C$ [39]. In practice, we use a sample-based approximation of the energy distance. We draw $M$ samples $v^1(\theta), \ldots, v^M(\theta)$ from $Q_\theta^C$ and minimize the loss function

$$L_{\mathrm{ED}^2}(\theta) = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \big\|u^n - v^m(\theta)\big\|_2 \;-\; \frac{1}{2M(M-1)} \sum_{m=1}^{M} \sum_{m' \neq m} \big\|v^m(\theta) - v^{m'}(\theta)\big\|_2. \qquad (6)$$

Since $P^C$ is independent of $\theta$, the second term of (5) does not need to be included. Note that the energy distance is an instance of the more general maximum mean discrepancy (MMD) measure [15; 37]. We also experimented with MMD losses based on the Gaussian kernel but found the energy distance to be more robust as it does not require tuning a kernel bandwidth.

To generate samples from our copula model $Q_\theta^C$, we first draw samples $y^1, \ldots, y^M$ from the implicit generative model $Q_\theta$. To obtain vectors in unit space, we then again require the cdfs $F_{Y_d}(y_d)$ of the marginal distributions of the components $Y_d$ of $Y$, $d = 1, \ldots, D$. However, these functions are not known during model training and will likely change after each gradient step. Hence, we again use the empirical distribution functions to define for $m = 1, \ldots, M$ and $d = 1, \ldots, D$

$$v_d^m(\theta) = \frac{1}{M} \sum_{j=1}^{M} 1\big[y_d^j(\theta) \le y_d^m(\theta)\big]. \qquad (7)$$

This provides us with an unbiased estimate of the current cdf during training.
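As a minimal NumPy sketch of (6) and (7) (our illustration, not the released implementation; function names are ours):

```python
import numpy as np

def hard_pseudo_obs(y):
    """Eq. (7): empirical marginal cdfs of the latent samples y (shape (M, D)),
    evaluated at the samples themselves; entry [i, d] = (1/M) * sum_j 1[y[j, d] <= y[i, d]]."""
    return (y[None, :, :] <= y[:, None, :]).mean(axis=1)

def energy_distance_loss(u, v):
    """Eq. (6): sample-based energy distance between data pseudo-observations u (N, D)
    and unit-space model samples v (M, D)."""
    m = v.shape[0]
    cross = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1).mean()
    pairwise = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    within = pairwise.sum() / (2 * m * (m - 1))  # diagonal terms are zero
    return cross - within
```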
However, the indicator operation in (7) is not differentiable and hence does not allow gradients to flow through this operation. Therefore, during training, we replace (7) with a softrank layer based on a scaled sigmoid function,

$$v_d^m(\theta) = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{1 + \exp\big(-\alpha\,(y_d^m(\theta) - y_d^j(\theta))\big)}, \qquad (8)$$

where $\alpha$ is a scaling constant [35]. For sufficiently large $\alpha$, (8) provides a close approximation to the empirical marginal cdfs at the current training step. A larger $M$ will result in a finer approximation of the true marginal cdfs but comes with an increased cost for computing the ranks, which requires $O(M^2)$ operations for all samples.

**Sampling from the trained model** Once we have completed training and thus have fixed $\theta$, we have also fixed the component-wise cdfs $F_{Y_d}(y_d)$. Now, the operation to estimate $F_{Y_d}(y_d)$ does not need to be differentiable anymore and we can choose any available method to estimate the univariate marginal distributions based on samples from $Q_\theta$. We choose to simply draw a very large set of samples from $Q_\theta$ and then use the empirical marginal cdfs, i.e., for each sampled value $y_d^t$ we store the corresponding value $u_d^t$ according to (4), $t = 1, \ldots, T$. Given a new realization of $Y_d$, we obtain its approximate cdf value $\hat{F}_{Y_d}(y_d)$ by interpolation of the stored values. Other, more compact approximations like univariate kernel density estimation would also be feasible.

In sum, we obtain a sample $v^i$ from $Q_\theta^C$ at test time by first sampling a realization $z^i$ from $N(0, I)$, transforming it via $y^i = g_\theta(z^i)$, and then applying the component-wise PIT approximation to obtain $v^i = [\hat{F}_{Y_1}(y_1^i), \ldots, \hat{F}_{Y_D}(y_D^i)]$. Given estimates $\hat{F}_{X_1}, \ldots, \hat{F}_{X_D}$ for the component-wise marginal cdfs of the data, we can derive a sample $s^i$ in data space via $s^i = [\hat{F}_{X_1}^{-1}(v_1^i), \ldots, \hat{F}_{X_D}^{-1}(v_D^i)]$.

## 4 Experiments

In the following we empirically demonstrate the capabilities of IGC models in a series of experiments on synthetic and real data of increasing complexity. Additional experiments can be found in Appendix B.

**Implementation** For all experiments except the image generation task, we use a fully connected neural network with two layers, 100 units per layer, ReLU activation functions, and train for 500 epochs. For the image generation experiment we use a three-layer, fully connected neural network with 200 neurons in each layer, and train for 100 epochs. In all cases we train with a batch size of $N_{\mathrm{batch}} = 100$ and generate $M = 200$ samples from the model per batch. The noise dimension is set to $K = 3D$ and we use $T = 10^6$. We use the Adam optimizer [20] with default parameters. All experiments besides the training of the autoencoder models were carried out on a desktop PC with an Intel Core i7-7700 3.60 GHz CPU and 8 GB RAM. For the training of the autoencoders we used Google Colab [14]. Training times for all experiments are in the range of a few minutes, except for the image generation task. Details are provided in Appendix B. We use TensorFlow with the Keras API for the neural networks [2; 7]. For copula modeling, we use pyvinecopulib [31] in Python and the R package kdecopula [30]. Our code is available from https://github.com/TimCJanke/igc.
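For orientation, the core training step described in Section 3 can be written in a few lines of TensorFlow. The following is a minimal sketch under our reading of (6) and (8); the value of $\alpha$, the small stabilization constant inside the norm, and all names are illustrative choices rather than the released implementation.

```python
import tensorflow as tf

def soft_pseudo_obs(y, alpha=100.0):
    """Differentiable softrank layer, Eq. (8); y has shape (M, D)."""
    diff = y[:, None, :] - y[None, :, :]                 # diff[i, j, d] = y[i, d] - y[j, d]
    return tf.reduce_mean(tf.sigmoid(alpha * diff), axis=1)

def stable_norm(x, eps=1e-12):
    """Euclidean norm along the last axis, offset to keep gradients finite at zero."""
    return tf.sqrt(tf.reduce_sum(tf.square(x), axis=-1) + eps)

def energy_loss(u, v):
    """Sample-based energy distance, Eq. (6); u: data pseudo-obs (N, D), v: model samples (M, D)."""
    m = tf.cast(tf.shape(v)[0], v.dtype)
    cross = tf.reduce_mean(stable_norm(u[:, None, :] - v[None, :, :]))
    within = tf.reduce_sum(stable_norm(v[:, None, :] - v[None, :, :])) / (2.0 * m * (m - 1.0))
    return cross - within

@tf.function
def train_step(generator, optimizer, u_batch, m_samples=200, k_noise=6):
    """One gradient step on a batch of data pseudo-observations u_batch.
    k_noise must match the input dimension of the generator."""
    with tf.GradientTape() as tape:
        z = tf.random.normal((m_samples, k_noise))       # noise Z ~ N(0, I)
        y = generator(z)                                  # latent samples, Eq. (1)
        v = soft_pseudo_obs(y)                            # unit-space samples, Eq. (8)
        loss = energy_loss(u_batch, v)
    grads = tape.gradient(loss, generator.trainable_variables)
    optimizer.apply_gradients(zip(grads, generator.trainable_variables))
    return loss
```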
**Evaluation** We evaluate the models by comparing the distance between the learned and the true copula. More specifically, we use the integrated squared error (ISE) in unit space, i.e.,

$$\int_{[0,1]^D} \big(F_U(w) - F_V(w)\big)^2 \, dw \;\approx\; \frac{1}{N} \sum_{n=1}^{N} \big(F_U(u^n) - F_V(u^n)\big)^2, \qquad (9)$$

where $F_U$ is the joint cdf of $U$, the true copula, and $F_V$ the joint cdf of the learned model. Since analytical integration is not possible, we approximate the integral with an empirical sum and use the data vectors $u^1, \ldots, u^N$ for this purpose. The copula function $F_U$ is not known for real data. Instead, we use the empirical cdf of the unit space data vectors in (9), i.e., for $w \in [0, 1]^D$ we set

$$\hat{F}_U(w) = \frac{1}{N} \sum_{n=1}^{N} \prod_{d=1}^{D} 1[u_d^n \le w_d]. \qquad (10)$$

Due to the curse of dimensionality, this approximation of the joint cdf will have a coarse structure in high dimensions. This is why we only use the values at the data vectors $u^1, \ldots, u^N$ in (9). For parametric copula models the distribution function $F_V$ is analytic. For the non-parametric models, including IGC, we again resort to approximation. We draw $10^5$ samples from the copula distribution model and use their empirical cdf in unit space in (9). We found that this procedure provides reliable and stable estimates for the underlying copula in dimensions $D \le 5$.

### 4.1 Synthetic data

**Learning bivariate parametric copulas** We first show that IGC models are able to emulate bivariate parametric copulas. We compare our IGC approach to two non-parametric copula density estimation techniques, kernel density estimation with beta kernels (BETA) [6] and the transformation local likelihood estimator (TLL) [12]. These are state-of-the-art non-parametric estimators for pair-copulas [30]. The compact support of the beta distribution on $[0, 1]$ is convenient for copula density estimation. The idea of the TLL approach is to first transform the data using the inverse normal cdf such that the data is supported on $\mathbb{R}^2$. Then the density is estimated via local regression using linear (TLL1) or cubic polynomials (TLL2) on a fixed grid of points. Finally, the estimated density is transformed back to unit space. Additionally, we report the performance of a parametric model. In this approach a copula family is selected from a set of copulas based on the Bayesian information criterion (BIC). This set comprises the Independence, Gaussian, Student-t, Clayton, Gumbel, Frank, Joe, BB1, and BB7 copulas. Parameters are then estimated via maximum likelihood.

We run experiments for data sets generated from a Student-t, a Gumbel, and a Clayton copula as well as for the copula resulting from a two-component Gaussian mixture distribution as shown in Figure 2. For each of these, we sampled a random parameter set and then generated 1000 training samples from the resulting model. We repeated each experiment 25 times. Appendix B contains the details on the considered parameter ranges and the exact sampling procedure. Figure 2 shows the data and the model with intermediate steps for one instance of the Gaussian mixture case. The fitted copula matches the data distribution in unit space very well.

Figure 3: Box plot of the integrated squared error (lower is better) for 25 runs of fitting bivariate copulas to training data from a Student-t, Gumbel, Clayton, and Gaussian mixture copula.

Figure 3 presents aggregated ISE results for the different setups. The IGC model shows performance comparable to TLL1 and TLL2 for the Student-t, Clayton, and Gumbel copulas, and better performance than the BETA approach. For the more complex Gaussian mixture test case, IGC shows superior performance and a much lower variance than all baseline methods. Notably, the parametric approach with BIC-based model selection results in a large variance in accuracy for the Clayton and Gumbel data, most likely because the wrong copula family is selected in some cases.
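The ISE values reported in Figure 3 and in the following experiments are computed from empirical cdfs as in (9)-(10). A small NumPy sketch of this computation (our illustration; for large N the pairwise comparisons would be evaluated in chunks to limit memory use):

```python
import numpy as np

def empirical_joint_cdf(samples, points):
    """Eq. (10): empirical joint cdf of `samples` (N, D) evaluated at `points` (P, D)."""
    below = np.all(samples[:, None, :] <= points[None, :, :], axis=-1)  # shape (N, P)
    return below.mean(axis=0)

def ise(u_data, v_model):
    """Eq. (9): both cdfs are evaluated at the data vectors u_data."""
    f_u = empirical_joint_cdf(u_data, u_data)    # stands in for the (unknown) true copula F_U
    f_v = empirical_joint_cdf(v_model, u_data)   # cdf of the fitted model, from its samples
    return np.mean((f_u - f_v) ** 2)
```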
**Learning multivariate vine copulas** We now turn to the problem of estimating copulas in dimensions $D > 2$. To this end, we conduct a similar experiment as before, using 5-dimensional vine copulas as the data generating distribution. For each of 25 repetitions, we first sample a random tree structure using pyvinecopulib's RVineStructure.simulate() method. Then, we randomly assign parametric pair-copulas with random parameters to each edge in the tree. The considered families of bivariate copulas are Independence, Gaussian, Student-t, Clayton, Gumbel, Frank, Joe, BB1, and BB7. Further details are available in Appendix B. We generate 5000 samples from the resulting vine copula model and use these to train our IGC model as well as the benchmark models. As benchmarks, we use a vine copula model with TLL2 pair-copulas only and a vine copula model which can select all parametric copula families named above as well as the TLL2. Additionally, we estimate a parametric Gaussian copula. The parameters are estimated via maximum likelihood and the copula families are selected using the BIC. The selection of the vine structure is based on the Dissmann algorithm [9]. We evaluate all models with the ISE at 10000 data points sampled from the true model.

Figure 4: Box plot of the ISE results for 25 runs of fitting data from a 5D vine copula.

The results of the simulation study are presented in Figure 4. The IGC model has the lowest mean ISE (0.0697), followed by the TLL2-Vine model (0.2093), while the latter has a lower median ISE (0.0339 vs. 0.045). This is the result of some large outliers of the ISE for the vine models that do not occur for the IGC model. These are most likely caused by a poorly selected vine structure. Interestingly, the results thus imply that the data is on average better modeled by the IGC model than by the vine approach, although the latter model class contains the true data generating model. However, recovering the true model does not seem to be an easy task.

### 4.2 Real data

**Exchange rates** Copulas are a popular tool in financial risk management as they can be used to estimate the distribution of multivariate returns over different assets [34] under the assumption of a stationary copula. We consider a data set of size $N = 5844$ that contains 15 years of daily exchange rates between the US-Dollar and the Canadian Dollar, the Euro, the British Pound, the Swiss Franc, and the Japanese Yen. The data was obtained from the R package qrm_data. We preprocess the data to filter out the effects of temporal dependencies. To this end, we fit an AR(1)-GARCH(1,1) [4] process with Student-t innovations to the time series of daily returns. We then obtain the standardized residuals from the AR-GARCH models and transform these observations to the unit space using the empirical cdfs. Figure 5a shows a kernel density estimate for two selected dimensions of the resulting data set in unit space, namely the US-Dollar/Euro and US-Dollar/Pound exchange rates. While being non-trivial and multi-modal, the distribution is strongly concentrated along the diagonal.
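The paper does not state which implementation was used for the AR(1)-GARCH(1,1) filtering step; one possible route in Python is the `arch` package. A sketch under that assumption (the returns DataFrame and all names are hypothetical):

```python
import pandas as pd
from arch import arch_model
from scipy.stats import rankdata

def garch_filter_to_unit_space(returns: pd.DataFrame) -> pd.DataFrame:
    """Fit an AR(1)-GARCH(1,1) model with Student-t innovations to each return series
    and map the standardized residuals to (0, 1] via their empirical cdfs."""
    u = {}
    for col in returns.columns:
        # rescaling returns by 100 is a common convention that helps the optimizer converge
        res = arch_model(returns[col] * 100, mean="AR", lags=1,
                         vol="GARCH", p=1, q=1, dist="t").fit(disp="off")
        std_resid = res.std_resid.dropna()  # standardized residuals of the fitted model
        u[col] = pd.Series(rankdata(std_resid) / len(std_resid), index=std_resid.index)
    return pd.DataFrame(u)
```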
We then estimate a Gaussian copula, a vine copula, a vine copula with only TLL2 pair-copulas, a GMMN as proposed by [16], and our IGC model for this data. The GMMN model has the same loss function, architecture, and hyperparameters as the IGC model but uses a sigmoid activation in the final layer. To ensure a test data set of appropriate size, we use a 5-fold cross-validation scheme where 20% of the data is used for training and 80% for testing. For the IGC and the GMMN model we additionally use five different random initializations of the neural network weights per fold, i.e., we report means and standard deviations over 25 values for the IGC and the GMMN model and over 5 values for the other methods.

The results are presented in Table 1. The IGC model clearly outperforms the Gaussian baseline and also the GMMN model. However, the vine copula models show lower average ISEs. This might be due to the symmetric nature of the data, which can be approximated well by Student-t and TLL pair-copulas.

Table 1: ISE mean and standard deviation for the exchange rate and MAGIC data from 5-fold CV and 5 random NN weight initializations per fold (lower is better).

| Model | exchange rates | MAGIC |
|---|---|---|
| Gauss | 0.9196 ± 0.1024 | 0.2415 ± 0.0167 |
| Vine | 0.2791 ± 0.0290 | 0.0588 ± 0.0055 |
| Vine-TLL2 | 0.2457 ± 0.0383 | 0.0588 ± 0.0055 |
| GMMN | 0.5791 ± 0.2638 | 0.08065 ± 0.0286 |
| IGC (ours) | 0.3829 ± 0.1385 | 0.0345 ± 0.0105 |

**MAGIC Gamma Telescopes data** In order to test our approach on a more complex dependency structure, we consider the MAGIC (Major Atmospheric Gamma-ray Imaging Cherenkov) Telescopes data set available from the UCI repository (https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope). This data was also used for benchmarking non-parametric copula estimation techniques in [32]. We only consider the observations classified as gamma and the 5 variables fLength, fWidth, fConc, fM3Long, and fM3Trans. The size of the data set is $N = 12332$. We use the empirical cdfs to transform the observations to the unit space.

Figure 5: Exemplary bivariate KDE estimates of the copula densities from the real data sets: (a) Exchange rates EUR/USD-GBP/USD, (b) MAGIC fLength-fM3Trans, (c) and (d): Fashion MNIST autoencoder latent space dimensions 1-2 and 2-3.

Figure 5b showcases the dependency structure of the data for two selected dimensions. The observed structures are highly asymmetric. We again fit the data with a Gaussian copula, a vine copula with all available pair-copulas as above, a vine copula with TLL2 pair-copulas, and a GMMN model with a sigmoid output layer [16] as benchmarks and use the same 5-fold CV strategy as before, i.e., in each fold we use 20% as training data and 80% as test data, and report the average ISE. The results are presented in Table 1. The IGC model achieves the lowest average ISE. Again, the GMMN model shows a substantially larger mean ISE with a much larger standard deviation. As only TLL2 pair-copulas were selected by the BIC criterion, the scores for both vine models are identical. The Gaussian copula is clearly not suited for such complex data, as its average ISE is 5 times larger than for the other approaches.

**Copula Autoencoders** [40] introduced the Vine Copula Autoencoder, a generative model which uses vine copulas for ex-post density estimation of the latent space of a trained autoencoder. In a first step, an autoencoder is trained on a data set to learn a low-dimensional representation of the data. After training, the encoder part of the network is used to map the training data to the autoencoder's latent space representation. Next, the univariate marginal distributions are estimated, e.g., by using the empirical cdfs or kernel density estimation, and the compressed data is mapped to unit space.
After fitting a copula model to the observations in unit space, one can sample from the fitted copula model, apply the inverse marginal cdfs, and map the simulated data back to the image space using the decoder network of the autoencoder in order to generate new data samples.

In the following, we present results for the Fashion MNIST [44] data set. We train a convolutional autoencoder on the entire training data of 60000 samples with a latent space of dimension 25. Details on the architecture are found in Appendix B. After training the autoencoder, we estimate the empirical cdfs of the compressed data using the training data. See Figures 5c and 5d for exemplary visualizations of the resulting bivariate data densities in unit space. Notably, these distributions are more complex than the ones from the previous experiments. We fit this data with a Gaussian copula, a vine copula with TLL2 pair-copulas, a GMMN with sigmoid output layer [16], and an IGC model. We also test an independence copula, i.e., we assume independence over the latent space. Additionally, we report results for a standard variational autoencoder (VAE) [21] with the same architecture.

Exemplary samples from all models and from the test set are presented in Figure 6. Sampling with the independence copula, i.e., assuming no dependency structure, leads to images with many artifacts. The Gaussian copula produces better images, but still shows some artifacts. Samples from the vine copula and the IGC model are comparable in quality. They show smaller details and few implausible artifacts. The images generated by the VAE are blurry and show much less variation in pixel intensity than the test data.

Figure 6: Generated image samples from different copula models on top of a pre-trained autoencoder as well as from a standard variational autoencoder: (a) independence copula, (b) Gaussian copula, (c) vine copula, (d) IGC (ours), (f) test data. The latent space has 25 dimensions.

In 25 dimensions it is no longer feasible to compute and compare empirical cdfs as required for ISE scoring. We thus resort to the MMD [15] for the numerical evaluation, since it is commonly used as a simple and robust measure for evaluating image generation models [46]. We generate 10000 images from each model and compare those to the 10000 test set images by computing the MMD with a Gaussian kernel. We use the bandwidths 1.5, 350, and 8 for the latent unit space, the latent data space, and the image space, respectively. The bandwidths were selected based on the median heuristic proposed in [15].

Table 2: MMD scores for the Fashion MNIST test set (lower is better).

| Model | image | latent | copula |
|---|---|---|---|
| Indep | 0.01912 | 0.00722 | 0.00373 |
| Gauss | 0.00619 | 0.00209 | 0.00087 |
| Vine | 0.00674 | 0.00131 | 0.00079 |
| GMMN | 0.00392 | 0.00341 | 0.00073 |
| IGC | 0.00426 | 0.00114 | 0.00069 |
| VAE | 0.01316 | – | – |

The results are found in Table 2. We provide p-values for these results using the test proposed by [5] in Appendix B. The IGC and GMMN models perform better than all other models for the latent unit space distribution as well as for the image space. Interestingly, the GMMN model shows a relatively large MMD value for the latent data space. This could be caused by deviations from marginal uniformity of the learned copula distribution, which alter the marginal distributions in the data space.
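The paper does not spell out the MMD estimator it uses; for orientation, here is a plain NumPy sketch of a standard (biased) squared-MMD estimate with a Gaussian kernel. The kernel parameterization and all names are our own choices, and for 10000-sample comparisons the kernel matrices would be evaluated in blocks.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth):
    """Biased sample estimate of the squared MMD between sample sets x (N, D) and y (M, D)."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())
```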
## 5 Conclusion

We have proposed the first fully non-parametric model framework for copulas in higher dimensions that guarantees uniformity of the marginal distributions. The proposed IGC approach, which is based on an implicit generative step and a differentiable ranking transformation, is structurally simple. Yet, if the generator class is sufficiently complex, any copula dependency structure can be modeled. The model can be implemented easily with standard deep learning frameworks. For various data sets we have shown a modeling performance on par with or above other state-of-the-art approaches, in particular the vine copula approach.

IGC models should be further investigated in various ways. First, it would be straightforward to condition the model on external factors, either by including such values in the input of the generator network or by reparameterization of the noise distribution. This would make it possible to describe context-dependent changes of the dependency structure. Second, many more applications for copulas should be examined, e.g., synthetic tabular data generation [45]. Third, the ranking operation could probably be sped up from $O(M^2)$ to $O(M \log M)$ using ideas from [3]. Finally, the use of adversarial training schemes could also be investigated.

**Acknowledgments and Disclosure of Funding** This work has been performed in the context of the LOEWE center emergenCITY.

## References

[1] K. Aas, C. Czado, A. Frigessi, and H. Bakken. Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2):182–198, 2009.

[2] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[3] M. Blondel, O. Teboul, Q. Berthet, and J. Djolonga. Fast differentiable sorting and ranking. In International Conference on Machine Learning, pages 950–959. PMLR, 2020.

[4] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.

[5] W. Bounliphone, E. Belilovsky, M. Blaschko, I. Antonoglou, and A. Gretton. A test of relative similarity for model selection in generative models. In 4th International Conference on Learning Representations, ICLR 2016, 2016.

[6] A. Charpentier, J.-D. Fermanian, and O. Scaillet. The estimation of copulas: Theory and practice. Copulas: From Theory to Application in Finance, pages 35–64, 2007.

[7] F. Chollet et al. Keras. https://keras.io, 2015.

[8] S. Demarta and A. J. McNeil. The t copula and related copulas. International Statistical Review, 73(1):111–129, 2005.

[9] J. Dissmann, E. C. Brechmann, C. Czado, and D. Kurowicka. Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59:52–69, 2013.

[10] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267, 2015.

[11] G. Elidan. Copula Bayesian networks. In Advances in Neural Information Processing Systems, volume 23, 2010.

[12] G. Geenens, A. Charpentier, D. Paindaveine, et al. Probit transformation for nonparametric kernel estimation of the copula density. Bernoulli, 23(3):1848–1873, 2017.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, 2014.

[14] Google Colaboratory. URL https://colab.research.google.com/.

[15] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

[16] M. Hofert, A. Prasad, and M. Zhu. Quasi-random sampling for multivariate distributions via generative neural networks. Journal of Computational and Graphical Statistics, pages 1–24, Feb 2021.

[17] T. Hong, P. Pinson, S. Fan, H. Zareipour, A. Troccoli, R. J. Hyndman, et al. Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond. International Journal of Forecasting, 32(3):896–913, 2016.

[18] J.-N. Hwang, S.-R. Lay, and A. Lippman. Nonparametric multivariate density estimation: a comparative study. IEEE Transactions on Signal Processing, 42(10):2795–2810, 1994.

[19] H. Joe. Dependence Modeling with Copulas. CRC Press, 2014.

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.

[21] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.

[22] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, pages 33–50, 1978.

[23] N. A. Letizia and A. M. Tonello. Segmented generative networks: Data generation in the uniform probability space. IEEE Transactions on Neural Networks and Learning Systems, 2020.

[24] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727. PMLR, 2015.

[25] C. K. Ling, F. Fang, and J. Z. Kolter. Deep Archimedean copulas. In Advances in Neural Information Processing Systems, volume 33, 2020.

[26] Y. Lu and J. Lu. A universal approximation theorem of deep neural networks for expressing probability distributions. In Advances in Neural Information Processing Systems, pages 3094–3105, 2020.

[27] D. Meyer, T. Nagler, and R. J. Hogan. Copula-based synthetic data generation for machine learning emulators in weather and climate: application to a simple radiation model. Geoscientific Model Development Discussions, pages 1–21, 2021.

[28] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[29] A. Möller, A. Lenkoski, and T. L. Thorarinsdottir. Multivariate probabilistic forecasting using ensemble Bayesian model averaging and copulas. Quarterly Journal of the Royal Meteorological Society, 139(673):982–991, 2013.

[30] T. Nagler. kdecopula: An R package for the kernel estimation of bivariate copula densities. Journal of Statistical Software, 84(7):1–22, 2018.

[31] T. Nagler and T. Vatter. pyvinecopulib. URL https://github.com/vinecopulib/pyvinecopulib/.

[32] T. Nagler, C. Schellhase, and C. Czado. Nonparametric estimation of simplified vine copula models: comparison of methods. Dependence Modeling, 5(1):99–120, 2017.

[33] N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399–410. IEEE, 2016.

[34] A. J. Patton. A review of copula models for economic time series. Journal of Multivariate Analysis, 110:4–18, 2012.
[35] T. Qin, T.-Y. Liu, and H. Li. A general approximation framework for direct optimization of information retrieval measures. Information Retrieval, 13(4):375–397, 2010.

[36] A. E. Raftery, T. Gneiting, F. Balabdaoui, and M. Polakowski. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5):1155–1174, 2005.

[37] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.

[38] M. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.

[39] G. J. Székely, M. L. Rizzo, et al. Testing for equal distributions in high dimension. InterStat, 5(16.10):1249–1272, 2004.

[40] N. Tagasovska, D. Ackerer, and T. Vatter. Copulas as high-dimensional generative models: Vine copula autoencoders. In Advances in Neural Information Processing Systems, volume 32, 2019.

[41] J. Tastu, P. Pinson, and H. Madsen. Space-time trajectories of wind power generation: Parametrized precision matrices under a Gaussian copula approach. In Modeling and Stochastic Learning for Forecasting in High Dimensions, pages 267–296. Springer, 2015.

[42] A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[43] H. Wang, L. Yu, Z. Cao, and S. Ermon. Multi-agent imitation learning with copulas. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 139–156. Springer, 2021.

[44] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[45] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems, 2019.

[46] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger. An empirical study on evaluation metrics of generative adversarial networks. arXiv preprint arXiv:1806.07755, 2018.