# Ornstein Auto-Encoders

Youngwon Choi and Joong-Ho Won
Department of Statistics, Seoul National University, Republic of Korea
muha@snu.ac.kr, wonj@stats.snu.ac.kr

We propose the Ornstein auto-encoder (OAE), a representation learning model for correlated data. In many interesting applications, data have nested structures. Examples include the VGGFace and MNIST datasets. We view such data as consisting of i.i.d. copies of a stationary random process, and seek a latent space representation of the observed sequences. This viewpoint necessitates a distance measure between two random processes. We propose to use Ornstein's d-bar distance, a process extension of Wasserstein's distance. We first show that the theorem by Bousquet et al. (2017) for Wasserstein auto-encoders extends to stationary random processes. This result, however, requires both encoder and decoder to map an entire sequence to another. We then show that, when exchangeability within a process is assumed, which is valid for VGGFace and MNIST, these maps reduce to univariate ones, resulting in a much simpler, tractable optimization problem. Our experiments show that OAEs successfully separate individual sequences in the latent space, and can generate new variations of unknown, as well as known, identities. The latter has not been possible with other existing methods.

1 Introduction

Most machine learning algorithms implicitly or explicitly assume that samples in the training and test datasets are drawn independently and identically from an unknown data distribution. However, this i.i.d. assumption is violated in many real-world tasks with nested data structures, i.e., when data were collected from grouped observational units. As a concrete example, consider the VGGFace2 dataset [Cao et al., 2018], an expansion of the famous VGGFace dataset [Parkhi et al., 2015]. VGGFace2 is a large-scale face dataset containing 3.31 million images of 9131 identities. For each person, it contains 362.6 images on average, with a minimum of 30 and a maximum of 843. These portraits are highly correlated within a single person, violating the i.i.d. assumption. A similar issue arises in classification. The images of the MNIST dataset show strong correlations within a digit.

When the categories are fixed, as in the MNIST data, a popular approach is to model the data distribution with a finite mixture model or class-conditional models. However, if the number of categories (classes) is too large or not even fixed, the use of these models may not be desirable. For example, in the VGGFace2 data the number of classes is 9,131. Since the identities are randomly sampled, any model trained with this dataset must deal with an increasing number of classes for generalizability. Even with fixed categories, class imbalance is a big problem in learning with these models.

Random effects models [Diggle et al., 2002; Fitzmaurice et al., 2012] provide a flexible framework for handling both regimes. Applying those models is a standard approach in statistics when there are correlations among observational units within a group or subject. As an example, consider the random intercept model (with no slope):

$$y_{ij} = \mu_0 + b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_0^2), \quad b_i \overset{\text{iid}}{\sim} N(0, \tau_0^2), \quad b_i \perp \varepsilon_{ij},$$

where $y_{ij}$ represents the $j$th observation within subject $i$. Note that, due to the presence of the random intercept $b_i$, the sequence of observations $\{y_{ij}\}_{j=1}^{n_i}$ within subject $i$ is correlated, with correlation coefficient $\tau_0^2 / (\sigma_0^2 + \tau_0^2)$.
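As a quick numerical illustration of this within-subject correlation, the following minimal NumPy sketch (parameter values and sample sizes are illustrative, not from the paper) simulates the random intercept model and compares the empirical correlation between two observations of the same subject with the theoretical value $\tau_0^2/(\sigma_0^2 + \tau_0^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_obs = 2000, 20       # illustrative sizes, not from the paper
mu0, sigma0, tau0 = 1.0, 1.0, 2.0  # illustrative parameter values

# y_ij = mu0 + b_i + eps_ij, with b_i ~ N(0, tau0^2), eps_ij ~ N(0, sigma0^2)
b = rng.normal(0.0, tau0, size=(n_subjects, 1))
eps = rng.normal(0.0, sigma0, size=(n_subjects, n_obs))
y = mu0 + b + eps

# empirical correlation between two observations from the same subject
empirical = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
theoretical = tau0**2 / (sigma0**2 + tau0**2)
print(f"empirical within-subject correlation:   {empirical:.3f}")
print(f"theoretical tau0^2/(sigma0^2 + tau0^2): {theoretical:.3f}")
```

With these illustrative values the theoretical correlation is 0.8, and the empirical estimate should land close to it.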
This is the simplest example of a linear mixed effects model. In machine learning, Dundar et al. [2007] showed that classifiers based on a linear mixed effects model allow us to explicitly model the dependence in non-i.i.d. data. Differing numbers of samples between groups are naturally handled.

The reader may have noticed that the random intercept model defines an infinite exchangeable random process. Let $Y = (\ldots, Y_{-1}, Y_0, Y_1, \ldots)$ be a (doubly) infinite sequence whose coordinates $Y_j$ are conditionally independent given $B$. If $B \sim N(0, \tau_0^2)$ and $Y_j \mid \{B = b\} \overset{\text{iid}}{\sim} N(\mu_0 + b, \sigma_0^2)$, then $\{y_{ij}\}_j$ is a realization of the $i$th i.i.d. copy of the random process $Y$. On the other hand, both the VGGFace2 and MNIST data consist of exchangeable sequences nested within subjects or classes: the order of portraits of any given person does not affect any conceivable learning task.

The goal of this paper is to bring the nested data structure that arises in various applications down to generative latent variable modeling. If the latent variables share the nested structure of the observed variables, then the generative power of the latent space representation is likely to increase. As discussed above, this nesting often translates to i.i.d. observations of a correlated random process. Our main contributions are as follows:

- We introduce Ornstein's d-bar distance to the community, an optimal transport distance between random processes.
- We show that the theorem by Bousquet et al. [2017] extends to stationary processes and propose the Ornstein auto-encoder (OAE), which can be thought of as a stationary random process version of the Wasserstein auto-encoder (WAE) [Tolstikhin et al., 2018].
- When exchangeability is assumed, we show that the optimization problem for OAE greatly simplifies, reducing to almost the same problem as that of WAE and enabling a simple algorithm.
- We empirically show that the generative power of OAE surpasses the state of the art. Importantly, OAE is robust to data imbalance and can generate new variations of unknown, out-of-training-set subjects, which has been impossible with other methods.
- We demonstrate that OAE provides disentangled representations, i.e., latent variables are well-clustered by subjects. This capability has potential applications in classification and recognition.

In the next section, we provide the necessary background. Section 3 introduces Ornstein's d-bar distance and develops OAE for stationary and exchangeable processes, respectively. In Section 4 we demonstrate the power of OAE using the VGGFace2 and MNIST data. We conclude the paper in Section 5. Proofs, additional examples, and details of implementation are given in the Online Supplement available at https://tinyurl.com/y5x6ufuj.

2 Preliminaries

2.1 Notation

The space of observable variables is denoted by $\mathcal{X}$. The space of latent variables is denoted by $\mathcal{Z}$. We assume that both $\mathcal{X}$ and $\mathcal{Z}$ are standard measurable spaces so that conditional probability distributions are well-defined. Unless necessary, the associated event spaces $\mathcal{B}_{\mathcal{X}}$ and $\mathcal{B}_{\mathcal{Z}}$ are suppressed. We also assume that $\mathcal{X}$ is a complete, separable metric space equipped with metric $d$. Its Cartesian product space is denoted by $\mathcal{X}^n$ for $n = 1, 2, \ldots$; $n = \infty$ is allowed. Both random variables and random processes are represented by capital letters (e.g., $X$), and their realizations by lower case letters (e.g., $x$). A random process is always two-sided.
An i.i.d. copy of $X$ for subject $i$ is denoted by a superscript, $X^i$. If $X^i$ is a random process, its coordinate random variables are written using a subscript, e.g., $X^i_j$. The probability distribution of a random variable or process $X$ is denoted by $P_X$. The joint distribution of $X$ and $Y$ is denoted by $P_{XY}$; conditional distributions are written as $P_{X|Y}$. A finite-length random vector induced by random process $X$ is written as $X_{1:n}$, etc.

2.2 Generative Latent Variable Models

Generative latent variable models (LVMs) are a family of parametric models trained to transform samples drawn from an unknown distribution $P_X$ on $\mathcal{X}$ into latent variables in a lower dimensional space $\mathcal{Z}$. In many real-world data, especially images, we cannot estimate the density of $P_X$, which may not even exist because the distribution is supported on low dimensional manifolds. To overcome this problem, LVMs define a latent random variable $Z \in \mathcal{Z}$ with a prior distribution $P_Z$, such as the standard Gaussian, and learn a decoder $Q_{\hat{X}|Z}$, i.e., a conditional distribution of the reconstructed input $\hat{X} \in \mathcal{X}$ given $Z$. The marginal distribution of the reconstruction $\hat{X}$ is given by $P_{\hat{X}} = \int Q_{\hat{X}|Z}\, dP_Z$, and we learn the decoder $Q_{\hat{X}|Z}$ by solving the optimization problem

$$\inf_{Q_{\hat{X}|Z}} D(P_X, P_{\hat{X}}) \tag{1}$$

for some distance measure $D$ between the data and reconstruction distributions, with a possible addition of a regularization term. Different choices of $D$ and regularizer yield different models. For example, the Wasserstein auto-encoder (WAE) [Tolstikhin et al., 2018] utilizes not the $p$-Wasserstein distance between $X$ and $\hat{X}$ on the metric space $(\mathcal{X}, d)$ [Villani, 2008],

$$d_p(P_X, P_{\hat{X}}) \triangleq \Big[ \inf_{\pi \in \mathcal{P}(P_X, P_{\hat{X}})} \mathbb{E}_\pi\, d^p(X, \hat{X}) \Big]^{\min(1, 1/p)}, \tag{2}$$

but the $p$th power of $d_p$:

$$D_{\mathrm{WAE}}(P_X, P_{\hat{X}}) \triangleq \inf_{\pi \in \mathcal{P}(P_X, P_{\hat{X}})} \mathbb{E}_\pi\, d^p(X, \hat{X}) \tag{3}$$

when $p \geq 1$; here $\mathcal{P}(P_X, P_{\hat{X}})$ is the set of joint distributions on $(X, \hat{X})$ whose marginals on $X$ and $\hat{X}$ are $P_X$ and $P_{\hat{X}}$, respectively. Tolstikhin et al. [2018] use Theorem 1 of Bousquet et al. [2017] to reparametrize (3) in terms of a probabilistic encoder $Q_{Z|X}$. Write $Q_Z = \int Q_{Z|X}\, dP_X$. Then we have

$$D_{\mathrm{WAE}}(P_X, P_{\hat{X}}) = \inf_{Q_{Z|X}: Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q_{Z|X}}\, d^p(X, g(Z)) \tag{4}$$

when the decoder is deterministic, i.e., $Q_{\hat{X}|Z}(\cdot|z)$ is a Dirac measure on $g(z)$ for all $z \in \mathcal{Z}$. In practice, the resulting constrained optimization problem is relaxed to an unconstrained one:

$$\inf_g \inf_{Q_{Z|X}} \mathbb{E}_{P_X} \mathbb{E}_{Q_{Z|X}}\, d^p(X, g(Z)) + \lambda D_Z(P_Z, Q_Z) \tag{5}$$

for some divergence measure $D_Z$ and $\lambda > 0$. Relaxation (5) of WAE is equivalent to the adversarial auto-encoder (AAE) [Makhzani et al., 2016] if $p = 2$, $\mathcal{X}$ is Euclidean, $d(x, y) = \|x - y\|$ is the standard Euclidean norm, and $D_Z$ is

$$D_{\mathrm{GAN}}(P_X, P_{\hat{X}}) \triangleq \sup_{f \in \mathcal{F}} \mathbb{E}_{P_X} \log f(X) + \mathbb{E}_{P_Z} \log(1 - f(g(Z))),$$

where $f: \mathcal{X} \to (0, 1)$ is the discriminator [Bousquet et al., 2017]. In addition, the conditional AAE (c-AAE) minimizes a class-conditional version of the AAE objective.

3 The Ornstein Auto-Encoder (OAE)

3.1 From Ornstein's d-bar Distance to OAE

In order to extend WAE to random processes, we need a distance metric between two random sequences $X = (\ldots, X_{-1}, X_0, X_1, \ldots)$ and $Y = (\ldots, Y_{-1}, Y_0, Y_1, \ldots)$, both defined in $\mathcal{X}^\infty$. Let

$$\rho_n(x_{1:n}, y_{1:n}) \triangleq \sum_{j=1}^{n} d^p(x_j, y_j),$$

where $d$ is a metric on $\mathcal{X}$ and $p \geq 0$; $d^0$ denotes the 0-1 loss. A possible distance measure between the finite-dimensional distributions is

$$\bar{\rho}_n(P_{X_{1:n}}, P_{Y_{1:n}}) \triangleq \inf_{\pi \in \mathcal{P}(P_{X_{1:n}}, P_{Y_{1:n}})} \mathbb{E}_\pi\, \rho_n(X_{1:n}, Y_{1:n}).$$

Then a process distance between $P_X$ and $P_Y$ is defined by

$$\bar{\rho}(P_X, P_Y) \triangleq \sup_n \frac{1}{n} \bar{\rho}_n(P_{X_{1:n}}, P_{Y_{1:n}}).$$
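For intuition only: when $n = 1$, $p = 1$, and $\mathcal{X} = \mathbb{R}$ with $d(x, y) = |x - y|$, the quantity $\bar{\rho}_1(P_{X_0}, P_{Y_0})$ is the ordinary 1-Wasserstein distance between the coordinate marginals. The toy sketch below (an illustration under these simplifying assumptions, not part of the paper) evaluates it for two stationary sequences with SciPy.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def ar1(mean, phi=0.5, n=20_000):
    """Toy, approximately stationary AR(1) sequence with the given marginal mean."""
    x = np.empty(n)
    x[0] = mean
    noise = rng.normal(0.0, 1.0, size=n)
    for t in range(1, n):
        x[t] = mean + phi * (x[t - 1] - mean) + noise[t]
    return x

x = ar1(mean=0.0)
y = ar1(mean=1.0)

# rho_bar_1 with p = 1 on the real line: the 1-Wasserstein distance
# between the empirical coordinate marginals of X_0 and Y_0.
print("empirical rho_bar_1:", wasserstein_distance(x, y))
```

Since the two marginals here are translates of one another, the printed value should be close to 1, the difference of the marginal means.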
It is known that $\bar{d}_p(P_X, P_Y) \triangleq \bar{\rho}^{\min(1, 1/p)}(P_X, P_Y)$ is a metric on the space of all possible stationary processes in $\mathcal{X}$, so $\bar{d}_p$ is a true distance. The $\bar{d}_p$ or d-bar distance for random processes was introduced by Ornstein [1973] for the special case of $p = 0$ and discrete $\mathcal{X}$, and was extended to $p \geq 0$ with more general $\mathcal{X}$ by Gray, Neuhoff, and Shields [1975]. Furthermore, if $P_X$ and $P_Y$ are stationary, then we have

$$\bar{\rho}(P_X, P_Y) = \inf_{\pi \in \mathcal{P}_s(P_X, P_Y)} \mathbb{E}_\pi\, d^p(X_0, Y_0), \tag{6}$$

where $\mathcal{P}_s(P_X, P_Y)$ is the set of all jointly stationary distributions on $(X, Y) \in \mathcal{X}^\infty \times \mathcal{X}^\infty$ having $P_X$ and $P_Y$ as marginals [Gray et al., 1975]. Owing to the resemblance of equation (6) to the finite-dimensional Wasserstein metric (2), a reparametrization similar to (4) can be made:

Theorem 1. Assume process distributions $P_X$ on $\mathcal{X}^\infty$ and $P_Z$ on $\mathcal{Z}^\infty$ are both stationary. Also assume that $Q_{\hat{X}|Z}(\cdot|z)$ is the Dirac measure on $g(z)$ for all $z$, i.e., $\hat{X} = g(Z)$ with probability 1 for $g: \mathcal{Z}^\infty \to \mathcal{X}^\infty$ that maps a stationary sequence to a stationary sequence. Then

$$\bar{\rho}(P_X, P_{\hat{X}}) = \inf_{Q_{Z|X} \in \mathcal{Q}_{Z|X}} \mathbb{E}_{P_X} \mathbb{E}_{Q_{Z|X}}\, d^p(X_0, g(Z)_0),$$

where $P_{\hat{X}} = \int Q_{\hat{X}|Z}\, dP_Z$ and $\mathcal{Q}_{Z|X}$ is the set of encoders $Q_{Z|X}$ such that $Q_{Z|X} P_X$ is jointly stationary in $(X, Z)$ and $\int Q_{Z|X}\, dP_X = P_Z$.

By defining

$$D_{\mathrm{OAE}}(P_X, P_{\hat{X}}) \triangleq \inf_{Q_{Z|X} \in \mathcal{Q}_{Z|X}} \mathbb{E}_{P_X} \mathbb{E}_{Q_{Z|X}}\, d^p(X_0, g(Z)_0)$$

and minimizing it over $g$, we obtain the OAE model. Similar to relaxation (5), we may solve the unconstrained problem

$$\inf_g \inf_{Q_{Z|X}} \mathbb{E}_{P_X} \mathbb{E}_{Q_{Z|X}}\, d^p(X_0, g(Z)_0) + \lambda D_Z(P_Z, Q_Z), \tag{7}$$

where $Q_Z = \int Q_{Z|X}\, dP_X$. The additional requirement that $Q_{Z|X} P_X$ be jointly stationary can be satisfied by restricting $Q_{Z|X}$ to be stationary (the latter implies the former). Despite the apparent similarity to WAE (5), problem (7) has two practical issues. First, the decoder $g$ for OAE needs to map an infinite sequence to another infinite sequence. Learning such a map with infinite memory may face computational challenges. Second, since $P_Z$ and $Q_Z$ are both process distributions, computing the divergence $D_Z$ may also run into trouble.

3.2 OAE for Exchangeable Data

If we can assume that the pair process $\{(X_j, Y_j)\}$ is exchangeable, then computation of $\bar{\rho}(P_X, P_Y)$ reduces to that of WAE (4), a great simplification:

Theorem 2. Assume the pair process $\{(X_j, Y_j)\}$ in $\mathcal{X} \times \mathcal{X}$ is exchangeable. Let $P_X$ and $P_Y$ denote its marginal distributions on $\{X_j\}$ and $\{Y_j\}$, respectively. Then

$$\bar{\rho}(P_X, P_Y) = \bar{\rho}_1(P_{X_0}, P_{Y_0}) = \inf_{\pi \in \mathcal{P}(P_{X_0}, P_{Y_0})} \mathbb{E}_\pi\, d^p(X_0, Y_0).$$

Exchangeability of the pair process is valid for our applications, because the $j$th observation $X^i_j$ of subject $i$ and its reconstruction $\hat{X}^i_j$ must be exchangeable with the $k$th observation-reconstruction pair $(X^i_k, \hat{X}^i_k)$ of the same subject. Theorem 2 ensures an alternative parametrization (cf. (4)),

$$\bar{\rho}(P_X, P_Y) = \inf_{Q_{Z_0|X_0}: Q_{Z_0} = P_{Z_0}} \mathbb{E}_{P_{X_0}} \mathbb{E}_{Q_{Z_0|X_0}}\, d^p(X_0, g(Z_0)),$$

and an optimization problem of the form (5). Here the decoder $g$ only takes a single coordinate of the latent process $Z$ as input and outputs a single coordinate of $Y$.

We explicitly model exchangeability in the latent space by introducing a random variable $B$ and conditioning $Z_j$ on $B$: $P_{Z_{1:n}} = \int \prod_{j=1}^{n} P_{Z_0|B}\, dP_B$ for all $n$. The probabilistic encoder is a pair $(Q_{Z_0|B,X_0}, Q_{B|X_0})$. Constraining $\int Q_{Z_0|B,X_0}\, dP_{X_0} = P_{Z_0|B}$ and $\int Q_{B|X_0}\, dP_X = P_B$ ensures $Q_Z = P_Z$. A relaxation like (5) yields

$$\inf_g \inf_{Q_{Z_0|B,X_0}} \inf_{Q_{B|X_0}} \mathbb{E}_{P_{X_0}} \mathbb{E}_{Q_{Z_0|B,X_0}} \mathbb{E}_{Q_{B|X_0}}\, d^p(X_0, g(Z_0)) + \lambda_1 D_{Z_0|B}(P_{Z_0|B}, Q_{Z_0|B}) + \lambda_2 D_B(P_B, Q_B), \tag{8}$$

where $Q_{Z_0|B} = \int Q_{Z_0|B,X_0}\, dP_{X_0}$ and $Q_B = \int Q_{B|X_0}\, dP_X$, for appropriate choices of divergence measures $D_{Z_0|B}$ and $D_B$.
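To make the structure of objective (8) concrete before fixing particular divergences, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the averaging of per-observation $B$-encodings follows Algorithm 1 below, but the layer sizes, the reparameterized Gaussian sampling, and the callable divergence placeholders `div_z` and `div_b` are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class ExchangeableOAE(nn.Module):
    """Structural sketch of objective (8): reconstruction + lam1 * D_{Z0|B} + lam2 * D_B."""

    def __init__(self, dx, dz, hidden=128):
        super().__init__()
        # Q_{B|X_0}: per-observation encoding of the subject-level intercept B
        self.enc_b = nn.Sequential(nn.Linear(dx, hidden), nn.ReLU(), nn.Linear(hidden, dz))
        # Q_{Z_0|B,X_0}: Gaussian encoder (mean and log-variance) for the per-observation latent
        self.enc_mu = nn.Sequential(nn.Linear(dx + dz, hidden), nn.ReLU(), nn.Linear(hidden, dz))
        self.enc_lv = nn.Sequential(nn.Linear(dx + dz, hidden), nn.ReLU(), nn.Linear(hidden, dz))
        # decoder g: one latent coordinate in, one observed coordinate out
        self.dec = nn.Sequential(nn.Linear(dz, hidden), nn.ReLU(), nn.Linear(hidden, dx))

    def loss(self, x, b_prior, z_prior, lam1, lam2, div_z, div_b):
        """x: (m, dx) observations of one subject; b_prior, z_prior: samples from P_B and
        P_{Z0|B}; div_z, div_b: caller-supplied sample-based divergence estimators."""
        b_tilde = self.enc_b(x).mean(dim=0, keepdim=True)           # aggregate to one b per subject
        xb = torch.cat([x, b_tilde.expand(x.shape[0], -1)], dim=1)
        mu, lv = self.enc_mu(xb), self.enc_lv(xb)
        z_tilde = mu + torch.exp(0.5 * lv) * torch.randn_like(mu)   # sample from Q_{Z0|B,X0}
        recon = ((x - self.dec(z_tilde)) ** 2).sum(dim=1).mean()    # d^p(X0, g(Z0)) with p = 2
        return recon + lam1 * div_z(z_prior, z_tilde) + lam2 * div_b(b_prior, b_tilde)
```

Here `div_z` and `div_b` stand in for $D_{Z_0|B}$ and $D_B$; concrete choices for them (a GAN-type divergence and an MMD) are made next.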
If we use $D_{Z_0|B} = D_{\mathrm{GAN}}$ and $D_B = D_{\mathrm{MMD},\kappa}$, where

$$D_{\mathrm{MMD},\kappa}(P_B, Q_B) = \big\| \mathbb{E}_{P_B} \kappa(\cdot, B) - \mathbb{E}_{Q_B} \kappa(\cdot, B) \big\|_{\mathcal{H}}^2$$

is the maximum mean discrepancy (MMD) [Gretton et al., 2012] for a positive definite kernel $\kappa: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ that induces a reproducing kernel Hilbert space $\mathcal{H}$ equipped with the norm $\|\cdot\|_{\mathcal{H}}$, then we obtain the training procedure described in Algorithm 1, based on sample estimates of the terms in (8).

Algorithm 1: Ornstein Auto-Encoder for Exchangeable Data

Input: Exchangeable sequences $(x_{i1}, \ldots, x_{in_i})$ for $i = 1, \ldots, L$
Output: Encoder pair $(Q_{Z|B,X_0}, Q_{B|X_0})$ and decoder $g$
Require: Latent variable distributions $P_B$, $P_{Z_0|B}$, regularization coefficients $\lambda_1$, $\lambda_2$, positive definite kernel $\kappa$
1: Initialize the parameters of $(Q_{Z|B,X_0}, Q_{B|X_0})$, $g$, and the discriminator $f$
2: while $Q_{Z|B,X_0}$, $Q_{B|X_0}$, $f$, $g$ not converged do
3:   Sample subjects $i = 1, \ldots, n$ and a sequence $(x_{i1}, \ldots, x_{im_i})$ for each subject $i$ from the training set
4:   Sample $b_i$ from $P_B$ for $i = 1, \ldots, n$
5:   Sample $(z_{i1}, \ldots, z_{im_i})$ from $P_{Z_0|B}$ given $b_i$ for $i = 1, \ldots, n$
6:   Sample $\tilde{b}_i$ from $Q_{B|X_0}$ given $(x_{i1}, \ldots, x_{im_i})$ for $i = 1, \ldots, n$
7:   Sample $(\tilde{z}_{i1}, \ldots, \tilde{z}_{im_i})$ from $Q_{Z|B,X_0}$ given $\tilde{b}_i$ and $(x_{i1}, \ldots, x_{im_i})$ for $i = 1, \ldots, n$
8:   Update $Q_{Z|B,X_0}$, $Q_{B|X_0}$, and $g$ by descending
$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} d^p(x_{ij}, g(\tilde{z}_{ij})) - \frac{\lambda_1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} \log f(\tilde{z}_{ij}) + \frac{\lambda_2}{n(n-1)} \Big[ \sum_{i \neq l} \kappa(b_i, b_l) + \sum_{i \neq l} \kappa(\tilde{b}_i, \tilde{b}_l) \Big] - \frac{2\lambda_2}{n^2} \sum_{i,l} \kappa(b_i, \tilde{b}_l)$$
9:   Update $f$ by ascending
$$\frac{\lambda_1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} \big[ \log f(z_{ij}) + \log(1 - f(\tilde{z}_{ij})) \big]$$
10: end while

Lines 6 and 7 of Algorithm 1 need some explanation. Since the encoder $Q_{B|X_0}$ takes only a single coordinate as its input, it yields $\tilde{b}_{ij} \sim Q_{B|X_0}(\cdot|x_{ij})$ for each $j = 1, \ldots, m_i$. To obtain a single sample, we aggregate the $\tilde{b}_{ij}$'s as $\tilde{b}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} \tilde{b}_{ij}$, as used in Line 6. In Line 7, we sample $\tilde{z}_{ij}$ independently from $Q_{Z|B,X_0}(\cdot|\tilde{b}_i, x_{ij})$ given this $\tilde{b}_i$ and the data $x_{ij}$, for $j = 1, \ldots, m_i$.

3.3 Generating Variations of Unknown Subjects

A trained OAE can be used to generate a sequence of variations for a new subject outside of the training examples. Suppose one or a few inputs $(x^{\mathrm{new}}_1, \ldots, x^{\mathrm{new}}_{m_{\mathrm{new}}})$ from a new subject are given to the OAE. We sample $\tilde{b}^{\mathrm{new}}_j$ from $Q_{B|X_0}(\cdot|x^{\mathrm{new}}_j)$ for $j = 1, 2, \ldots$, compute $\tilde{b}^{\mathrm{new}} = \frac{1}{m_{\mathrm{new}}} \sum_{j=1}^{m_{\mathrm{new}}} \tilde{b}^{\mathrm{new}}_j$, and sample $z^{\mathrm{new}}_j$ from $P_{Z_0|B}(\cdot|\tilde{b}^{\mathrm{new}})$. New variations $(\hat{x}^{\mathrm{new}}_1, \hat{x}^{\mathrm{new}}_2, \ldots)$ are then obtained by passing $(z^{\mathrm{new}}_1, z^{\mathrm{new}}_2, \ldots)$ through the trained decoder $g$. Finer control over the variations is possible if further assumptions on the encoder are made; see Section 4. Generating new variations of a known, in-training subject can be conducted in the same fashion.

Note that generating images from an unknown subject is impossible for the existing conditional LVMs, e.g., c-AAE, because they require a fixed number of conditional distributions. When data imbalance is present, OAE has an advantage over conditional LVMs because the latter have to train all the conditional encoders, which is hard for minority groups with small sample sizes. OAE handles this problem by sharing a variance component.

4 Experiments

4.1 Implementation

In all the experiments below, we assumed $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces with dimensions $d_x$ and $d_z$, respectively; the accompanying Euclidean metric $d(x, x') = \|x - x'\|_2$ on $\mathcal{X}$ and $p = 2$ were used. We set the prior distribution $P_Z$ of the latent variable $Z$ as a random intercept model:

$$Z_{ij} \mid \{B_i = b_i\} \overset{\text{iid}}{\sim} N(\mu_0 \mathbf{1} + b_i, \sigma_0^2 I), \quad B_i \overset{\text{iid}}{\sim} N(0, \tau_0^2 I).$$
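As a small illustration tying this prior to lines 4, 5, 6, and 8 of Algorithm 1, the NumPy sketch below samples subject-level intercepts $b_i$ and latent coordinates $z_{ij}$ from the prior, and computes the sample MMD$^2$ between the prior $b_i$'s and a fabricated stand-in for the aggregated encoder outputs $\tilde{b}_i$. The Gaussian RBF kernel and its bandwidth are illustrative assumptions; the paper does not specify $\kappa$ at this point.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, n, m_i = 8, 200, 5                  # illustrative sizes
mu0, sigma0, tau0 = 0.0, 1.0, 10.0      # sigma0^2 = 1, tau0^2 = 100 as in Section 4
gamma = 1.0 / (2 * dz * tau0 ** 2)      # rough RBF bandwidth, not from the paper

def rbf(a, b, gamma=gamma):
    """Gaussian RBF kernel matrix; one possible positive definite choice of kappa."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(b_prior, b_enc):
    """Sample estimate of MMD^2 between P_B and Q_B, as in line 8 of Algorithm 1."""
    k = b_prior.shape[0]
    kpp, kqq, kpq = rbf(b_prior, b_prior), rbf(b_enc, b_enc), rbf(b_prior, b_enc)
    off = 1.0 - np.eye(k)
    return ((kpp * off).sum() + (kqq * off).sum()) / (k * (k - 1)) - 2.0 * kpq.mean()

# Line 4: b_i ~ P_B = N(0, tau0^2 I); Line 5: z_ij | b_i ~ N(mu0*1 + b_i, sigma0^2 I)
b = rng.normal(0.0, tau0, size=(n, dz))
z = mu0 + b[:, None, :] + sigma0 * rng.normal(size=(n, m_i, dz))

# Fabricated stand-in for line 6's aggregated encoder outputs (no trained encoder here)
b_enc = rng.normal(1.0, tau0, size=(n, dz))

print("prior z batch shape:", z.shape)
print("MMD^2(b, b~) estimate:", mmd2(b, b_enc))
```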
The encoder pair $(Q_{Z|B,X_0}, Q_{B|X_0})$ was designed to be another random intercept model:

$$Z_{ij} \mid \{B_i = b_i, X_{ij} = x_{ij}\} \overset{\text{iid}}{\sim} N(\mu(x_{ij}) + b_i, \sigma^2(x_{ij}) I), \quad B_i \mid \{X_{ij} = x_{ij}\} \overset{\text{iid}}{\sim} N(\nu(x_{ij}), \tau^2 I), \tag{9}$$

where the mean functions $\mu: \mathcal{X} \to \mathcal{Z}$ and $\nu: \mathcal{X} \to \mathcal{Z}$ and the variance function $\sigma^2: \mathcal{X} \to \mathbb{R}_{++}$ were parameterized by deep neural networks. The hyperparameter $\tau$ was kept small. Although Gaussian encoders are suboptimal for our optimization problem (8) due to the restricted search space, Rubenstein et al. [2018] showed empirically that such a restriction produces better outcomes when the appropriate number of dimensions for the latent space is not known. The decoder $g$ was also parameterized by deep neural networks.

Interpreting each subject as a class, we compared OAE with c-AAE with conditional Gaussian latent variables, $Z_{ij} \mid \{Y_i = k\} \overset{\text{iid}}{\sim} N(\mu_{0k} \mathbf{1}, \sigma_{0k}^2 I)$, where $C$ is the number of subjects and $\mu_{0k}, \sigma_{0k}^2$ are prespecified for $k = 1, \ldots, C$. Similar to OAE, we used a Gaussian encoder for $Q_{Z|Y,X}$: $Z_i \mid \{X_i = x_i, Y_i = k\} \sim N(\mu_k(x_i), \sigma_k^2(x_i))$, where $\mu_k: \mathcal{X} \to \mathcal{Z}$ and $\sigma_k^2: \mathcal{X} \to \mathbb{R}_{++}$ are parameterized by deep neural networks for $k = 1, \ldots, C$.

For optimization, we used the Adam optimizer [Kingma and Ba, 2014] with $\beta_1 = 0.5$ for the first moment estimate and $\beta_2 = 0.999$ for the second moment estimate. When generating new variations of a given subject from the test dataset, we used one image per subject. For all convolutional layers, we used batch normalization [Ioffe and Szegedy, 2015], padding, and truncated normal initialization.

4.2 A Toy Model

To see whether OAE can learn a known low dimensional distribution embedded in a higher dimension, we generated training samples $Z_{ij} = b_i + \varepsilon_{ij}$ from the two-dimensional latent space for $i = 1, 2, \ldots, 100$ and $j = 1, 2, \ldots, 5000$, with

$$\varepsilon_{ij} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.009 & 0 \\ 0 & 0.007 \end{pmatrix} \right), \quad b_i \sim N\!\left( \begin{pmatrix} 0.2 \\ 0.4 \end{pmatrix}, \begin{pmatrix} 1.018 & 0.12 \\ 0.12 & 0.745 \end{pmatrix} \right),$$

and embedded them into four-dimensional Euclidean space by $X_{ij} = A Z_{ij}$ with

$$A = \begin{pmatrix} 0.027 & 0.171 \\ 0.084 & 0.290 \\ 0.252 & 0.388 \\ 0.248 & 0.371 \end{pmatrix}.$$

For learning the representation, we misspecified the two-dimensional latent variable for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n_i$ as $Z_{ij} \triangleq b_i + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, 0.01 I)$, $b_i \sim N(0, I)$, and trained the OAE with a simple architecture with 4.8k parameters. We used a linear decoder to restrict the generated sample distribution to be normal. The model was trained for 50 epochs with mini-batch size 3000, $\lambda_1 = 10$, $\lambda_2 = 10$, and learning rates of 0.01 for the encoder-decoder and 0.005 for the discriminator. After training, we generated samples through the decoder with $n = 100$ and $n_i = 500$, and measured the error between the generated samples and the true moments. The root-mean-square error (RMSE) of the mean $\mathbb{E}_{b_i}[\mathbb{E}[\hat{X}_{ij} \mid b_i]]$ was 0.0233, and the RMSE of the covariance matrix $\mathbb{E}_{b_i}[\mathrm{cov}[\hat{X}_{ij} \mid b_i]]$ was 0.0001. This result shows that OAE works well on this toy but informative model.

4.3 VGGFace2 Dataset

Recall that in the VGGFace2 dataset the portraits of each individual are highly correlated and exchangeable. The dataset is also highly imbalanced, with the number of portraits per person varying from 30 to 843. The goal of this experiment is to examine the capability of OAE in generating new variations of portraits of both known and unknown subjects in the presence of many subjects (classes) and data imbalance. As emphasized in the previous section, generating images from an unknown subject is impossible with existing (conditional) LVMs, e.g., c-AAE.
For known subjects, we compared the sample quality of OAE with that of c-AAE. For unknown subjects, c-AAE cannot generate samples, so we compared the quality of the generated samples with WAE, which ignores the subject information.

Algorithm Parameters. We chose $d_z = 128$ as the latent space dimension, and used hyperparameters $\mu_0 = 0$, $\sigma_0^2 = 1$, $\tau_0^2 = 100$. The encoder-decoder architecture had 13.6M parameters and the discriminator had 12.8M parameters. We set $\lambda_1 = 10$, $\lambda_2 = 10$ for OAE, and $\lambda = 10$ for WAE and c-AAE. All models were trained for 100 epochs with a constant learning rate of 0.0005 for the encoder and decoder, and 0.001 for the discriminator. We used mini-batches of size 200.

Training. As pre-processing, we cropped the faces and rescaled them to a common size of 64 by 64. We constructed a training set of 146,519 images from 500 randomly chosen subjects. Since the number of subjects far exceeded the mini-batch size and the dataset is highly imbalanced, we used importance sampling to limit both the number of subjects and the maximum number of variations per mini-batch in early training epochs. For data augmentation, we either added white Gaussian noise to, or vertically flipped, randomly chosen images in a mini-batch.

Evaluation Measures. The quality of reconstruction of a given image was measured by the mean squared error (MSE). The quality of generated samples was quantified by the sharpness using the Laplace filter [Rubenstein et al., 2018] and by the Frechet inception distance (FID) between image distributions [Heusel et al., 2017]. Both are commonly used in the LVM literature. For FID, we picked 100 images from the generated samples and the test dataset.

Generating New Portraits of Known Subjects. We constructed a test dataset (Testset 1) with 11,250 images of 49 subjects from the training dataset. We generated 100 new variations for each subject using OAE and c-AAE. Table 1 suggests that OAE could generate quality variations for known identities better than c-AAE.

Generating New Portraits of Unknown Subjects. We constructed another test dataset (Testset 2) with 11,250 images of 49 subjects from 500 randomly chosen subjects not used for training. We generated 100 new variations for each subject from OAE, and 4,900 images from WAE, in which subject identity cannot be used. Table 1 shows that OAE can generate new variations for given but unknown identities with sample quality comparable to WAE, which can only generate random identities. Figure 1 presents some generated variations of unknown subjects.

|         | Known (Testset 1) MSE | Known FID | Known Sharpness | Unknown (Testset 2) MSE | Unknown FID | Unknown Sharpness |
|---------|-----------------------|-----------|-----------------|-------------------------|-------------|-------------------|
| OAE     | 28.551                | 151.994   | $1\times10^{-4}$ | 34.492                 | 156.935     | $1\times10^{-4}$  |
| c-AAE   | 46.020                | 152.077   | $1\times10^{-4}$ | -                      | -           | -                 |
| WAE     | -                     | -         | -                | 33.469                 | 163.612     | $1\times10^{-4}$  |
| Testset | -                     | -         | $4\times10^{-3}$ | -                      | -           | $3\times10^{-3}$  |

Table 1: VGGFace2 evaluation. MSE (lower is better), FID (lower is better), sharpness (similar to the test set is better).

Figure 1: Generated new variations from unknown subjects of VGGFace2. Each row corresponds to a subject. Columns 1 and 2 show randomly chosen test images of the person. Column 3 shows generated images from the estimated random intercept. Columns 4 onward show generated images using common variations.
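For reference, the paper does not spell out the sharpness statistic beyond citing the Laplace filter of [Rubenstein et al., 2018]; one common variant, sketched below under the assumption that sharpness is summarized by the variance of the Laplacian response, is:

```python
import numpy as np
from scipy.ndimage import laplace

def sharpness(image: np.ndarray) -> float:
    """Variance of the Laplacian response of a grayscale image scaled to [0, 1]."""
    return float(laplace(image.astype(np.float64)).var())

# toy usage: a flat (maximally blurry) image scores lower than a noisy one
flat = np.full((64, 64), 0.5)
noisy = np.clip(flat + 0.1 * np.random.default_rng(0).standard_normal((64, 64)), 0, 1)
print(sharpness(flat), sharpness(noisy))   # 0.0 < sharpness of the noisy image
```

Under this convention, blurrier images score lower, which matches the qualitative reading of the sharpness column in Table 1.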
Vector Arithmetic. The random intercept modeling of the encoder (9) offers the additional advantage of enabling vector arithmetic on the portraits. Suppose an image $x^i_0$ of person $i$ is given and we want to generate a variation similar to the $l$th image of another person $k$. If $(z^i_0, b^i)$ and $(z^k_l, b^k)$ are the encodings of $x^i_0$ and $x^k_l$, where $b^i$ is the intercept of person $i$ in the latent space obtained by applying the encoder $Q_{B|X_0}$ to $x^i_0$, then $z^k_l - b^k + b^i$ exchanges the mean of $z^k_l$ for that of $z^i_0$. Hence decoding $\hat{x}^i_l = g(z^k_l - b^k + b^i)$ amounts to switching the identity of $x^k_l$ to that of person $i$. Figure 2 demonstrates some results when both the target and base persons are chosen from unknown subjects. This generalizability is unique to OAE, and suggests that OAE can be a useful data augmentation tool for many applications, such as face recognition in the presence of high imbalance.

Figure 2: Vector arithmetic results for VGGFace2. Row 1: images of the base person. Row 2: reconstructions of Row 1. Row 3: input (highlighted) and generated images using vector arithmetic (rest).

Subject-level Disentanglement in Representation. Another benefit of our random process modeling is that subjects can be well separated in the latent space. Figure 3 shows t-SNE maps [Maaten and Hinton, 2008] of the latent space representation of 225 randomly selected images of 10 subjects, known or unknown. For known subjects, clustering by subject is clear. Unknown subjects are also separated well, judged both by visual inspection and by the ratio of the within-group sum of squares (SSW) to the between-group sum of squares (SSB); SSW/SSB is less than 1 in both cases. By design, WAE could not separate subjects at all.

Figure 3: t-SNE map of the encoded images from VGGFace2. (a) Known subjects. (b) Unknown subjects. (c) WAE. Each color represents a single person.

4.4 MNIST Dataset

The goal of this experiment is to see how well OAE performs when the number of subjects is given and fixed. With balanced data, conditional methods such as c-AAE are expected to perform well. In the presence of class imbalance, however, the random process-based OAE has an advantage due to its generalizability.

Algorithm Parameters. We chose $d_z = 8$ as the latent space dimension, and used hyperparameters $\mu_0 = 0$, $\sigma_0^2 = 1$, $\tau_0^2 = 100$. The encoder-decoder architecture had 6.1M parameters and the discriminator had 265k parameters. We set $\lambda_1 = 10$, $\lambda_2 = 10$, and $\lambda = 10$. All models were trained for 100 epochs with mini-batch size 100, with learning rates of 0.01 for the encoder-decoder and 0.005 for the discriminator, which were manually halved at the 30th and 50th epochs. The network architectures for c-AAE and WAE were mostly the same as for OAE except for the random intercept part.

Evaluation Measures. As in the VGGFace2 experiment, we evaluated the MSE of the reconstruction of given images and measured the sharpness of the generated images. To compare the class-conditional generation quality, we generated class-conditional samples from OAE and c-AAE, then calculated the classification accuracy of the generated digits, as measured by a pre-trained deep MNIST digit classifier with 99.2% accuracy.

|         | Balanced MSE | Balanced Accuracy | Balanced SSIM | Imbalanced MSE | Imbalanced Accuracy | Imbalanced SSIM |
|---------|--------------|-------------------|---------------|----------------|---------------------|-----------------|
| OAE     | 0.793        | 0.992             | 0.318         | 0.977          | 0.972               | 0.320           |
| c-AAE   | 0.572        | 0.877             | 0.224         | 0.661          | 0.839               | 0.190           |
| WAE     | 0.646        | -                 | -             | 0.759          | -                   | -               |
| Testset | -            | 0.999             | 0.235         | -              | 0.999               | 0.235           |

Table 2: MNIST evaluation. MSE (lower is better), accuracy (larger is better), sharpness (similar to the test set is better), SSIM (similar to the test set is better).
Additionally, we compared the diversity of the generated samples per class by evaluating the structural similarity (SSIM), a perceptual similarity metric ranging between 0 and 1 [Wang et al., 2004; Odena et al., 2017]. We evaluated the mean SSIM score of 50 randomly chosen image pairs conditioned on each digit, and took the average of the digit-wise mean SSIM scores.

Balanced Training Data. We used balanced training data with 10 classes of 56,000 images and balanced test data with 10 classes of 1,000 images. We also generated 10 classes of 1,000 images each from c-AAE and OAE, and 10,000 images from WAE, ignoring classes. The accuracy shown in Table 2 suggests that OAE mostly generated correct digits whereas c-AAE sometimes failed. The slightly higher reconstruction error did not harm the classifier. The diversity of the generated samples was similar across models.

Imbalanced Training Data. To create an imbalanced dataset, we dropped 90% of the images in three randomly chosen classes (the digits 0, 3, and 4) from the balanced training set. The resulting set had 10 classes and 40,933 images. Table 2 reveals that the accuracy gap between OAE and c-AAE for the generated samples widened in the imbalanced setting.

Additional Examples. The Online Supplementary Material at https://tinyurl.com/y3ghw3yp contains additional visualizations of generating new variations of digits and of disentanglement in the representation space.

5 Conclusion

In this work we paid attention to the nested data structure of common machine learning datasets, which led us to view the data as a collection of i.i.d. observations of exchangeable random processes. We then introduced the optimal transport distance between stationary random processes. Using this, we proposed the Ornstein auto-encoder, which, under the exchangeability inherently residing in the data, reduces to a tractable optimization problem. Our random process approach allowed us to generate correlated samples for unknown subjects never used in training, which has been impossible for previous works on generative latent variable models. In the future, we plan to expand this work to non-exchangeable stationary random processes. Another helpful direction would be latent variable modeling of multilevel data, which often arise in biomedical applications.

Acknowledgments

This work is a part of the SNU-Samsung Smart Campus research program, supported by Samsung Electronics Co., Ltd.

References

[Bousquet et al., 2017] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.

[Cao et al., 2018] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74, 2018.

[Diggle et al., 2002] Peter J Diggle, Patrick J Heagerty, Kung-Yee Liang, and Scott L Zeger. Analysis of Longitudinal Data. Oxford University Press, 2002.

[Dundar et al., 2007] Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R. Bharat Rao. Learning classifiers when the training data is not IID. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 756–761. Morgan Kaufmann Publishers Inc., 2007.
[Fitzmaurice et al., 2012] Garrett M Fitzmaurice, Nan M Laird, and James H Ware. Applied Longitudinal Analysis, volume 998. John Wiley & Sons, 2012.

[Gray et al., 1975] Robert M Gray, David L Neuhoff, and Paul C Shields. A generalization of Ornstein's d-bar distance with applications to information theory. The Annals of Probability, pages 315–328, 1975.

[Gretton et al., 2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[Heusel et al., 2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456. PMLR, 2015.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, December 2014.

[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[Makhzani et al., 2016] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.

[Odena et al., 2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2642–2651. JMLR.org, 2017.

[Ornstein, 1973] Donald S Ornstein. An application of ergodic theory to probability theory. The Annals of Probability, 1(1):43–58, 1973.

[Parkhi et al., 2015] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In British Machine Vision Conference, volume 1, page 6, 2015.

[Rubenstein et al., 2018] Paul K. Rubenstein, Bernhard Schölkopf, and Ilya Tolstikhin. Wasserstein auto-encoders: Latent dimensionality and random encoders. In Workshop at the 6th International Conference on Learning Representations (ICLR), May 2018.

[Tolstikhin et al., 2018] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.

[Villani, 2008] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[Wang et al., 2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.