Published as a conference paper at ICLR 2025

BAYESIAN REGULARIZATION OF LATENT REPRESENTATION

Chukwudi Paul Obite, Zhi Chang, Keyan Wu, Shiwei Lan
School of Mathematical & Statistical Sciences, Arizona State University
901 S Palm Walk, Tempe, AZ 85287, USA

ABSTRACT

The effectiveness of statistical and machine learning methods depends on how well data features are characterized. Developing informative and interpretable latent representations with controlled complexity is essential for visualizing data structure and for facilitating efficient model building through dimensionality reduction. Latent variable models, such as Gaussian Process Latent Variable Models (GP-LVM), have become popular for learning complex, nonlinear representations as alternatives to Principal Component Analysis (PCA). In this paper, we propose a novel class of latent variable models based on the recently introduced Q-exponential process (QEP), which generalizes GP-LVM with a tunable complexity parameter, q > 0. Our approach, the Q-exponential Process Latent Variable Model (QEP-LVM), subsumes GP-LVM as a special case when q = 2, offering greater flexibility in managing representation complexity while enhancing interpretability. To ensure scalability, we incorporate sparse variational inference within a Bayesian training framework. We establish connections between QEP-LVM and probabilistic PCA, demonstrating its superior performance through experiments on datasets such as the Swiss roll, oil flow, and handwritten digits.

Keywords: Dimensionality Reduction, Latent Variable Models, Representation Complexity Regularization, Variational Inference, Generative Models.

1 INTRODUCTION

In various learning tasks involving high-dimensional data, it is crucial to effectively learn a representation of the data that lies in a low-dimensional subspace.
Such a latent representation is often the key in unsupervised learning to understanding complex data structure that is infeasible to visualize in the original space. What is more, reducing data to a low-dimensional latent space is also central to building efficient models in supervised learning. Principal component analysis (PCA) (Pearson, 1901; Jolliffe, 1986; Jolliffe & Cadima, 2016) has long been used as a classic technique in factor analysis. It was later cast into a probabilistic latent variable model (LVM) with a linear (Tipping & Bishop, 1999; Minka, 2000) or a nonlinear (Schölkopf et al., 1998) mapping from the latent space to the data feature space. In the probabilistic PCA literature, Lawrence (2003; 2005) proposes to employ a Gaussian process (GP) (Rasmussen & Williams, 2005) as a prior for the mapping in the LVM. Such a GP-LVM views data as functional outputs of latent variables and can be regarded as a linear or nonlinear probabilistic PCA through a linear or more general kernel, respectively. Titsias & Lawrence (2010) then develop a Bayesian version of GP-LVM and an efficient variational Bayes inference with sparse approximation through inducing points (Titsias, 2009). Since then it has been extensively applied in neuroscience (Gundersen et al., 2021), bioinformatics (Ahmed et al., 2018), and robotics (Delgado-Guerrero et al., 2020), etc. More recent development extends GP-LVM to discriminative classification (Urtasun & Darrell, 2007), Gaussian mixture models (Nickisch & Rasmussen, 2010), deep probabilistic models (Damianou & Lawrence, 2013), inverse problems (Atkinson & Zabaras, 2019), and longitudinal modeling (Le & Honavar, 2020). (Correspondence: slan@asu.edu)

All the aforementioned works are based on GP, which tends to be over-smooth for inhomogeneous data with abrupt changes or sharp contrast.
To address this issue, a new stochastic process named the Q-exponential process (Q-EP) (Li et al., 2023) has been proposed based on the Lq penalty to regularize the modeling effect through a parameter q > 0, which also embraces GP as a special case for q = 2. In this paper, we aim to port such flexibility in regularization to LVMs. More specifically, we replace GP in GP-LVM (Lawrence, 2003) with Q-EP to propose a novel class of LVMs named Q-exponential Process Latent Variable Models (QEP-LVM), parameterized by the regularization parameter q > 0. We establish its connection to probabilistic PCA and provide an explicit formula for the maximum likelihood estimator of the latent variable singular values. We also develop the Bayesian QEP-LVM and adopt variational inference as in Titsias & Lawrence (2010), with sparse approximation via inducing points (Titsias, 2009). To derive the evidence lower bound (ELBO) for the variational form of Bayesian QEP-LVM, we adopt the approach of Hensman et al. (2015) to directly compute the variational distribution of the latent function instead of using variational calculus (Titsias & Lawrence, 2010). Though not straightforward in the Q-EP setting, a tractable form of the ELBO can be obtained with the help of Jensen's inequality. We demonstrate the regularization effect through the parameter q > 0 on the latent representations learned with QEP-LVM, some of which exhibit controlled compactness or enhanced interpretability compared with GP-LVM (q = 2) (see Section 4). It remains an open question how to determine the optimal q automatically based on the given data. We therefore adopt a Bayesian approach by imposing appropriate priors on q to obtain an optimal choice. With chosen regularization parameters, we also investigate the generative classification models built on fitted LVMs and show that GP-LVM is often sub-optimal, especially for inhomogeneous data. All the numerical examples have been efficiently implemented in GPyTorch (Gardner et al., 2018).
Connection to existing works on non-Gaussian LVMs. Our proposed QEP-LVM directly generalizes GP-LVM (Titsias & Lawrence, 2010) with flexible regularization on the learned latent representations. There are also some works on non-Gaussian LVMs falling in the same category as QEP-LVM. Palmer et al. (2005) theoretically characterize variational representations of LVMs with non-Gaussian priors constructed as a supremum or an integral of Gaussian densities over the scale parameter. Kleppe & Skaug (2008) build and fit non-Gaussian LVMs via the moment-generating function. Salehkaleybar et al. (2020); Xie et al. (2022); Chen et al. (2022) consider linear causal LVMs with non-Gaussian noises. Gundersen et al. (2021); Zhang et al. (2023); Li et al. (2024) propose non-Gaussian LVMs with random Fourier features in the Karhunen-Loève expansion of GP tailored to non-Gaussian data. The non-Gaussianity of these LVMs is constructed by certain compositions of GPs. By contrast, QEP-LVM is, if not the first, among the first LVMs developed directly based on a non-Gaussian stochastic process. Our proposed work makes multi-fold contributions to the LVM literature:

1. We propose a novel LVM based on Q-EP which learns latent data representations with flexible complexity.
2. We establish the connection between QEP-LVM and nonlinear probabilistic PCA.
3. We demonstrate the regularization effect of Bayesian QEP-LVM, which enables enhanced latent representation learning compared with GP-LVM.

The rest of the paper is organized as follows. Section 2 reviews Q-EP as a flexible prior for Bayesian multi-output regression. We then introduce QEP-LVM as a nonlinear probabilistic PCA and the variational inference for Bayesian QEP-LVM in Section 3. Section 4 investigates the regularization effect of QEP-LVM via the parameter q > 0 and demonstrates that the optimal q leads to a better latent representation compared with that obtained by GP-LVM (q = 2).
Finally, we conclude in Section 5 with a discussion of some future directions.

2 BACKGROUND: BAYESIAN MODELS WITH Q-EXPONENTIAL PRIORS

2.1 Q-EXPONENTIAL DISTRIBUTION AND PROCESS

Motivated by Lq regularization, Li et al. (2023) propose the multivariate q-exponential distribution, which is exchangeable and consistent with respect to marginalization.

Definition 1. A multivariate q-exponential distribution for a random vector u ∈ R^N, denoted q-ED_N(µ, C), has the following density:
$$p(u\mid \mu, C, q) = \frac{q}{2}(2\pi)^{-\frac{N}{2}}|C|^{-\frac{1}{2}}\, r^{\frac{N}{2}\left(\frac{q}{2}-1\right)} \exp\left\{-\frac{r^{\frac{q}{2}}}{2}\right\},\qquad r(u) = (u-\mu)^{\mathsf T} C^{-1}(u-\mu).$$

Remark 1. The above density (log-convex, heavy-tailed around the mean for 0 < q < 2; log-concave, light-tailed for q > 2), when taking the negative logarithm, is dominated by a weighted Lq norm of u − µ, i.e. $\frac{1}{2}\|u-\mu\|_C^q$. From the optimization perspective, q-ED_N, when used as a prior, imposes an Lq regularization in obtaining the maximum a posteriori (MAP) estimate.

Suppose a function u(x) is observed at N locations, $x_1,\dots,x_N \in \mathcal D \subset \mathbb R^d$. Li et al. (2023) define the following q-exponential process (Q-EP) based on a scaled q-exponential distribution q-ED*_N(0, C) assumed for the random vector u = (u(x_1),…,u(x_N)).

Definition 2. A (centered) q-exponential process u(x) with kernel C, q-EP(0, C), is a collection of random variables such that any finite set, u := (u(x_1),…,u(x_N)), follows a scaled multivariate q-exponential distribution q-ED*_N(0, C), i.e. $N^{\frac{1}{q}-\frac{1}{2}}\, u \sim q\text{-ED}_N(0, C)$, where C = [C(x_i, x_j)]_{N×N}. If C = I, then u is said to be marginally identical but uncorrelated (m.i.u.).

Remark 2. If q = 2, q-ED_N(µ, C) reduces to N_N(µ, C) and q-EP(0, C) becomes GP(0, C). When q ∈ [1, 2), q-EP(0, C) lends flexibility to modeling functional data with more regularization than GP (see Figure 1 for the regularization effect of q).

For multiple Q-EPs, (u_1(x),…,u_D(x)), we define a multi-output (multivariate) Q-EP through matrix vectorization, vec(U_{N×D}) = [u_1^T,…,u_D^T]^T, which forms a vector by concatenating its columns.
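The density in Definition 1 translates directly into a few lines of numpy. The helper below is our own sketch (the function name and toy inputs are not from the paper), following the density form as reconstructed above; at q = 2 it reduces to the multivariate Gaussian log-density up to the shared constant log(q/2) − (N/2) log(2π).

```python
import numpy as np

def qed_logpdf(u, mu, C, q):
    """Log-density of q-ED_N(mu, C) from Definition 1, up to the additive
    constant log(q/2) - (N/2) log(2*pi) shared with the Gaussian case.
    A sketch of the stated formula, not the authors' code."""
    N = len(u)
    diff = u - mu
    r = diff @ np.linalg.solve(C, diff)        # r(u) = (u - mu)^T C^{-1} (u - mu)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * logdet + (N / 2) * (q / 2 - 1) * np.log(r) - 0.5 * r ** (q / 2)
```

Varying q changes only the r-dependent terms, which is exactly the Lq-type regularization discussed in Remark 1.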
Note, the component processes are often assumed uncorrelated, rather than independent.

Definition 3. A multi-output (multivariate) q-exponential process, u(·) = (u_1(·),…,u_D(·)) with each u_j(·) ∼ q-EP(µ_j, C_x), is said to have association C_t if at any finite locations, x = {x_n}_{n=1}^N,
$$\mathrm{vec}([u_1(\mathbf x),\dots,u_D(\mathbf x)]_{N\times D}) \sim q\text{-ED}_{ND}(\mathrm{vec}(\boldsymbol\mu),\ C_t\otimes C_x),$$
where u_j(x) = [u_j(x_1),…,u_j(x_N)]^T for j = 1,…,D, µ = [µ_1(x),…,µ_D(x)]_{N×D}, and C_x = [C_x(x_n, x_m)]_{N×N}. We denote u ∼ q-EP(µ, C_x, C_t). In particular, the component processes are m.i.u. if C_t = I_D.

2.2 BAYESIAN REGRESSION WITH Q-EP PRIORS

Given data x = {x_n}_{n=1}^N and y = {y_n}_{n=1}^N, the generic Bayesian regression model
$$y = f(x) + \varepsilon,\qquad \varepsilon \sim q\text{-ED}_N(0,\Sigma),\qquad f \sim q\text{-EP}(0, \mathcal C), \qquad (1)$$
has a tractable posterior (predictive) distribution similar to GP regression, as in Theorem 3.5 of Li et al. (2023). Denote X = [x_1,…,x_Q]_{N×Q}, F = [f_1(X),…,f_D(X)]_{N×D} and Y = [y_1,…,y_D]_{N×D}. With m.i.u. Q-EP priors as in Definition 3 imposed on f := (f_1,…,f_D), we now consider the following multivariate regression problem:
$$\text{likelihood:}\quad \mathrm{vec}(Y)\,|\,F \sim q\text{-ED}_{ND}(\mathrm{vec}(F),\ I_D\otimes\Sigma),\qquad \text{prior on latent function:}\quad f \sim q\text{-EP}(0, \mathcal C, I_D). \qquad (2)$$
Noticing that Y = F + ε with vec(ε) ∼ q-ED(0, I_D ⊗ Σ), we can find the marginal of Y based on the properties of q-ED (Fang & Zhang, 1990):
$$\text{marginal likelihood:}\quad \mathrm{vec}(Y)\,|\,X \sim q\text{-ED}_{ND}(0,\ I_D\otimes(\mathcal C + \Sigma)). \qquad (3)$$
In the following, we view Y as the only data and regard X merely as the latent variable. It is usually assumed that Q ≪ D in latent representation learning. Different from standard regression problems, a latent variable model (LVM) seeks to solve for X in (2), or equivalently (3), given data Y, either deterministically or probabilistically (assuming a proper prior on X).

3 Q-EP LATENT VARIABLE MODEL

GP-LVM was introduced by Lawrence (2003; 2005) as an unsupervised learning method for dimensionality reduction.
It can be interpreted as a nonlinear extension of probabilistic PCA (Tipping & Bishop, 1999; Minka, 2000). The GP can be replaced by Q-EP to impose more regularization on the latent space; hence we propose a Q-EP based LVM (QEP-LVM).

3.1 QEP-LVM AS A NON-LINEAR PROBABILISTIC PCA

For convenience, we set Σ = β^{-1} I_N in the following. Let us start with the linear latent function F = XW in model (2), with W ∈ R^{Q×D}. The original probabilistic PCA (Tipping & Bishop, 1999) is formulated by integrating out the latent variable X: p(Y|W, β) = ∫ p(Y|X, W, β) p(X) dX with prior p(X) = N(X; 0, I_{NQ}). The resulting marginal likelihood, p(Y|W, β) = N(vec(Y); 0, (W^T W + β^{-1} I_D) ⊗ I_N), is maximized at the solution W*, which spans the principal subspace of the data Y. Lawrence (2003) forms the dual problem of probabilistic PCA by marginalizing the weight parameter W: p(Y|X, β) = ∫ p(Y|X, W, β) p(W) dW with prior p(W) = N(W; 0, I_{QD}), and optimizing the resulting p(Y|X, β) = N(vec(Y); 0, I_D ⊗ (XX^T + β^{-1} I_N)) for the optimum X*, which is related to the eigen-decomposition of the empirical covariance matrix YY^T. Lawrence (2003; 2005) replaces the above linear kernel with general ones, e.g. K = C + Σ as in (3), to learn a nonlinear latent reduction in X, essentially defining an LVM with a GP mapping (GP-LVM).

Now we impose an m.i.u. Q-EP prior vec(W)|α ∼ q-ED(0, α^{-1} I_{QD}), which induces the following prior on the latent function F: vec(F) = (I_D ⊗ X) vec(W) ∼ q-ED(0, α^{-1} I_D ⊗ XX^T). The following marginal likelihood of Y|X can be obtained similarly to (3), which defines a stochastic mapping from the latent space of X to the data space of Y:
$$\mathrm{vec}(Y)\,|\,X, \alpha, \beta \sim q\text{-ED}(0,\ I_D \otimes K),\qquad K = \alpha^{-1} XX^{\mathsf T} + \beta^{-1} I_N. \qquad (4)$$
Similarly to GP-LVM, (3), or (4) with a general kernel K, defines an LVM with a Q-EP mapping (QEP-LVM). Let r(Y) = vec(Y)^T (I_D ⊗ K)^{-1} vec(Y) = tr(K^{-1} YY^T). The log-likelihood of (4) is, up to an additive constant,
$$-\frac{D}{2}\log|K| + \frac{ND}{2}\left(\frac{q}{2}-1\right)\log r(Y) - \frac{1}{2}\, r(Y)^{\frac{q}{2}}. \qquad (5)$$
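The log-likelihood (5) with the linear kernel K = α^{-1}XX^T + β^{-1}I_N transcribes directly into code. This is our sketch of the stated formula (function name and default parameters are illustrative), not the authors' implementation; for q = 2 the middle term vanishes and the expression is the usual GP-LVM log-likelihood.

```python
import numpy as np

def qeplvm_loglik(Y, X, alpha=1.0, beta=100.0, q=1.5):
    """Log-likelihood (5) of the linear-kernel QEP-LVM, up to an additive constant:
      -D/2 log|K| + ND/2 (q/2 - 1) log r(Y) - 1/2 r(Y)^{q/2},
    with K = alpha^{-1} X X^T + beta^{-1} I_N and r(Y) = tr(K^{-1} Y Y^T)."""
    N, D = Y.shape
    K = X @ X.T / alpha + np.eye(N) / beta
    _, logdetK = np.linalg.slogdet(K)
    r = np.trace(np.linalg.solve(K, Y @ Y.T))
    return -D / 2 * logdetK + N * D / 2 * (q / 2 - 1) * np.log(r) - 0.5 * r ** (q / 2)
```

Optimizing this objective over X (and the kernel parameters) is the deterministic, non-Bayesian route to a QEP-LVM fit.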
The following theorem states that the maximum likelihood estimator (MLE) for X is equivalent to the solution of the dual probabilistic PCA (Tipping & Bishop, 1999; Minka, 2000).

Theorem 3.1. Suppose YY^T/D has eigen-decomposition UΛU^T with Λ the diagonal matrix of eigenvalues {λ_i}_{i=1}^N. Then the MLE for (5) is
$$X^* = U_Q L V^{\mathsf T},\qquad L = \mathrm{diag}\left(\left\{\sqrt{\alpha\,(c\lambda_i - \beta^{-1})}\right\}_{i=1}^{Q}\right),\qquad c(q) = D^{1-\frac{2}{q}}\,\frac{(D\wedge Q)\,q}{2(D\wedge Q) + (q-2)N},$$
where U_Q is the N×Q matrix of the first Q eigenvectors in U, V is an arbitrary Q×Q orthogonal matrix, and a∧b := min{a, b}.

Proof. See Appendix A.

Remark 3. When q = 2, the singular values of the latent variable X reduce to $l_i = \sqrt{\alpha(\lambda_i - \beta^{-1})}$, corresponding to those of GP-LVM, though they were mistakenly stated as $l_i = (\alpha^{-1}(\lambda_i - \beta^{-1}))^{\frac{1}{2}}$ in Lawrence (2003). As q > 0 varies, it regularizes the singular values (hence the latent space spanned by X) through c(q), whose properties are illustrated in Figure A.1.

To detect complicated nonlinear latent representations, we consider the following automatic relevance determination (ARD) squared exponential (SE) kernel k(·,·) (Titsias & Lawrence, 2010):
$$K = [k(x_n, x_m)]_{N\times N},\qquad k(x_n, x_m) = \alpha^{-1}\exp\left\{-\frac{1}{2}(x_n - x_m)^{\mathsf T}\,\mathrm{diag}(\gamma)\,(x_n - x_m)\right\}. \qquad (6)$$
Then the location of X can be determined by optimizing (5) with respect to X. The full solution of QEP-LVM also involves optimizing the kernel parameters (α, γ). Other kernels may lead to slightly different results, but the numerical conclusions (Section 4) are robust to the choice of kernel.

3.2 BAYESIAN QEP-LVM

In this section we introduce Bayesian QEP-LVM and develop variational inference as in Bayesian GP-LVM (Titsias & Lawrence, 2010). Compared with the optimization method (Lawrence, 2003), the Bayesian training procedure is robust to over-fitting and can automatically determine the dimensionality of the nonlinear latent space (Titsias & Lawrence, 2010), e.g. by thresholding the correlation length γ.
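Theorem 3.1 is constructive, so the MLE can be computed directly from an eigen-decomposition of YY^T/D. The helper below is our reading of the stated formula (V is set to the identity and negative arguments of the square root are clipped to zero); at q = 2 one has c(2) = 1 and the GP-LVM singular values of Remark 3 are recovered.

```python
import numpy as np

def qeplvm_mle_latent(Y, Q, alpha=1.0, beta=100.0, q=2.0):
    """MLE of the latent variable per Theorem 3.1 (our sketch, with V = I_Q):
      X* = U_Q diag(l),  l_i = sqrt(alpha * (c * lambda_i - 1/beta)),
      c(q) = D^{1-2/q} * m * q / (2*m + (q-2)*N),  m := min(D, Q),
    where lambda_i are the top-Q eigenvalues of Y Y^T / D."""
    N, D = Y.shape
    lam, U = np.linalg.eigh(Y @ Y.T / D)       # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:Q]            # top-Q eigenpairs
    lam, U_Q = lam[idx], U[:, idx]
    m = min(D, Q)
    c = D ** (1 - 2 / q) * m * q / (2 * m + (q - 2) * N)
    l = np.sqrt(np.clip(alpha * (c * lam - 1 / beta), 0.0, None))
    return U_Q * l                             # columns scaled by the singular values
```

For q ≠ 2 the factor c(q) shrinks or inflates the singular values, which is the regularization effect discussed in Remark 3.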
The derivation of variational Bayes for QEP-LVM is much more involved because the log-likelihood (5) is no longer a quadratic form of the data. Yet an appropriate evidence lower bound (ELBO) can still be obtained with Jensen's inequality. In addition to the likelihood model (4), we impose a prior on the latent variable X and consider the following Bayesian QEP-LVM:
$$\text{marginal likelihood:}\quad \mathrm{vec}(Y)\,|\,X \sim q\text{-ED}(0,\ I_D\otimes K),\qquad \text{prior on latent variable:}\quad \mathrm{vec}(X) \sim q\text{-ED}(0,\ I_{NQ}).$$
We use variational Bayes to approximate the posterior distribution p(X|Y) ∝ p(Y|X) p(X) with a variational distribution using an uncorrelated q-ED:
$$\text{variational distribution for latent variable:}\quad q(X) \sim q\text{-ED}(\mu,\ \mathrm{diag}(\{S_n\})),$$
where each covariance S_n is of size Q×Q and can be chosen as a diagonal matrix for convenience. Due to the identity of the log-evidence, log p(Y) = KL(q(X)‖p(X|Y)) + L(q(X)), minimizing the KL divergence is equivalent to maximizing the ELBO L(q(X)):
$$\log p(Y) \ge \mathcal L(q(X)) := \int q(X)\log\frac{p(Y|X)\,p(X)}{q(X)}\,dX = \tilde{\mathcal L}(q(X)) - \mathrm{KL}(q(X)\,\|\,p(X)),$$
where the first term $\tilde{\mathcal L}(q(X)) = \int q(X)\log p(Y|X)\,dX =: \langle\log p(Y|X)\rangle_{q(X)}$ is intractable and hence difficult to bound.

3.2.1 LOWER BOUND FOR THE MARGINAL LIKELIHOOD

To address this intractability and to speed up computation, a sparse variational approximation (Titsias, 2009) is adopted by introducing a set of inducing points X̃ ∈ R^{M×Q} with their function values U = [f_1(X̃),…,f_D(X̃)] ∈ R^{M×D}. Hence the marginal likelihood p(Y|X) defined in (5) can be augmented to the following joint distribution, each factor being a q-ED:
$$p(Y, F, U\,|\,X, \tilde X) = p(Y|F)\,p(F|U, X, \tilde X)\,p(U|\tilde X),$$
where vec(U)|X̃ ∼ q-ED(0, I_D ⊗ K_{MM}) and the conditional distribution is
$$\mathrm{vec}(F)\,|\,U, X, \tilde X \sim q\text{-ED}\big(\mathrm{vec}(K_{NM}K_{MM}^{-1}U),\ I_D\otimes(K_{NN} - K_{NM}K_{MM}^{-1}K_{MN})\big). \qquad (7)$$
The inducing points X̃ are regarded as variational parameters and hence are dropped from the following probability expressions.
We then approximate p(F, U|X) = p(F|U, X) p(U) with q(F, U) = p(F|U, X) q(U) in another variational Bayes step:
$$\log p(Y|X) \ge \int q(F,U)\log\frac{p(Y|F)\,p(F|U,X)\,p(U)}{q(F,U)}\,dF\,dU = \int\left[\int p(F|U)\,q(U)\,dU\right]\log p(Y|F)\,dF + \int q(U)\log\frac{p(U)}{q(U)}\,dU. \qquad (8)$$
Different from Titsias (2009); Titsias & Lawrence (2010), which use variational calculus, SVGP (Hensman et al., 2015) computes the marginal likelihood ELBO (8) through the variational distribution of the latent function F. Instead of the variational free form, we follow Hensman et al. (2015) and use a variational distribution for U of the following form, conjugate to p(F|U):
$$\text{variational distribution for inducing values:}\quad q(U) \sim q\text{-ED}(M,\ \mathrm{diag}(\{\Sigma_d\})),$$
where M is the variational mean. Noticing that F|U follows the conditional q-exponential (7), we can obtain the variational distribution of F, q(F), by marginalizing out U similarly as in (3):
$$q(F) = \int q(F,U)\,dU = \int p(F|U)\,q(U)\,dU \sim q\text{-ED}\big(\mathrm{vec}(K_{NM}K_{MM}^{-1}M),\ I_D\otimes(K_{NN}-K_{NM}K_{MM}^{-1}K_{MN}) + \mathrm{diag}(\{K_{NM}K_{MM}^{-1}\Sigma_d K_{MM}^{-1}K_{MN}\})\big).$$
Therefore, the variational lower bound of the marginal likelihood (8) becomes
$$\log p(Y|X) \ge \langle\log p(Y|F)\rangle_{q(F)} - \mathrm{KL}(q(U)\,\|\,p(U)).$$
Denote log p(Y|F) = φ_0(r(Y, F)), where
$$\phi_0(r) := \frac{DN}{2}\log\beta + \frac{ND}{2}\left(\frac{q}{2}-1\right)\log r - \frac{1}{2}r^{\frac{q}{2}}$$
is convex in r for q ∈ (0, 2], and r(Y, F) = vec(Y−F)^T (β^{-1} I_{ND})^{-1} vec(Y−F) = β tr((Y−F)(Y−F)^T) is a quadratic form of the random variable Y. Therefore, by Jensen's inequality, we can bound from below:
$$\langle\log p(Y|F)\rangle_{q(F)} = \langle\phi_0(r(Y,F))\rangle_{q(F)} \ge \phi_0\big(\langle r(Y,F)\rangle_{q(F)}\big),$$
where we calculate the expectation of the quadratic form r(Y, F) as
$$\langle r(Y,F)\rangle_{q(F)} = r(Y,\ K_{NM}K_{MM}^{-1}M) + \beta D\,\mathrm{tr}(K_{NN}-K_{NM}K_{MM}^{-1}K_{MN}) + \beta\sum_{d=1}^{D}\mathrm{tr}(K_{NM}K_{MM}^{-1}\Sigma_d K_{MM}^{-1}K_{MN}).$$
Denote h(Y, X) = ⟨⟨log p(Y|F)⟩_{q(F)}⟩_{q(X)}. Then by Jensen's inequality again we have h(Y, X) ≥ φ_0(⟨⟨r(Y, F)⟩_{q(F)}⟩_{q(X)}) =: h*(Y, X). Define ψ_0 = tr(⟨K_{NN}⟩_{q(X)}), Ψ_1 = ⟨K_{NM}⟩_{q(X)}, and Ψ_2 = ⟨K_{MN}K_{NM}⟩_{q(X)}.
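The direction of Jensen's inequality used above, ⟨φ_0(r)⟩ ≥ φ_0(⟨r⟩) for convex φ_0, is easy to confirm numerically. In the sketch below, the Gamma samples are merely a stand-in for draws of r(Y, F) under q(F), and the dimensions are toy values of our choosing.

```python
import numpy as np

def phi0(r, N=4, D=3, beta=100.0, q=1.5):
    """phi0(r) = DN/2 log(beta) + ND/2 (q/2 - 1) log(r) - 1/2 r^{q/2},
    which is convex in r for q in (0, 2]. Toy dimensions, our illustration."""
    return D * N / 2 * np.log(beta) + N * D / 2 * (q / 2 - 1) * np.log(r) - 0.5 * r ** (q / 2)

rng = np.random.default_rng(1)
r = rng.gamma(shape=3.0, scale=2.0, size=100_000)   # stand-in draws of r(Y, F)
# Jensen for convex phi0: E[phi0(r)] >= phi0(E[r]) -- the direction used for the ELBO
assert phi0(r).mean() >= phi0(r.mean())
```

For q = 2 the log term drops out and the bound is tight up to the variance of r, which is why the GP-LVM ELBO needs no such extra step.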
Further, we calculate the expectations of the quadratic terms similarly:
$$\langle\langle r(Y,F)\rangle_{q(F)}\rangle_{q(X)} = \beta D\big[\psi_0 - \mathrm{tr}(K_{MM}^{-1}\Psi_2)\big] + \beta\sum_{d=1}^{D}\mathrm{tr}(K_{MM}^{-1}\Sigma_d K_{MM}^{-1}\Psi_2) + r(Y,\ \Psi_1 K_{MM}^{-1}M) + \beta\,\mathrm{tr}\big(M^{\mathsf T}K_{MM}^{-1}(\Psi_2 - \Psi_1^{\mathsf T}\Psi_1)K_{MM}^{-1}M\big).$$
We bound the two KL divergence terms, KL_U := KL(q(U)‖p(U)) and KL_X := KL(q(X)‖p(X)), by a similar argument with Jensen's inequality and expectations of quadratic forms. Details are left to Appendix B.2 for interested readers.

3.2.2 SUMMARY OF ELBO

Denote
$$\phi(r;\Sigma,D) := -\frac{D}{2}\log|\Sigma| + \frac{ND}{2}\left(\frac{q}{2}-1\right)\log r - \frac{1}{2}r^{\frac{q}{2}}.$$
We summarize the final ELBO L*(q(X)) as follows:
$$\log p(Y) \ge \mathcal L(q(X)) = \int q(X)\,q(U)\,p(F|U,X)\log\frac{p(Y|F)\,p(U)\,p(X)}{q(U)\,q(X)}\,dF\,dU\,dX \ge \mathcal L^*(q(X)) := h^*(Y,X) - \widetilde{\mathrm{KL}}_U - \widetilde{\mathrm{KL}}_X, \qquad (9)$$
where
$$h^*(Y,X) = \phi(r_Y;\ \beta^{-1}I_N,\ D),$$
$$r_Y = r(Y,\ \Psi_1K_{MM}^{-1}M) + \beta\,\mathrm{tr}\big(M^{\mathsf T}K_{MM}^{-1}(\Psi_2-\Psi_1^{\mathsf T}\Psi_1)K_{MM}^{-1}M\big) + \beta D\big[\psi_0 - \mathrm{tr}(K_{MM}^{-1}\Psi_2)\big] + \beta\sum_{d=1}^{D}\mathrm{tr}(K_{MM}^{-1}\Sigma_d K_{MM}^{-1}\Psi_2),$$
$$\widetilde{\mathrm{KL}}_U = -\frac{1}{2}\sum_{d=1}^{D}\log|\Sigma_d| - \phi\Big(\mathrm{tr}(M^{\mathsf T}K_{MM}^{-1}M) + \sum_{d=1}^{D}\mathrm{tr}(\Sigma_d K_{MM}^{-1});\ K_{MM},\ D\Big),$$
$$\widetilde{\mathrm{KL}}_X = -\frac{1}{2}\sum_{n=1}^{N}\log|S_n| - \phi\Big(\mu^{\mathsf T}\mu + \sum_{n=1}^{N}\mathrm{tr}(S_n);\ I_N,\ Q\Big),$$
with ψ_0 = tr(⟨K_{NN}⟩_{q(X)}), Ψ_1 = ⟨K_{NM}⟩_{q(X)}, and Ψ_2 = ⟨K_{MN}K_{NM}⟩_{q(X)} (Appendix B.1).

Remark 4. The variational solution q(X) can be obtained by maximizing the ELBO (9) with respect to the variational parameters (µ, {S_n}, X̃, M, {Σ_d}) and kernel parameters (α, β, γ).

Remark 5. When q = 2, φ(r; Σ, D) = −(D/2) log|Σ| − r/2 with r = r(Y, Ψ_1 K_{MM}^{-1} M) becomes the log-density of the matrix normal MN_{N×D}(Ψ_1 K_{MM}^{-1} M, β^{-1}I_N, I_D). Then the ELBO (9) reduces to Equation (7) of SVGP (Hensman et al., 2015) with an extra term β tr(M^T K_{MM}^{-1}(Ψ_2 − Ψ_1^T Ψ_1)K_{MM}^{-1}M), which is not computed when q = 2, falling back to GP-LVM. The computational complexity, O(NM^2), remains the same as SVGP and GP-LVM (Titsias & Lawrence, 2010).

3.3 WHAT q?

While the parameter q > 0 regularizes the learned latent representations (see Section 4), it remains a question how to automatically search for the optimal q*, rather than picking one manually. Therefore, we impose a Gamma prior, q ∼ Γ(α_0, β_0), and jointly optimize its posterior for the optimal q*.
Note, the target function, the ELBO (9), involves q only through the function φ, which is concave in q ($\frac{d^2\phi}{dq^2} = -\frac{1}{2}r^{\frac{q}{2}}\left(\frac{1}{2}\log r\right)^2 \le 0$). Therefore, if we block-update parameters in the overall optimization procedure, maximizing the ELBO with respect to q should return a unique solution, though the joint densities could be rather complicated (see Figure C.1). In the numerical experiments (Section 4), we adopt α_0 = 4 and β_0 = 2 so that the prior mass is concentrated in (1, 2). Figure B.1 investigates φ as a function of q, which plays a key role in finding the optimal q*.

3.4 PREDICTION

With Bayesian QEP-LVM, we can make predictions for test data y*. As in Bayesian GP-LVM (Titsias & Lawrence, 2010), we compute the predictive density p(y*|Y) for some test data y*. Given the latent variables X obtained for the training data Y and a new test latent variable x*, the predictive density can be written and approximated as follows:
$$p(y_*|Y) = \frac{\int p(y_*, Y\,|\,X, x_*)\,p(X, x_*)\,dX\,dx_*}{\int p(Y|X)\,p(X)\,dX} \approx \exp\{\mathcal L(q(X, x_*)) - \mathcal L(q(X))\}, \qquad (10)$$
where L is the ELBO as in (9). Specifically, we train a Bayesian QEP-LVM based on Y to get L(q(X)). In the testing stage, we append y* to Y, on which we train another Bayesian QEP-LVM with the augmented variational distribution q(X, x*) ∼ q-ED([µ, µ*], diag({[S_n, S*]})) to obtain L(q(X, x*)). The predictive density (10) can be utilized to build generative models for classifying labels. Suppose in addition to data Y, we have labels t = {t_i}_{i=1}^N falling in K categories. Let Y^{(k)} = {y_i | t_i = k}. Then we train K LVMs, each based on one of {Y^{(k)}}_{k=1}^K, and consider the following classifier:
$$\hat t = \arg\max_k\ p(y_*\,|\,Y^{(k)})\,p(t = k).$$
This will be further investigated in Section 4.2 and Section 4.3.

4 NUMERICAL EXPERIMENTS

In this section, we demonstrate the regularization effect of QEP-LVM on latent representation learning through the parameter q > 0 and compare it with GP-LVM (q = 2).
We investigate the latent representations of multiple datasets learned by QEP-LVMs with varying q > 0 and observe the regularization effect that smaller q tends to contract the latent space. Compared with GP-LVM as a QEP-LVM with fixed q = 2, optimizing the parameter q often leads to a superior latent representation with enhanced interpretability. Throughout this section, we use the kernel (6) and the Gamma prior Γ(4, 2) when varying q > 0 in the Bayesian framework. Since an autoencoder is equivalent to PCA when all of its activation functions are linear, we also include its probabilistic version, the variational autoencoder (VAE) (Kingma & Welling, 2014), for comparison. All the examples are implemented in GPyTorch (Gardner et al., 2018) and the computer codes are publicly available at https://github.com/lanzithinking/Reg_Rep.

4.1 SWISS ROLL

First we consider the Swiss roll dataset (Marsland, 2014), commonly used in manifold learning. Illustrated in the upper left panel of Figure 1, this dataset consists of a 3d point cloud resembling the food of the same name. Because the 1000 points mainly sit on a curved surface, we aim to learn a 2d latent subspace with GP-LVM and QEP-LVM respectively. As shown in the lower left panel of Figure 1, PCA returns a result seemingly obtained by compressing the 3d cloud from above (along the z axis). We then train a VAE with 5 dense layers, and the resulting 2d latent representation, identified by the two largest variances of the latent distribution, resembles that of PCA. QEP-LVMs are trained for different values of q with 25 inducing points.

Figure 1: Latent representation of the Swiss roll dataset. Upper left: 3d cloud of 1000 points; lower left: PCA in the 2d principal space; right 4 columns: 2d latent representations (upper row) by VAE and QEP-LVMs with q = 1.0, 1.5 and 2.0 (GP-LVM), showing a regularization effect via the parameter q, and the corresponding variances of the latent distribution (VAE) and inverse length-scales γ ordered on the x-axes (lower row). Colors are used to aid visualization but not for training.

Figure 2: Latent representation of the Swiss roll dataset output by QEP-LVM with the optimal q* = 1.39 (red line in the rightmost panel) found in the Bayesian framework.

The upper row of the right three panels in Figure 1 compares the latent representations output by QEP-LVMs with q = 1.0, 1.5 and 2.0 (GP-LVM), which all appear as projections from an outward view (along the x axis). As q decreases from 2.0 to 1.0, the learned latent representations contract towards the axes, verifying its regularization effect. The latent result with q = 1.5 best represents the latent roll structure among these plots. Most of the inverse length-scales (γ) of the kernel in the lower row imply 2 dominant dimensions (used to plot the 2d latent representations), indicating an intrinsic dimensionality for this dataset. Next, we let q > 0 be a random variable imposed with a Gamma prior and jointly optimize q with the other parameters in the QEP-LVM. Figure C.1 plots the pairwise densities of q and the length-scale l = 1/γ, which highlights the difficulty of the joint optimization (different scales, complex landscape, etc.). As shown in the rightmost panel of Figure 2, the optimal regularization parameter is attained at q* = 1.39. The resulting latent representation, shown in the leftmost panel of Figure 2, reflects a clear latent roll structure. We dig a hole in the 3d point cloud along the curved surface to make a Swiss hole dataset in Appendix C.1. We also include latent representations by Isomap and t-SNE in Figure C.2 for comparison. Then we repeat the same experiments for this dataset and present similar results in Figure C.3 and Figure C.4. Again, QEP-LVM with regularization parameter around 1.5 outputs a latent representation exhibiting a roll-with-a-hole structure better than GP-LVM and VAE.
4.2 OIL FLOW

Next, we demonstrate the behavior of QEP-LVM and contrast it with GP-LVM using the canonical multi-phase oil flow dataset (Bishop & James, 1993), which consists of 1000 observations (12-dimensional) belonging to three known classes corresponding to different phases of oil flow. This dataset is also used in Lawrence (2003); Titsias & Lawrence (2010). We train the models only on the 12 covariates (no labels used), with the same settings as the example in Section 4.1.

Figure 3: Latent representation of the oil flow dataset. Upper: 2d latent representations by VAE and QEP-LVMs with q = 2.0 (Gaussian), 1.5, 1.15 and 1.0, showing a regularization effect via the parameter q. Lower: the corresponding variances of the latent distribution and inverse length-scales γ. Colors of points label the three classes but are not used for training. Gray-scale colors indicate uncertainty approximated by the variational distribution.

The upper row of Figure 3 visualizes 2d slices of the latent spaces by VAE (identified by the two largest latent variances) and by QEP-LVMs corresponding to the most dominant latent dimensions (identified by the two largest inverse length-scales γ) with q = 2.0 (Gaussian), 1.5, 1.15 and 1.0 respectively. The gray-scale color indicates the uncertainty (computed from S_n in q(X)). Again we observe the regularization effect of the parameter q > 0 of QEP-LVM in latent representation learning: smaller q leads to more regularization on the learned representations and hence yields more aggregated clusters, as illustrated by the green class. In the lower row of Figure 3, except for VAE, all the QEP-LVM models unanimously identify 3 dominant latent dimensions despite their different orders. We also vary q > 0 with the same Gamma prior and obtain the optimal q* = 1.15. The results are shown in Figure C.6. With each trained QEP-LVM, we utilize the predictive density to build a generative classifier as described in Section 3.4.
Then we compare these classifiers on a common test dataset and repeat the experiments for 10 different random seeds. Table 1 summarizes the performance in terms of accuracy (ACC), area under the ROC curve (AUC), adjusted Rand index (ARI) score, normalized mutual information (NMI) score, log predictive probability (LPP) values, and running time (per class). The models with q = 1.0 and 1.15 are generally better than the other two with q = 1.5 and 2.0 (Gaussian), probably because the former two have labeled points more separated across classes in the latent subspace. The best classifier (ACC/AUC) coincides with the one for the optimal q* = 1.15. Note that LPPs are not comparable with each other since they come from different models with different likelihoods. Though not used in training, the class labels can be identified as cluster assignments. Then QEP-LVM with the optimal q* = 1.15 also attains the highest ARI and NMI scores, which are metrics for evaluating clustering algorithms.

Table 1: Classification on oil flow data based on the learned latent representation: accuracy (ACC), area under the ROC curve (AUC), adjusted Rand index (ARI) score, normalized mutual information (NMI) score, log predictive probability (LPP) values, and running time (per class) by various Bayesian QEP-LVMs. Results in each cell are averaged over 10 experiments with different random seeds; values after ± are standard errors of these repeated experiments.

| Model (q) | ACC | AUC | ARI | NMI | LPP | time/class |
|---|---|---|---|---|---|---|
| 1.0 | 0.70 ± 0.04 | 0.86 ± 0.03 | 0.35 ± 0.05 | 0.36 ± 0.05 | -10.42 ± 0.42 | 211.92 ± 12.53 |
| 1.15 | 0.73 ± 0.04 | 0.89 ± 0.02 | 0.38 ± 0.05 | 0.39 ± 0.05 | -9.21 ± 0.25 | 212.30 ± 13.17 |
| 1.5 | 0.67 ± 0.05 | 0.83 ± 0.03 | 0.30 ± 0.07 | 0.33 ± 0.06 | -7.70 ± 0.25 | 206.91 ± 13.58 |
| 2.0 (Gaussian) | 0.68 ± 0.05 | 0.83 ± 0.03 | 0.34 ± 0.07 | 0.37 ± 0.07 | -8.78 ± 0.19 | 198.47 ± 14.42 |

Figure 4: Latent representations of the MNIST database by VAE (left), GP-LVM (middle), and QEP-LVM with q = 1.5 (right).
For the convenience of visualization, the 10-dimensional latent spaces learned by these algorithms are projected to 2d subspaces by t-SNE respectively.

4.3 MNIST

Lastly, we consider the MNIST database (Lecun et al., 1998), consisting of 60,000 training and 10,000 testing handwritten digits of size 28×28. We suppress the labels and use these images alone to train QEP-LVMs for q = 1.5 and 2.0 (Gaussian) with 128 inducing points. We set the latent dimension to 10 and use t-SNE to project the latent spaces by VAE, GP-LVM and QEP-LVM (q = 1.5) to 2d subspaces in Figure 4 for better visualization. Unlike VAE, which generates an unstructured latent representation with all digits mixed up, the outputs by LVMs are much more interpretable. While GP-LVM has segregated clusters of digits 6 and 7 respectively, QEP-LVM has more concentrated clusters, each further apart from the others, despite the confusion-prone groups 3-5-8 and 4-9-(7). Figure C.7 compares the pairwise (10 digits are divided into 5 groups of confusing pairs) latent representations by GP-LVM (lower) and QEP-LVM (q = 1.5, upper). QEP-LVM (q = 1.5) has latent representations better separated for the pairs (0, 6), (2, 3) and (5, 8) compared with GP-LVM. As seen from Figure C.8, QEP-LVM (q = 1.5) generates sample digits of 3 and 5 more clearly than GP-LVM. Table 2 compares the generative classifiers constructed as in Section 3.4. QEP-LVM (q = 1.5) outperforms GP-LVM (q = 2.0) in terms of ACC and AUC. If we view the class labels (not used in training) as cluster assignments, QEP-LVM (q = 1.5) also has better ARI and NMI scores for clustering.

Table 2: Classification on handwritten digits (MNIST) based on the learned latent representation: accuracy (ACC), area under the ROC curve (AUC), adjusted Rand index (ARI) score, normalized mutual information (NMI) score, log predictive probability (LPP) values, and running time (per class) by various Bayesian QEP-LVMs.
Results in each cell are averaged over 10 experiments with different random seeds; values after ± are standard errors of these repeated experiments.

Model (q)       ACC            AUC             ARI            NMI            LPP                  time/class
1.5             0.973 ± 0.012  0.986 ± 0.0065  0.943 ± 0.024  0.952 ± 0.020   -26453.5 ± 481.9    129.91 ± 4.60
2.0 (Gaussian)  0.965 ± 0.012  0.981 ± 0.0062  0.923 ± 0.025  0.938 ± 0.019  -279414.9 ± 5184.6   127.18 ± 0.29

5 CONCLUSION

In this paper, we introduce a novel Bayesian latent variable model based on the recently proposed Q-exponential process (Li et al., 2023) (QEP-LVM) for latent representation learning. Q-EP empowers the LVM with flexible regularization that controls the complexity of the learned latent representations, often with improved interpretability compared with GP-LVM, a special case of QEP-LVM with q = 2. The theoretical connection between QEP-LVM and probabilistic PCA has been established. Using three examples, we demonstrate the advantage of the proposed methodology in terms of both the quality of the learned latent representations and the quantitative performance of the derived generative models. It remains unknown how to efficiently handle missing data, a possible limitation of the proposed QEP-LVM: data imputation based on the variational distribution for the latent function, q(f|X), has no tractable form when integrating with respect to q(X), in contrast to GP-LVM (Equation (21) in Titsias & Lawrence (2010)); Monte Carlo approximation is perhaps unavoidable. Another interesting direction is to generalize QEP-LVM to a deep Q-EP, analogous to deep GP (Damianou & Lawrence, 2013), which we will treat in a separate paper.

ACKNOWLEDGMENTS

SL is supported by NSF grant DMS-2134256.

REFERENCES

Sumon Ahmed, Magnus Rattray, and Alexis Boukouvalas. GrandPrix: scaling up the Bayesian GPLVM for single-cell data. Bioinformatics, 35(1):47–54, 2018. ISSN 1367-4803.

Steven Atkinson and Nicholas Zabaras.
Structured Bayesian Gaussian process latent variable model: applications to data-driven dimensionality reduction and high-dimensional inversion. Journal of Computational Physics, 383:166–195, 2019. ISSN 0021-9991.

C.M. Bishop and G.D. James. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 327(2):580–593, 1993. ISSN 0168-9002.

Zhengming Chen, Feng Xie, Jie Qiao, Zhifeng Hao, Kun Zhang, and Ruichu Cai. Identification of linear latent variable model with arbitrary distribution. Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6350–6357, June 2022. ISSN 2159-5399.

Andreas Damianou and Neil D. Lawrence. Deep Gaussian processes. In Carlos M. Carvalho and Pradeep Ravikumar (eds.), Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pp. 207–215, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.

Juan Antonio Delgado-Guerrero, Adrià Colomé, and Carme Torras. Sample-efficient robot motion learning using Gaussian process latent variable models. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 314–320, 2020.

K. Fang and Y.T. Zhang. Generalized Multivariate Analysis. Science Press, 1990. ISBN 9780387176512.

Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Gregory Gundersen, Michael Zhang, and Barbara Engelhardt. Latent variable modeling with random features.
In Arindam Banerjee and Kenji Fukumizu (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 1333–1341. PMLR, 13–15 Apr 2021.

James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Guy Lebanon and S. V. N. Vishwanathan (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 351–360, San Diego, California, USA, 09–12 May 2015. PMLR.

Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: learning attractor models for motor behaviors. Neural Computation, 25(2):328–373, February 2013. ISSN 1530-888X.

I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, 1986. ISBN 0387954422.

Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202, April 2016. ISSN 1471-2962.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.

Tore Selland Kleppe and Hans J. Skaug. Building and fitting non-Gaussian latent variable models via the moment-generating function. Scandinavian Journal of Statistics, 35(4):664–676, 2008.

Neil Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In S. Thrun, L. Saul, and B. Schölkopf (eds.), Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.

Neil Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models.
Journal of Machine Learning Research, 6(60):1783–1816, 2005.

Thanh Le and Vasant Honavar. Dynamical Gaussian process latent variable model for representation learning from longitudinal data. In Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, FODS '20, pp. 183–188. ACM, October 2020.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Shuyi Li, Michael O'Connor, and Shiwei Lan. Bayesian learning via Q-exponential process. In Proceedings of the 37th Conference on Neural Information Processing Systems. NeurIPS, December 2023.

Ying Li, Zhidi Lin, Feng Yin, and Michael Minyi Zhang. Preventing model collapse in Gaussian process latent variable models. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 28278–28308. PMLR, 21–27 Jul 2024.

S. Marsland. Machine Learning: An Algorithmic Perspective, chapter 6. CRC Press, 2nd edition, 2014. ISBN 9781420067194.

Thomas Minka. Automatic choice of dimensionality for PCA. In T. Leen, T. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000.

Hannes Nickisch and Carl Edward Rasmussen. Gaussian Mixture Modeling with Gaussian Process Latent Variable Models, pp. 272–282. Springer Berlin Heidelberg, 2010. ISBN 9783642159862.

Rok Pahič, Zvezdan Lončarević, Andrej Gams, and Aleš Ude. Robot skill learning in latent space of a deep autoencoder neural network. Robotics and Autonomous Systems, 135:103690, 2021. ISSN 0921-8890.

Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar Rao, and David Wipf. Variational EM algorithms for non-Gaussian latent variable models. In Y. Weiss, B. Schölkopf, and J.
Platt (eds.), Advances in Neural Information Processing Systems, volume 18. MIT Press, 2005.

Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, November 1901. ISSN 1941-5990.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.

Saber Salehkaleybar, Amir Emad Ghassami, Negar Kiyavash, and Kun Zhang. Learning linear non-Gaussian causal models in the presence of latent variables. Journal of Machine Learning Research, 21(39):1–24, 2020.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611–622, 1999. ISSN 1369-7412.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In David van Dyk and Max Welling (eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pp. 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida, USA, 16–18 Apr 2009. PMLR.

Michalis Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 844–851, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.

Raquel Urtasun and Trevor Darrell. Discriminative Gaussian process latent variable model for classification. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pp.
927–934. ACM, June 2007.

Feng Xie, Biwei Huang, Zhengming Chen, Yangbo He, Zhi Geng, and Kun Zhang. Identification of linear non-Gaussian latent hierarchical structure. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 24370–24387. PMLR, 17–23 Jul 2022.

Michael Minyi Zhang, Gregory W. Gundersen, and Barbara E. Engelhardt. Bayesian non-linear latent variable modeling via random Fourier features. June 2023.

Supplement Document for Bayesian Regularization of Latent Representation

A PROOF OF THEOREM 3.1

Proof of Theorem 3.1. Write $r = r(Y) := \mathrm{tr}(K^{-1}YY^T)$. The gradient of the log-likelihood (5) with respect to $K$ can be found as
$$\frac{\partial L}{\partial K} = -\frac{D}{2}K^{-1} + \frac12\left[ND\left(1-\frac{q}{2}\right)r^{-1} + \frac{q}{2}r^{\frac{q}{2}-1}\right]K^{-1}YY^TK^{-1}. \quad (11)$$
The MLE for $X$ can be found by setting $\frac{\partial L}{\partial X} = 0$, which leads to
$$X = \left[ND\left(1-\frac{q}{2}\right)r^{-1} + \frac{q}{2}r^{\frac{q}{2}-1}\right]D^{-1}YY^TK^{-1}X. \quad (12)$$
Now suppose we have the formal solution of (12) as $X = ULV^T$, where $L$ is an $N\times Q$ matrix whose only nonzero entries $\{l_i\}$ are on the main diagonal, to be determined. Based on the formal solution of $X$, we have
$$K = U(\alpha^{-1}LL^T + \beta^{-1}I)U^T, \qquad r(Y) = \mathrm{tr}(K^{-1}YY^T) = \sum_i D\lambda_i(\alpha^{-1}l_i^2 + \beta^{-1})^{-1}.$$
Denote $h(r) := ND\left(1-\frac{q}{2}\right)r^{-1} + \frac{q}{2}r^{\frac{q}{2}-1}$. We substitute the above quantities into (12) and get
$$ULV^T = U\Lambda(\alpha^{-1}LL^T + \beta^{-1}I)^{-1}LV^T h(r),$$
which reduces to $l_i = \lambda_i(\alpha^{-1}l_i^2 + \beta^{-1})^{-1}l_i\, h(r)$ for each diagonal entry. Let $h(r) = c$ with $c$ to be determined, and assume $l_i \neq 0$. Then we can solve $l_i = \sqrt{\alpha(c\lambda_i - \beta^{-1})}$. This yields $r = c^{-1}D(D-Q)$. Hence
$$ND\left(1-\frac{q}{2}\right)\frac{c}{D(D-Q)} + \frac{q}{2}\left[D(D-Q)\right]^{\frac{q}{2}-1}c^{1-\frac{q}{2}} = c,$$
which solves to
$$c = D^{1-\frac{2}{q}}(D-Q)\left[\frac{q}{2(D-Q)+(q-2)N}\right]^{\frac{2}{q}}.$$
Hence the proof is completed.

Figure A.1: Function c(q) in Theorem 3.1 that regularizes the singular values of the latent variable X. Left two panels: c(q) for fixed N, D, Q; right two panels: c(q) for fixed N and Q = D/10.
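As a quick consistency check (our own evaluation, not part of the original proof), setting q = 2 in the constants of Theorem 3.1 recovers probabilistic PCA:

```latex
% Evaluating the constants of Theorem 3.1 at q = 2 (the Gaussian case):
h(r) = ND\left(1-\tfrac{q}{2}\right)r^{-1} + \tfrac{q}{2}\,r^{\frac{q}{2}-1}
  \;\Big|_{q=2} = 1,
\qquad
c = D^{1-\frac{2}{q}}(D-Q)\left[\frac{q}{2(D-Q)+(q-2)N}\right]^{\frac{2}{q}}
  \;\Big|_{q=2} = (D-Q)\cdot\frac{2}{2(D-Q)} = 1.
% With c = 1, the nonzero singular values become
% l_i = \sqrt{\alpha(\lambda_i - \beta^{-1})},
% matching the probabilistic PCA solution of Tipping & Bishop (1999).
```

So q = 2 applies no extra rescaling of the singular values, while q ≠ 2 shrinks or inflates them through c(q), which is the regularization effect plotted in Figure A.1.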
B VARIATIONAL BAYES FOR QEP-LVM

B.1 EXPECTED KERNEL TERMS

We compute $\psi_0 = \langle \mathrm{tr}(\tilde K_{NN})\rangle_{q(X)}$, $\Psi_1 = \langle K_{NM}\rangle_{q(X)}$, and $\Psi_2 = \langle K_{MN}K_{NM}\rangle_{q(X)}$ as follows:
$$\psi_0^n = \int k(x_n, x_n)\, \mathrm{q\text{-}ED}(x_n|\mu_n, S_n)\,dx_n, \qquad (\Psi_1)_{nm} = \int k(x_n, z_m)\, \mathrm{q\text{-}ED}(x_n|\mu_n, S_n)\,dx_n,$$
$$(\Psi_2^n)_{mm'} = \int k(x_n, z_m)k(z_{m'}, x_n)\, \mathrm{q\text{-}ED}(x_n|\mu_n, S_n)\,dx_n.$$
With the ARD SE kernel (6), we have $\psi_0 = N\alpha^{-1}$. While the integration in $\Psi_1$ and $\Psi_2$ is intractable in general, we can compute them using Monte Carlo approximation. Alternatively, we approximate
$$(\Psi_1)_{nm} \approx \alpha^{-1}\exp\left\langle -\tfrac12 (x_n - z_m)^T \mathrm{diag}(\gamma)(x_n - z_m)\right\rangle_{q(x_n)} = \alpha^{-1}\exp\left\{-\tfrac12\left[(\mu_n - z_m)^T \mathrm{diag}(\gamma)(\mu_n - z_m) + \mathrm{tr}(\mathrm{diag}(\gamma)S_n)\right]\right\},$$
$$(\Psi_2^n)_{mm'} \approx \alpha^{-2}\exp\left\{-\tfrac12\left[\sum_{\tilde m = m, m'}(\mu_n - z_{\tilde m})^T \mathrm{diag}(\gamma)(\mu_n - z_{\tilde m}) + 2\,\mathrm{tr}(\mathrm{diag}(\gamma)S_n)\right]\right\}.$$
If we use the ARD linear form, $k(x, x') = x^T\mathrm{diag}(\gamma)x'$, then we have
$$\psi_0^n = \mathrm{tr}(\mathrm{diag}(\gamma)(\mu_n\mu_n^T + S_n)), \qquad (\Psi_1)_{nm} = \mu_n^T\mathrm{diag}(\gamma)z_m, \qquad (\Psi_2^n)_{mm'} = z_m^T\mathrm{diag}(\gamma)(\mu_n\mu_n^T + S_n)\,\mathrm{diag}(\gamma)z_{m'}.$$

B.2 LOWER BOUND FOR THE K-L DIVERGENCE ADDED TERMS

We also need to compute the K-L divergence $\mathrm{KL}_U := \mathrm{KL}(q(U)\,\|\,p(U))$:
$$\mathrm{KL}_U = \int q(U)\log q(U)\,dU - \int q(U)\log p(U)\,dU = -H_{q(U)} - \langle \log p(U)\rangle_{q(U)}.$$
Denote $r := \mathrm{vec}(U - M)^T \mathrm{diag}(\{\Sigma_d\})^{-1}\mathrm{vec}(U - M)$. Then
$$\log q(U) = -\frac12\sum_{d=1}^D \log|\Sigma_d| + \frac{MD}{2}\left(\frac{q}{2}-1\right)\log r - \frac12 r^{\frac{q}{2}} + \mathrm{const}.$$
From Proposition A.1 of Li et al. (2023) we know that $r^{\frac{q}{2}} \sim \chi^2(MD)$. Therefore the entropy $H_{q(U)}$ is available in closed form in terms of $\frac12\sum_{d=1}^D\log|\Sigma_d|$ and the entropy of the $\chi^2(MD)$ distribution, $H(\chi^2(MD)) = \frac{MD}{2} + \log 2\Gamma\!\left(\frac{MD}{2}\right) + \left(1-\frac{MD}{2}\right)\psi\!\left(\frac{MD}{2}\right)$.
Denote $\varphi_1(r) := -\frac{D}{2}\log|K_{MM}| + \frac{MD}{2}\left(\frac{q}{2}-1\right)\log r - \frac12 r^{\frac{q}{2}}$. Then by Jensen's inequality,
$$\langle \log p(U)\rangle_{q(U)} = \left\langle \varphi_1\big(\mathrm{tr}(U^TK_{MM}^{-1}U)\big)\right\rangle_{q(U)} \geq \varphi_1\left(\langle \mathrm{tr}(U^TK_{MM}^{-1}U)\rangle_{q(U)}\right),$$
$$\langle \mathrm{tr}(U^TK_{MM}^{-1}U)\rangle_{q(U)} = \mathrm{tr}(M^TK_{MM}^{-1}M) + \sum_{d=1}^D \mathrm{tr}(\Sigma_d K_{MM}^{-1}).$$
Therefore we have the K-L divergence added term $\mathrm{KL}_U$ bounded by
$$\mathrm{KL}_U \leq \overline{\mathrm{KL}}_U := -\frac12\sum_{d=1}^D \log|\Sigma_d| - \varphi_1\left(\langle \mathrm{tr}(U^TK_{MM}^{-1}U)\rangle_{q(U)}\right) + \mathrm{const}.$$
Lastly, we bound the K-L divergence term $\mathrm{KL}_X := \mathrm{KL}(q(X)\,\|\,p(X))$ by a similar argument:
$$\mathrm{KL}_X \leq \overline{\mathrm{KL}}_X := -\frac12\sum_{n=1}^N \log|S_n| - \varphi_2\left(\langle \mathrm{tr}(X^TX)\rangle_{q(X)}\right) + \mathrm{const},$$
where $\varphi_2(r) := \frac{NQ}{2}\left(\frac{q}{2}-1\right)\log r - \frac12 r^{\frac{q}{2}}$ and $\langle \mathrm{tr}(X^TX)\rangle_{q(X)} = \mathrm{tr}(\mu^T\mu) + \sum_{n=1}^N \mathrm{tr}(S_n)$.

Figure B.1: Log-likelihood function φ(q). Left two panels: φ(q) for fixed r, D; right two panels: φ(q) for fixed D.
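The moment-matched approximation of (Ψ1)nm above replaces an expectation of an exponential, ⟨exp(−g)⟩, by exp(−⟨g⟩), which by Jensen's inequality is a lower bound since exp is convex. The following toy sketch illustrates this numerically; it uses a Gaussian q(x) as an illustrative stand-in for the q-ED variational factor, and all parameter values (α, γ, μn, Sn, zm) are arbitrary choices, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (arbitrary) parameters for one 2-d latent point x_n
alpha_inv = 1.0                      # kernel variance alpha^{-1}
gamma = np.array([1.0, 1.0])         # ARD inverse squared length-scales
mu = np.array([1.0, 0.5])            # variational mean of x_n
S = 0.5 * np.eye(2)                  # variational covariance of x_n
z = np.zeros(2)                      # an inducing point z_m

# Monte Carlo estimate of <k(x, z)>_q(x), with a Gaussian stand-in for q
x = rng.multivariate_normal(mu, S, size=200_000)
g = 0.5 * np.sum((x - z) ** 2 * gamma, axis=1)
psi1_mc = alpha_inv * np.exp(-g).mean()

# Moment-matched approximation exp(-<g>) as in Appendix B.1
g_mean = 0.5 * ((mu - z) @ (gamma * (mu - z)) + np.trace(np.diag(gamma) @ S))
psi1_approx = alpha_inv * np.exp(-g_mean)

# Jensen: E[exp(-g)] >= exp(-E[g]), so the closed form underestimates Psi_1
print(psi1_mc, psi1_approx)  # → psi1_mc ≈ 0.44, psi1_approx ≈ 0.32
assert psi1_mc >= psi1_approx
```

The gap between the two quantities shrinks as the variational covariance Sn shrinks, which is when the moment-matched closed form is most accurate.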
C MORE NUMERICAL RESULTS

C.1 SWISS HOLE

Figure C.1: Pairwise densities between the regularization parameter q and kernel length-scales l = (l1, l2, l3) in the Swiss roll (left) and Swiss hole (right) problems.

Figure C.2: Latent representations of the Swiss roll (left two) and Swiss hole (right two) by Isomap and t-SNE respectively.

Figure C.3: Latent representation of the Swiss hole dataset. Upper left: 3d cloud of 1000 points; lower left: PCA in the 2d principal space; right 3 columns: 2d latent representations (upper row) by VAE and QEP-LVMs with q = 1.0, 1.5 and 2.0 (GP-LVM) showing a regularization effect via the parameter q, and the corresponding variances of the latent distribution (VAE) and inverse length-scales γ (lower row). Colors are used to aid visualization but not for training.

Figure C.4: Latent representation of the Swiss hole dataset output by QEP-LVM with the optimal q = 1.37 (red line in the rightmost panel) found in the Bayesian framework.

C.2 ROBOT LEARNING

In this section, we consider the robot learning problem in Pahič et al. (2021). Robots can improve their performance by repeating desired behavior and updating the learned skill representation, so it is important to obtain a latent space of motor skills on which statistical and reinforcement learning can be applied to optimize the robot's behavior. We focus on the task of robotic ball throwing at a target, realized by a 7 degree-of-freedom (DOF) Mitsubishi PA-10 arm. Three DOFs of the robot, which contribute to its motion in the sagittal plane, are used for the ball throwing. Dynamic movement primitives (DMPs) are used to represent smooth robot motion (Ijspeert et al., 2013), and with N = 20 DMP weights for each DOF we get 60 weights in the DMP parameter space. Here we test PCA, VAE, beta-VAE, and QEP-LVMs in learning the latent representation of motion skills. In this dataset, there are no labels.
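Since the 60 DMP weights are generated by motions governed by only 3 DOFs, one expects a low intrinsic dimensionality. A toy sketch of detecting it from the PCA eigen-spectrum, on synthetic stand-in data (not the actual robot dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the DMP weight data: 200 motions whose
# 60-dimensional weight vectors vary along only 3 underlying DOFs
latent = rng.standard_normal((200, 3))       # 3 intrinsic coordinates
mixing = rng.standard_normal((3, 60))        # linear embedding into 60-d
Y = latent @ mixing + 1e-3 * rng.standard_normal((200, 60))

# Dominant eigenvalues of the sample covariance reveal the intrinsic dimension
Yc = Y - Y.mean(axis=0)
eigvals = np.linalg.eigvalsh(Yc.T @ Yc / len(Yc))[::-1]
n_dominant = int(np.sum(eigvals > 0.01 * eigvals[0]))
print(n_dominant)  # → 3
```

This eigenvalue-thresholding heuristic mirrors how the lower row of Figure C.5 indicates latent dimensions: by dominant eigenvalues for PCA and by dominant inverse length-scales γ for the ARD kernels of QEP-LVMs.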
We compare the latent spaces generated by the different algorithms in Figure C.5. We set the latent dimension to 10 and visualize the 2d subspaces projected by the t-SNE algorithm. VAE and QEP-LVM (q = 1.0) output more structured latent spaces than the others; in contrast, the latent representations by beta-VAE and GP-LVM appear more dispersed and unstructured. Among all the algorithms, only beta-VAE and QEP-LVM (q = 1.0) correctly identify the intrinsic dimensionality 3 (DOFs) of the latent space.

Figure C.5: Latent representation of the robot learning dataset. Upper: 2d latent representations output by PCA, VAE, beta-VAE, GP-LVM and QEP-LVM (q = 1.0). Lower: latent dimensions indicated by dominant eigenvalues (PCA), variances of the latent distribution (VAEs), and inverse length-scales γ in QEP-LVMs. For the convenience of visualization, the 10-dimensional latent spaces learned by these algorithms are projected to 2-d subspaces by t-SNE.

C.3 OIL FLOW

Figure C.6: Latent representation of the oil flow dataset output by QEP-LVM with the optimal q = 1.57 (red line in the rightmost panel) found in the Bayesian framework.

Figure C.7: Pairwise latent representations of the MNIST database by QEP-LVMs with q = 1.5 (upper) and q = 2.0 (Gaussian, lower).

Figure C.8: Sample MNIST digits by QEP-LVMs with q = 1.5 (upper two rows) and q = 2.0 (Gaussian, lower two rows).
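The visualizations throughout (Figures 4, C.5) project 10-dimensional latent codes to 2-d with t-SNE. A minimal sketch of that projection step, assuming scikit-learn is available and using synthetic latent codes (hypothetical stand-ins for the codes learned by the LVMs):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)

# Hypothetical 10-d latent codes for 3 well-separated classes
centers = rng.standard_normal((3, 10)) * 5.0
X10 = np.vstack([c + 0.3 * rng.standard_normal((50, 10)) for c in centers])

# Project to 2-d for visualization, as done for the figures above
X2 = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(X10)
print(X2.shape)  # → (150, 2)
```

Note that t-SNE is used here purely for display: cluster separation should be judged in the learned latent space itself (e.g., via the ARI/NMI scores in Tables 1 and 2), since t-SNE can distort relative distances.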