# Learning Flat Latent Manifolds with VAEs

Nutan Chen¹, Alexej Klushyn¹, Francesco Ferroni², Justin Bayer¹, Patrick van der Smagt¹

¹Machine Learning Research Lab, Volkswagen Group, Munich, Germany. ²Autonomous Intelligent Driving GmbH, Munich, Germany. Correspondence to: Nutan Chen.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.*

**Abstract**

Measuring the similarity between data points often requires domain knowledge, which can in parts be compensated by relying on unsupervised methods such as latent-variable models, where similarity/distance is estimated in a more compact latent space. Prevalent is the use of the Euclidean metric, which has the drawback of ignoring information about similarity of data stored in the decoder, as captured by the framework of Riemannian geometry. We propose an extension to the framework of variational auto-encoders that allows learning flat latent manifolds, where the Euclidean metric is a proxy for the similarity between data points. This is achieved by defining the latent space as a Riemannian manifold and by regularising the metric tensor to be a scaled identity matrix. Additionally, we replace the compact prior typically used in variational auto-encoders with a recently presented, more expressive hierarchical one, and formulate the learning problem as a constrained optimisation problem. We evaluate our method on a range of data-sets, including a video-tracking benchmark, where the performance of our unsupervised approach nears that of state-of-the-art supervised approaches, while retaining the computational efficiency of straight-line-based approaches.

## 1. Introduction

Measuring the distance between data points is a central ingredient of many data analysis and machine learning applications. Several kernel methods (kernel PCA (Schölkopf et al., 1997), kernel NMF (Li & Ding, 2006), etc.) and other non-parametric approaches such as k-nearest neighbours (Altman, 1992) rely on the availability of a suitable distance function. Computer vision pipelines, e.g. tracking over time, perform matching based on similarity scores. But designing a distance function can be challenging: it is not always obvious how to write down mathematical formulae that accurately express a notion of similarity. Learning such functions has hence proven to be a viable alternative to manual engineering in this respect (NCA (Goldberger et al., 2005), metric learning (Xing et al., 2003), etc.). Often, these methods rely on the availability of pairs labelled as similar or dissimilar.

A different route is that of exploiting the structure that latent-variable models learn. The assumption that a set of high-dimensional observations is explained by points in a much simpler latent space underpins these approaches. In their respective probabilistic versions, a latent prior distribution is transformed non-linearly to give rise to a distribution of observations. The hope is that simple distances, such as the Euclidean distance measured in latent space, implement a function of similarity. Yet, these approaches do not incorporate the variation of the observations with respect to the latent points. For example, the observations will vary much more when a path in latent space crosses a class boundary.
In fact, recent approaches to non-linear latent-variable models, such as the generative adversarial network (Goodfellow et al., 2014) or the variational auto-encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), regularise the latent space to be compact, i.e. to remove low-density regions. This is in contrast to the aforementioned hope that Euclidean distances appropriately reflect similarity.

The above insight leads us to the development of flat manifold variational auto-encoders. This class of VAEs defines the latent space as a Riemannian manifold and regularises the Riemannian metric tensor to be a scaled identity matrix. In this context, a flat manifold is a Riemannian manifold which is isometric to the Euclidean space. In order not to compromise the expressiveness, we relax the compactness assumption and make use of a recently introduced hierarchical prior (Klushyn et al., 2019). As a consequence, the model is capable of learning a latent representation where the Euclidean metric is a proxy for the similarity between data points. This results in a computationally efficient distance metric, which is practical for applications in real-time scenarios.

## 2. Variational Auto-Encoders with Flat Latent Manifolds

### 2.1. Background on Learning Hierarchical Priors in VAEs

Latent-variable models are defined as

$$p(x) = \int p(x\,|\,z)\, p(z)\, \mathrm{d}z, \tag{1}$$

where $z \in \mathbb{R}^{N_z}$ represents the latent variables and $x \in \mathbb{R}^{N_x}$ the observable data. The integral in Eq. (1) is usually intractable, but it can be approximated by maximising the evidence lower bound (ELBO) (Kingma & Welling, 2014; Rezende et al., 2014):

$$\mathbb{E}_{p_{\mathcal{D}}(x)}\big[\log p_\theta(x)\big] \;\geq\; \mathbb{E}_{p_{\mathcal{D}}(x)}\Big[\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x\,|\,z)\big] - \mathrm{KL}\big(q_\phi(z\,|\,x)\,\|\,p(z)\big)\Big], \tag{2}$$

where $p_{\mathcal{D}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)$ is the empirical distribution of the data $\mathcal{D} = \{x_i\}_{i=1}^{N}$. The distribution parameters of the approximate posterior $q_\phi(z\,|\,x)$ and the likelihood $p_\theta(x\,|\,z)$ are represented by neural networks. The prior $p(z)$ is usually defined as a standard normal distribution. This model is commonly referred to as the variational auto-encoder (VAE).

However, a standard normal prior often leads to an over-regularisation of the approximate posterior, which results in a less informative learned latent representation of the data (Tomczak & Welling, 2018; Klushyn et al., 2019). To enable the model to learn an informative latent representation, Klushyn et al. (2019) propose to use a flexible hierarchical prior $p_\Theta(z) = \int p_\Theta(z\,|\,\zeta)\, p(\zeta)\, \mathrm{d}\zeta$, where $p(\zeta)$ is the standard normal distribution. Since the optimal prior is the aggregated posterior (Tomczak & Welling, 2018), the above integral is approximated by an importance-weighted (IW) bound (Burda et al., 2015) based on samples from $q_\phi(z\,|\,x)$. This leads to a model with two stochastic layers and the following upper bound on the KL term:

$$\mathbb{E}_{p_{\mathcal{D}}(x)}\big[\mathrm{KL}\big(q_\phi(z\,|\,x)\,\|\,p_\Theta(z)\big)\big] \;\leq\; \mathcal{F}(\phi, \Theta, \Phi) = \mathbb{E}_{p_{\mathcal{D}}(x)}\,\mathbb{E}_{q_\phi(z|x)}\bigg[\log q_\phi(z\,|\,x) - \mathbb{E}_{\zeta_{1:K}\sim q_\Phi(\zeta|z)}\Big[\log \frac{1}{K}\sum_{i=1}^{K}\frac{p_\Theta(z\,|\,\zeta_i)\, p(\zeta_i)}{q_\Phi(\zeta_i\,|\,z)}\Big]\bigg], \tag{3}$$

where $K$ is the number of importance samples.
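For illustration, a minimal PyTorch-style sketch of a Monte-Carlo estimate of the bound $\mathcal{F}(\phi, \Theta, \Phi)$ in Eq. (3) is given below. The module names (`encoder`, `prior_net`, `aux_posterior`) and the Gaussian parameterisation are assumptions made for the sketch, not the authors' implementation:

```python
import math

import torch
from torch.distributions import Independent, Normal

def iw_kl_bound(x, encoder, prior_net, aux_posterior, K=16):
    """Monte-Carlo estimate of the bound F(phi, Theta, Phi) in Eq. (3).

    Assumed (hypothetical) interfaces:
      encoder(x)       -> (mean, std) of q_phi(z|x)
      aux_posterior(z) -> (mean, std) of q_Phi(zeta|z)
      prior_net(zeta)  -> (mean, std) of p_Theta(z|zeta), applied over leading dims
    """
    # q_phi(z|x) and a reparameterised sample z
    mu_z, std_z = encoder(x)
    q_z = Independent(Normal(mu_z, std_z), 1)
    z = q_z.rsample()

    # K importance samples zeta_1:K ~ q_Phi(zeta|z)
    mu_zeta, std_zeta = aux_posterior(z)
    q_zeta = Independent(Normal(mu_zeta, std_zeta), 1)
    zeta = q_zeta.rsample((K,))                                   # (K, batch, N_zeta)

    # log p_Theta(z|zeta_i), log p(zeta_i) and log q_Phi(zeta_i|z)
    mu_pz, std_pz = prior_net(zeta)
    log_p_z_given_zeta = Independent(Normal(mu_pz, std_pz), 1).log_prob(z)
    log_p_zeta = Independent(Normal(torch.zeros_like(zeta),
                                    torch.ones_like(zeta)), 1).log_prob(zeta)
    log_q_zeta = q_zeta.log_prob(zeta)

    # log (1/K) sum_i p_Theta(z|zeta_i) p(zeta_i) / q_Phi(zeta_i|z)
    log_w = log_p_z_given_zeta + log_p_zeta - log_q_zeta          # (K, batch)
    iw_log_prior = torch.logsumexp(log_w, dim=0) - math.log(K)

    # F = E[ log q_phi(z|x) - iw_log_prior ], an upper bound on the KL term
    return (q_z.log_prob(z) - iw_log_prior).mean()
```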
Since it has been shown that high ELBO values do not necessarily correlate with informative latent representations (Alemi et al., 2018; Higgins et al., 2017), which is also the case for hierarchical models (Sønderby et al., 2016), different optimisation approaches have been introduced (Bowman et al., 2016; Sønderby et al., 2016). Klushyn et al. (2019) follow the line of argument in (Rezende & Viola, 2018) and reformulate the resulting ELBO as the Lagrangian of a constrained optimisation problem:

$$\mathcal{L}_{\mathrm{VHP}}(\theta, \phi, \Theta, \Phi; \lambda) = \mathcal{F}(\phi, \Theta, \Phi) + \lambda\,\Big(\mathbb{E}_{p_{\mathcal{D}}(x)}\,\mathbb{E}_{q_\phi(z|x)}\big[C_\theta(x, z)\big] - \kappa^2\Big), \tag{4}$$

with the optimisation objective $\mathcal{F}(\phi, \Theta, \Phi)$, the inequality constraint $\mathbb{E}_{p_{\mathcal{D}}(x)}\,\mathbb{E}_{q_\phi(z|x)}\big[C_\theta(x, z)\big] \leq \kappa^2$, and the Lagrange multiplier $\lambda$. $C_\theta(x, z)$ is defined as the reconstruction-error-related term in $\log p_\theta(x\,|\,z)$. Thus, we obtain the following optimisation problem:

$$\min_\theta \min_{\Theta, \Phi} \max_\lambda \min_\phi\; \mathcal{L}_{\mathrm{VHP}}(\theta, \phi, \Theta, \Phi; \lambda) \quad \text{s.t.} \quad \lambda \geq 0. \tag{5}$$

Building on that, the authors propose an optimisation algorithm, including a $\lambda$-update scheme, to achieve a tight lower bound on the log-likelihood. This approach is referred to as the variational hierarchical prior (VHP) VAE.

### 2.2. Learning Flat Latent Manifolds with VAEs

The VHP-VAE is able to learn a latent representation that corresponds to the topology of the data manifold (Klushyn et al., 2019). However, it is not guaranteed that the (Euclidean) distance between encoded data in the latent space is a sufficient distance metric in relation to the observation space. In this work, we aim to measure the distance/difference of observed data directly in the latent space by means of the Euclidean distance between the encodings.

Chen et al. (2018a) and Arvanitidis et al. (2018) define the latent space of a VAE as a Riemannian manifold. This approach allows for computing the observation-space length of a trajectory $\gamma: [0, 1] \to \mathbb{R}^{N_z}$ in the latent space:

$$L(\gamma) = \int_0^1 \sqrt{\dot{\gamma}(t)^{\mathrm{T}}\, \mathbf{G}\big(\gamma(t)\big)\, \dot{\gamma}(t)}\; \mathrm{d}t, \tag{6}$$

where $\mathbf{G} \in \mathbb{R}^{N_z \times N_z}$ is the Riemannian metric tensor and $\dot{\gamma}(t)$ the time derivative. We define the observation-space distance as the shortest possible path

$$D = \min_\gamma L(\gamma) \tag{7}$$

between two data points. The trajectory $\gamma^{*} = \arg\min_\gamma L(\gamma)$ that minimises $L(\gamma)$ is referred to as the (minimising) geodesic. In the context of VAEs, $\gamma$ is transformed by a continuous function $f(\gamma(t))$, the decoder, to the observation space. The metric tensor is defined as $\mathbf{G}(z) = \mathbf{J}(z)^{\mathrm{T}} \mathbf{J}(z)$, where $\mathbf{J}$ is the Jacobian of the decoder.

To measure the observation-space distance directly in the latent space, distances in the observation space should be proportional to distances in the latent space:

$$D \propto \|z(1) - z(0)\|_2, \tag{8}$$

where we define the Euclidean distance as the distance metric. This requires that the Riemannian metric tensor is $\mathbf{G} \propto \mathbb{1}$. As a consequence, the Euclidean distance in the latent space corresponds to the observation-space distance. We refer to a manifold with this property as a flat manifold (Lee, 2006). To obtain a flat latent manifold, the model typically needs to learn complex latent representations of the data (see experiments in Sec. 4). Therefore, we propose the following approach: (i) to enable our model to learn complex latent representations, we apply a flexible prior (VHP), which is learned by the model (empirical Bayes); and (ii) we regularise the curvature of the decoder such that $\mathbf{G} \propto \mathbb{1}$.

For this purpose, the VHP-VAE, introduced in Sec. 2.1, is extended by a Jacobian-regularisation term. We define the regularisation term as part of the optimisation objective, which is in line with the constrained-optimisation setting. The resulting objective function is

$$\mathcal{L}(\theta, \phi, \Theta, \Phi; \lambda, \eta, c^2) = \mathcal{L}_{\mathrm{VHP}}(\theta, \phi, \Theta, \Phi; \lambda) + \eta\, \mathbb{E}_{p_{\mathcal{D}}(x)}\,\mathbb{E}_{q_\phi(z|x)}\Big[\big\|\mathbf{G}(z) - c^2\,\mathbb{1}\big\|_2^2\Big], \tag{9}$$

where $\eta$ is a hyper-parameter determining the influence of the regularisation and $c$ the scaling factor.
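As an illustration of Eqs. (6) to (8), the following sketch computes the pull-back metric $\mathbf{G}(z) = \mathbf{J}(z)^{\mathrm{T}}\mathbf{J}(z)$ with automatic differentiation and approximates the length of a latent curve by discretising the integral; `decoder` is a placeholder for the decoder mean function $f$:

```python
import torch
from torch.autograd.functional import jacobian

def metric_tensor(decoder, z):
    """G(z) = J(z)^T J(z), the metric induced by the decoder f at a single point z."""
    J = jacobian(decoder, z)          # shape (N_x, N_z) for z of shape (N_z,)
    return J.T @ J                    # shape (N_z, N_z)

def curve_length(decoder, z_points):
    """Discretisation of Eq. (6): sum of sqrt(dz^T G dz) along a polygonal curve."""
    length = torch.zeros(())
    for z0, z1 in zip(z_points[:-1], z_points[1:]):
        dz = z1 - z0
        G = metric_tensor(decoder, 0.5 * (z0 + z1))   # metric at the segment midpoint
        length = length + torch.sqrt(dz @ G @ dz)
    return length
```

When $\mathbf{G}(z) \approx c^2\,\mathbb{1}$, this discretised length reduces to $c$ times the Euclidean length of the curve, which is exactly the property the regulariser in Eq. (9) aims for.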
Additionally, we use a stochastic approximation (first-order Taylor expansion) of the Jacobian to improve the computational efficiency (Rifai et al., 2011b):

$$\mathbf{J}_t(z) = \lim_{\sigma \to 0} \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2)}\big[f(z + \epsilon\, e_t) - f(z)\big], \tag{10}$$

where $\mathbf{J}_t \in \mathbb{R}^{N_x}$ is the Jacobian of the $t$-th latent dimension and $e_t$ a standard basis vector. This approximation method allows for a faster computation of the gradient and avoids the second-derivative problem of piece-wise linear layers (Chen et al., 2018a).

However, the influence of the regularisation term in Eq. (9) on the decoder function is limited to regions where data is available. To overcome this issue, we propose to use mixup (Zhang et al., 2018), a data-augmentation method which was introduced in the context of supervised learning. We extend this method to the VAE framework (unsupervised learning) by applying it to encoded data in the latent space. This approach allows augmenting data by interpolating between two encoded data points $z_i$ and $z_j$:

$$g(z_i, z_j) = (1 - \alpha)\, z_i + \alpha\, z_j, \tag{11}$$

with $x_i, x_j \sim p_{\mathcal{D}}(x)$, $z_i \sim q_\phi(z\,|\,x_i)$, $z_j \sim q_\phi(z\,|\,x_j)$, and $\alpha \sim U(-\alpha_0, 1 + \alpha_0)$. In contrast to (Zhang et al., 2018), where $\alpha \in [0, 1]$ limits the data augmentation to convex combinations only, we define $\alpha_0 > 0$ to take into account the outer edge of the data manifold. By combining mixup in Eq. (11) with Eq. (9), we obtain the objective function of our flat manifold VAE (FMVAE):

$$\mathcal{L}_{\mathrm{VHP\text{-}FMVAE}}(\theta, \phi, \Theta, \Phi; \lambda, \eta, c^2) = \mathcal{L}_{\mathrm{VHP}}(\theta, \phi, \Theta, \Phi; \lambda) + \eta\, \mathbb{E}_{x_{i,j} \sim p_{\mathcal{D}}(x)}\, \mathbb{E}_{z_{i,j} \sim q_\phi(z|x_{i,j})}\Big[\big\|\mathbf{G}\big(g(z_i, z_j)\big) - c^2\, \mathbb{1}\big\|_2^2\Big].$$

Inspired by batch normalisation, we define the squared scaling factor to be the mean over the batch samples and diagonal elements of $\mathbf{G}$ (see App. A.2 for empirical support):

$$c^2 = \mathbb{E}_{x_{i,j} \sim p_{\mathcal{D}}(x)}\, \mathbb{E}_{z_{i,j} \sim q_\phi(z|x_{i,j})}\Big[\tfrac{1}{N_z}\, \mathrm{tr}\big(\mathbf{G}(g(z_i, z_j))\big)\Big].$$

The optimisation algorithm (Alg. 1) and further details about the optimisation process can be found in App. A.4. By using augmented data, we regularise $\mathbf{G}$ to be a scaled identity matrix for the entire latent space enclosed by the data manifold. As a consequence, the function $f(z)$ (the decoder) is, up to the scaling factor $c$, distance-preserving, since $D_x(f(z_i), f(z_j)) \approx c\, D_z(z_i, z_j)$, where $D_x$ and $D_z$ refer to the distance in the observation and latent space, respectively.
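A minimal sketch of the VHP-FMVAE regularisation term, combining latent mixup (Eq. 11), the stochastic Jacobian approximation (Eq. 10) and the batch estimate of $c^2$, could look as follows. Hyper-parameter values, the treatment of $c^2$ as a per-batch constant, and the flattened decoder output are assumptions of this sketch rather than the authors' implementation:

```python
import torch

def fm_regulariser(decoder, z_batch, sigma=1e-2, alpha0=0.1):
    """Flat-manifold penalty E[ || G(g(z_i, z_j)) - c^2 * I ||^2 ].

    z_batch: samples z ~ q_phi(z|x), shape (batch, N_z).
    decoder: latent-to-observation mapping f; its output is assumed to be
             flattened to shape (batch, N_x).
    sigma, alpha0: illustrative values, not the authors' settings.
    """
    batch, n_z = z_batch.shape

    # latent mixup, Eq. (11): g(z_i, z_j) = (1 - alpha) z_i + alpha z_j
    perm = torch.randperm(batch)
    alpha = torch.rand(batch, 1) * (1 + 2 * alpha0) - alpha0      # U(-alpha0, 1 + alpha0)
    z_mix = (1 - alpha) * z_batch + alpha * z_batch[perm]

    # single-sample estimate of the stochastic Jacobian approximation, Eq. (10)
    f_z = decoder(z_mix)                                          # (batch, N_x)
    cols = []
    for t in range(n_z):
        step = torch.zeros(batch, n_z)
        step[:, t] = sigma * torch.randn(batch)                   # eps ~ N(0, sigma^2)
        cols.append((decoder(z_mix + step) - f_z) / sigma)        # J_t, shape (batch, N_x)
    J = torch.stack(cols, dim=-1)                                 # (batch, N_x, N_z)

    # metric tensor G = J^T J and the batch estimate of c^2 (mean of diag(G)),
    # treated here as a constant per batch (an assumption of this sketch)
    G = J.transpose(1, 2) @ J                                     # (batch, N_z, N_z)
    c2 = G.diagonal(dim1=1, dim2=2).mean().detach()
    eye = torch.eye(n_z).expand_as(G)

    # squared Frobenius norm of (G - c^2 * I), averaged over the augmented batch
    return ((G - c2 * eye) ** 2).sum(dim=(1, 2)).mean()
```

During training, this term would be weighted by $\eta$ and added to the VHP objective of Eq. (4).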
## 3. Related Work

**Interpretation of the VAE's latent space.** In general, the latent space of VAEs is considered to be Euclidean (e.g. Kingma et al., 2016; Higgins et al., 2017), but it is not constrained to be Euclidean. This can be problematic if we require a precise metric that is based on the latent space. Some recent works (Mathieu et al., 2019; Grattarola et al., 2018) adapted the latent space to be non-Euclidean to match the data structure. We solve the problem from another perspective: we enforce the latent space to be Euclidean.

**Jacobian and Hessian regularisation.** In (Rifai et al., 2011a), the authors proposed to regularise the Jacobian and Hessian of the encoder. However, it is more difficult to augment data in the observation space than in the latent space. In (Hadjeres et al., 2017), the Jacobian of the decoder was regularised to be as small as possible/zero. By contrast, we regularise the Riemannian metric tensor to be a scaled identity matrix, and hence the Jacobian to be constant, and hence the Hessian to be zero. (Nie & Patel, 2019) regularised the Jacobian with respect to the weights for GANs. In terms of supervised learning, (Jakubovitz & Giryes, 2018) used Jacobian regularisation to improve the robustness of classification.

**Metric learning.** Various metric learning approaches for both deep supervised and unsupervised models have been proposed. For instance, deep metric learning (Hoffer & Ailon, 2015) uses a triplet network for supervised learning. (Karaletsos et al., 2016) introduced an unsupervised metric learning method, where a VAE is combined with triplets. However, a human oracle is still required. By contrast, our approach is completely based on unsupervised learning, using the Euclidean distance in the latent space as a distance metric. Our proposed method is similar to metric learning methods such as Large Margin Nearest Neighbour (Weinberger & Saul, 2009), which pulls target neighbours together and pushes impostors away. The difference is that our approach is an unsupervised method.

**Constraints in latent space.** Constraints on time (e.g. Wang et al., 2007; Chen et al., 2016; 2015) allow obtaining similar distance metrics in the latent space. However, due to missing data between different sequence steps, constraints on time cannot guarantee that the metric is correct between different sequences. Our method, by contrast, can be used for general data-sets.

**Data augmentation.** The latent space is formed arbitrarily in regions where data is missing. Zhang et al. (2018) proposed mixup, an approach for augmenting data and labels for supervised learning. Various follow-up studies of mixup have been developed, such as (Verma et al., 2019; Beckham et al., 2019). (Verma et al., 2019) considered mixup of hidden representations of training data to flatten the class-specific state distribution. We extend mixup to the VAE framework (unsupervised learning) by applying it to encoded data in the latent space of generative models. This facilitates the regularisation of regions where no data is available. As a consequence, similarity of data points can be measured in the latent space by applying the Euclidean metric.

**Geodesic.** Recent studies on geodesics for generative models (e.g. Tosi et al., 2014; Chen et al., 2018a; Arvanitidis et al., 2018) focus on methods for computing/finding the geodesic in the latent space. By contrast, we use the geodesic/Riemannian distance for influencing the learned latent manifold. (Frenzel et al., 2019) projected the latent space to a new latent space where the geodesic is equivalent to the Euclidean interpolation. However, these two separate processes, VAEs and projection, probably hinder the model from finding the latent features autonomously. Another difference is that previous work assumes that distances, defined by geodesics, can only be measured by following the data manifold. This is useful in scenarios such as avoiding unseen barriers between two data points, e.g. (Chen et al., 2018b), but it does not allow measuring distances between different categories. In this work, we focus on learning a general distance metric.

## 4. Experiments

We test our method on artificial pendulum images, human motion data, MNIST, and MOT16. We measure the performance in terms of equidistances, interpolation smoothness, and distance computation. Additionally, our method is applied to a real-world environment, a video-tracking benchmark, where the tracking and re-identification capabilities are evaluated.

The Riemannian metric tensor has many intrinsic properties of a manifold and measures local angles, length, surface area, and volumes (Bronstein et al., 2017).
Therefore, the models are quantified based on the Riemannian metric tensor by computing condition numbers and magnification factors. The condition number, which shows the ratio of the most elongated to the least elongated direction, is defined as $\kappa(\mathbf{G}) = S_{\max}(\mathbf{G}) / S_{\min}(\mathbf{G})$, where $S_{\max}$ and $S_{\min}$ are the largest and smallest eigenvalues of $\mathbf{G}$. The magnification factor $\mathrm{MF}(z) \equiv \sqrt{\det \mathbf{G}(z)}$ (Bishop et al., 1997) depicts the sensitivity of the likelihood functions. When projecting from the Riemannian (latent) to the Euclidean (observation) space, the MF can be considered a scaling coefficient. Since we cannot directly compare the MFs of different models, the MFs are normalised/divided by their means. The closer the condition number and the normalised MF are to one, the more invariant is the model with respect to the Riemannian metric tensor. In other words: the condition number and the normalised MF are metrics to evaluate whether $\mathbf{G}(z)$ is approximately constant and proportional to $\mathbb{1}$.

In order to make the visualisations of the magnification factor in Sec. 4.1 (Fig. 1) and Sec. 4.2 (Fig. 3 & Fig. 7) comparable, we define the respective upper range of the colour-bar as

$$\max\big(\mathrm{MF}_{\mathrm{VHP\text{-}VAE}}(\text{grid area})\big) \cdot \frac{\mathrm{mean}\big(\mathrm{MF}_{\mathrm{VHP\text{-}FMVAE}}(\text{data})\big)}{\mathrm{mean}\big(\mathrm{MF}_{\mathrm{VHP\text{-}VAE}}(\text{data})\big)},$$

where $\mathrm{MF}(\text{data})$ and $\mathrm{MF}(\text{grid area})$ are computed with training data and by using a grid area, respectively.

To be in line with previous literature (e.g. Higgins et al., 2017; Sønderby et al., 2016), we use the $\beta$-parametrisation of the Lagrange multiplier, $\beta = 1/\lambda$, in our experiments.

### 4.1. Artificial Pendulum Data-set

The pendulum data-set (Klushyn et al., 2019; Chen et al., 2018a) consists of 16×16-pixel images generated by a pendulum simulator. The pendulum has one degree of freedom, and the joint is located in the centre of the images. We generated $15 \times 10^3$ images with joint angles uniformly distributed in the range of [0, 360) degrees. Additionally, we added 0.05 Gaussian noise to each pixel.

As seen in Fig. 1, without regularisation, the contour lines are denser in the centre of the latent space. The reason is that, in contrast to the VHP-VAE, the regularisation term in the VHP-FMVAE smoothens the latent space ($\mathbf{G} \approx c^2\,\mathbb{1}$), visualised by the MF and the equidistance plots. In Fig. 2, the VHP-FMVAE and the VHP-VAE are compared in terms of condition number and normalised MF. In both cases the VHP-FMVAE outperforms the VHP-VAE.

*Figure 1 ((a) VHP-FMVAE, (b) VHP-VAE; grey-scale: magnification factor [log scale]; axis: rotation angle). Latent representation of pendulum data: the contour plots illustrate curves of equal observation-space distance to the respective encoded data point. Distances are calculated using Eq. (6). The grey-scale displays MF(z). Note: round, homogeneous contour plots indicate that $\mathbf{G}(z) \propto \mathbb{1}$.*

*Figure 2 (box-plots of condition number and normalised MF). Pendulum data: if both the condition number and the normalised MF values are close to one, it indicates that $\mathbf{G}(z) \propto \mathbb{1}$. The box-plots are based on 1,000 generated samples.*

### 4.2. Human Motion Capture Database

To evaluate our approach on the CMU human motion data-set (http://mocap.cs.cmu.edu), we select five different movements: walking (subject 35), jogging (subject 35), balancing (subject 49), punching (subject 143), and kicking (subject 74). After data pre-processing, the input data is a 50-dimensional vector of the joint angles. Note that the data-set is not balanced: walking, for example, has more data points than jogging.
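Throughout this section, models are compared via the condition number and the normalised MF, both of which can be computed directly from the eigenvalues and determinant of $\mathbf{G}(z)$. A small NumPy sketch, assuming the metric tensors have already been evaluated at a set of latent points:

```python
import numpy as np

def condition_number(G):
    """kappa(G) = S_max(G) / S_min(G), the ratio of largest to smallest eigenvalue."""
    eigvals = np.linalg.eigvalsh(G)        # G = J^T J is symmetric positive semi-definite
    return eigvals[-1] / eigvals[0]

def magnification_factor(G):
    """MF(z) = sqrt(det G(z)) (Bishop et al., 1997)."""
    return np.sqrt(np.linalg.det(G))

def normalised_mf(G_samples):
    """Divide the MFs of one model by their mean; values near one indicate G(z) ~ c^2 * I."""
    mfs = np.array([magnification_factor(G) for G in G_samples])
    return mfs / mfs.mean()
```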
*Figure 3 ((a) VHP-FMVAE, (b) VHP-VAE; legend: equidistance, Euclidean interpolation, geodesic; movements: walking, balancing, jogging, punching, kicking; grey-scale: magnification factor [log scale]). Latent representation of human motion data: the contour plots illustrate curves of equal observation-space distance to the respective encoded data point. The grey-scale displays MF(z). Note: round, homogeneous contour plots indicate that $\mathbf{G}(z) \propto \mathbb{1}$. In case of the VHP-FMVAE (a), jogging is a large-range movement compared with walking, so that jogging is reasonably distributed over a larger area in the latent space than walking. By contrast, in case of the VHP-VAE (b), the latent representation of walking is larger than that of jogging. Additionally, geodesics are compared to the corresponding Euclidean interpolations. The Euclidean interpolations in (a) are much closer to the geodesics.*

Table 1. Verification of the distance metric. The table shows the length ratio of the Euclidean interpolation to the geodesic. Additionally, we list the ratio of the related distances in the observation space.

| Data-set | Method | Observation | Latent |
|---|---|---|---|
| Human | VHP-FMVAE | 1.02 ± 0.06 | 0.93 ± 0.03 |
| Human | VHP-VAE | 1.23 ± 0.20 | 0.82 ± 0.10 |
| MNIST | VHP-FMVAE | 1.01 ± 0.08 | 0.92 ± 0.05 |
| MNIST | VHP-VAE | 1.13 ± 0.22 | 0.70 ± 0.31 |

**Equidistance plots.** In Fig. 3, we randomly select a data point from each class as the centre of an equidistance plot. In case of our proposed method, the equidistance plots are homogeneous, while in case of the VHP-VAE, the equidistance contour lines are distorted in regions of high MF values. Thus, the mapping from latent to observation space learned by the VHP-FMVAE is approximately distance-preserving. Additionally, we use the condition number and the normalised MF to evaluate $\mathbf{G}$ based on 3,000 random samples. In contrast to the VHP-VAE, both the condition number and the normalised MF values of the VHP-FMVAE are close to one, which indicates that $\mathbf{G}(z) \propto \mathbb{1}$.

*Figure 4 (box-plots of condition number and normalised MF, including VHP-FMVAE w/o $c^2\mathbb{1}$ and VHP-FMVAE w/o mixup). Human motion data: if both the condition number and the normalised MF values are close to one, it indicates that $\mathbf{G}(z) \propto \mathbb{1}$. The box-plots are based on 3,000 generated samples.*

*Figure 5 (smoothness factor over joint index 1 to 50; joint groups: left leg, right leg, head and torso, left arm, right arm; VHP-FMVAE vs. VHP-VAE). Smoothness measure of the human-movement interpolations. The mean and standard deviation are displayed for each joint: the smaller the value, the smoother the interpolation.*

*Figure 6 ((a) VHP-FMVAE, (b) VHP-VAE). Human-movement reconstructions of Euclidean interpolations in the latent space. Discontinuities in the motions are marked by blue boxes.*

*Figure 7 ((a) VHP-FMVAE without mixup, (b) VHP-FMVAE without the identity term $c^2\mathbb{1}$; grey-scale: magnification factor [log scale]). Influence of the data augmentation and the identity term $c^2\mathbb{1}$ on the learned latent representation of human movement data. The movements are coloured as in Fig. 3. (a) If mixup is not applied, regions where data is missing (e.g., between two movements) have a high MF and distorted equidistance contours. (b) Regularising the metric tensor, and hence the Jacobian, to be zero does not allow the model to learn a flat latent manifold. The equidistance contours are scaled differently at various locations in the latent space. Without the $c^2\mathbb{1}$ term, as in (Hadjeres et al., 2017), the model cannot reduce the distance for points with high similarities: walking, for instance, is not squeezed as in Fig. 3a, and is therefore not distributed over a smaller area than jogging.*
**Smoothness.** We randomly sample 100 pairs of points and linearly interpolate between each pair. The second derivative of each trajectory is defined as the smoothness factor. Fig. 5 illustrates that the VHP-FMVAE significantly outperforms the VHP-VAE in terms of smoothness. Fig. 6 shows five examples of the interpolated trajectories.

**Verification of the distance metric.** To verify that the Euclidean distance in the latent space corresponds to the geodesic distance, we approximate the geodesic by using a graph-based approach (Chen et al., 2019). The graph of the baseline has 14,400 nodes, which are sampled in the latent space using a uniform distribution. Each node has 12 neighbours. In Fig. 3, five geodesics each are compared to the corresponding Euclidean interpolations. Tab. 1 shows the ratios of Euclidean distances in latent space to geodesic distances, as well as the related ratios in the observation space. To compute the ratios, we randomly sampled 100 pairs of points and interpolated between each pair. If the ratio of the distances is close to one, the Euclidean interpolation approximates the geodesic. The VHP-FMVAE outperforms the VHP-VAE.

**Influence of the data augmentation and the identity term $c^2\mathbb{1}$.** Fig. 4 and Fig. 7a show the influence of the data augmentation (see Sec. 2.2). Without data augmentation, the influence of the regularisation term is limited to regions where data is available, as verified by the high MF values between the different movements. As an additional experiment, Fig. 4 and Fig. 7b illustrate the influence of the identity term $c^2\mathbb{1}$. If we remove it, the regularisation term becomes $\|\mathbf{G}(g(z_i, z_j))\|_2^2$. As a consequence, the model is not able to learn a flat latent manifold.

### 4.3. MNIST

The binarised MNIST data-set (Larochelle & Murray, 2011) consists of 50,000 training and 10,000 test images of handwritten digits (zero to nine) with 28×28 pixels in size.

Both of our evaluation metrics, the condition number and the normalised MF, show that the VHP-FMVAE outperforms the VHP-VAE (see Fig. 8 and Fig. 9). In contrast to the VHP-VAE, the VHP-FMVAE learns a latent space where Euclidean distances are close to geodesic distances (see Tab. 1). This indicates that $\mathbf{G}(z)$ is approximately constant.

*Figure 8 ((a) VHP-FMVAE, (b) VHP-VAE; colours: digit classes 0 to 9). Latent representation of MNIST data: the contour plots illustrate curves of equal observation-space distance to the respective encoded data point (denoted by a black dot).*

*Figure 9 (box-plots of condition number and normalised MF). MNIST data: if both the condition number and the normalised MF values are close to one, it indicates that $\mathbf{G}(z) \propto \mathbb{1}$. The box-plots are based on 10,000 generated samples.*

### 4.4. MOT16 Object-Tracking Database

We evaluate our approach on the MOT16 object-tracking database (Milan et al., 2016), which is a large-scale person re-identification data-set containing both static and dynamic scenes from diverse cameras.

*Figure 10 ((b) DeepSORT, (c) VHP-VAE-SORT, (d) VHP-FMVAE-SORT with η = 3000). Example identity switches between overlapping tracks. For vanilla SORT, track 3260 gets occluded and, when subsequently visible, it gets assigned a new ID 3421. For DeepSORT and VHP-VAE-SORT, the occluding track gets assigned the same ID as the track it occludes (42/61), and subsequently keeps this (erroneous) track. For VHP-FMVAE-SORT, track 42 gets occluded, but is re-identified correctly when again visible.*
Table 2. Comparisons between different descriptors for the purposes of object tracking and re-identification (Ristani et al., 2016). The bold and the red numbers denote the best results among all methods and among non-supervised methods, respectively.

| Method | Type | IDF1 | IDP | IDR | Recall | Precision | FAR | MT |
|---|---|---|---|---|---|---|---|---|
| VHP-FMVAE-SORT η = 300 (ours) | unsupervised | 63.7 | 77.0 | 54.3 | 65.0 | 92.3 | 1.12 | 158 |
| VHP-FMVAE-SORT η = 3000 (ours) | unsupervised | 64.2 | 77.6 | 54.8 | 65.1 | 92.3 | 1.13 | 162 |
| VHP-VAE-SORT | unsupervised | 60.5 | 72.3 | 52.1 | 65.8 | 91.4 | 1.28 | 170 |
| SORT | n.a. | 57.0 | 67.4 | 49.4 | 66.4 | 90.6 | 1.44 | 158 |
| DeepSORT | supervised | 64.7 | 76.9 | 55.8 | 66.7 | 91.9 | 1.22 | 180 |

| Method | PT | ML | FP | FN | IDS | FM | MOTA | MOTP | MOTAL |
|---|---|---|---|---|---|---|---|---|---|
| VHP-FMVAE-SORT η = 300 (ours) | 269 | 90 | 5950 | 38592 | 616 | 1143 | 59.1 | 81.8 | 59.7 |
| VHP-FMVAE-SORT η = 3000 (ours) | 265 | 90 | 6026 | 38515 | 598 | 1163 | 59.1 | 81.8 | 59.7 |
| VHP-VAE-SORT | 266 | 81 | 6820 | 37739 | 693 | 1264 | 59.0 | 81.6 | 59.6 |
| SORT | 275 | 84 | 7643 | 37071 | 1486 | 1515 | 58.2 | 81.9 | 59.5 |
| DeepSORT | 250 | 87 | 6506 | 36747 | 585 | 1165 | 60.3 | 81.6 | 60.8 |

We compare with two baselines: SORT (Bewley et al., 2016) and DeepSORT (Wojke et al., 2017). SORT is a simple online and real-time tracking method, which uses bounding-box intersection-over-union (IOU) for associating detections between frames and Kalman filters for the track predictions. It relies on good two-dimensional bounding-box detections from a separate detector, and suffers from ID switching when tracks overlap in the image. DeepSORT extends the original SORT algorithm to integrate appearance information based on a deep appearance descriptor, which helps with re-identification in the case of such overlaps or missed detections. The deep appearance descriptor is trained using a supervised cosine metric learning approach (Wojke & Bewley, 2018). The candidate object locations of the pre-generated detections for SORT, DeepSORT and our method are taken from (Yu et al., 2016). Further details regarding the implementation can be found in App. A.3.

We use the following metrics for evaluation; (↑) indicates that a higher score is better, (↓) that a lower score is better:

- IDF1 (↑): ID F1 score
- IDP (↑): ID precision
- IDR (↑): ID recall
- FAR (↓): false alarm ratio
- MT (↑): mostly tracked trajectories
- PT: partially tracked trajectories
- ML (↓): mostly lost trajectories
- FP (↓): false positives
- FN (↓): false negatives
- IDs (↓): number of times an ID switches to a different previously tracked object
- FM (↓): fragmentations
- MOTA (↑): multi-object tracking accuracy
- MOTP (↑): multi-object tracking precision
- MOTAL (↑): log tracking accuracy

Tab. 2 shows that the performance of the proposed method is better than that of the model without Jacobian regularisation, and even close to the performance of supervised learning. All methods depend on the same underlying detector for object candidates and identical Kalman-filter parameters. Compared to the baseline SORT, which does not utilise any appearance information, DeepSORT has 2.54 times, VHP-VAE-SORT has 2.14 times, VHP-FMVAE-SORT (η = 300) has 2.41 times and VHP-FMVAE-SORT (η = 3000) has 2.48 times fewer ID switches. Whilst the supervised DeepSORT descriptor has the fewest, using unsupervised VAEs with flat decoders leads to only 2.2% more switches, without the need for labels. Furthermore, by ensuring a quasi-Euclidean latent space, one can query nearest neighbours efficiently via data structures such as kD-trees. Fig. 10 shows an example of the results; in other examples of the videos, VHP-FMVAE-SORT works similarly to DeepSORT. Videos of the results can be downloaded at: http://tiny.cc/0s71cz
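Such nearest-neighbour queries in the quasi-Euclidean latent space can be served by standard spatial indices; a small sketch using SciPy's `cKDTree`, where the array names are illustrative and the encodings are assumed to be posterior means from $q_\phi(z\,|\,x)$:

```python
import numpy as np
from scipy.spatial import cKDTree

# latent_gallery: encodings of previously tracked appearance crops, shape (n_tracks, N_z)
# latent_query:   encoding of a newly detected crop, shape (N_z,)
def nearest_track(latent_gallery, latent_query, k=1):
    tree = cKDTree(latent_gallery)
    dist, idx = tree.query(latent_query, k=k)   # Euclidean distance as the similarity proxy
    return dist, idx
```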
## 5. Conclusion

In this paper, we have proposed a novel approach, which we call the flat manifold variational auto-encoder. We have shown that this class of VAEs learns a latent representation where the Euclidean metric is a proxy for the similarity between data points. This is realised by interpreting the latent space as a Riemannian manifold and by combining a powerful empirical Bayes prior with a regularisation method that constrains the Riemannian metric tensor to be a scaled identity matrix. Experiments on several data-sets have shown the effectiveness of our proposed algorithm for measuring similarity. In the case of the MOT16 object-tracking database, the performance of our unsupervised method nears that of state-of-the-art supervised approaches.

## Acknowledgements

We would like to thank Botond Cseke and Alexandros Paraschos for helpful discussions.

## References

Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. ICML, 2018.

Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.

Arvanitidis, G., Hansen, L. K., and Hauberg, S. Latent space oddity: on the curvature of deep generative models. In ICLR, 2018.

Beckham, C., Honari, S., Lamb, A. M., Verma, V., Ghadiri, F., Hjelm, R. D., Bengio, Y., and Pal, C. On adversarial mixup resynthesis. NeurIPS, 2019.

Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. Simple online and realtime tracking. In IEEE ICIP, pp. 3464–3468, 2016.

Bishop, C. M., Svensén, M., and Williams, C. K. Magnification factors for the SOM and GTM algorithms. In Proceedings Workshop on Self-Organizing Maps, 1997.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. CoNLL, 2016.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.

Chen, N., Bayer, J., Urban, S., and van der Smagt, P. Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. In IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 434–440, 2015.

Chen, N., Karl, M., and van der Smagt, P. Dynamic movement primitives in latent space of time-dependent variational autoencoders. In IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pp. 629–636, 2016.

Chen, N., Klushyn, A., Kurle, R., Jiang, X., Bayer, J., and van der Smagt, P. Metrics for deep generative models. In AISTATS, pp. 1540–1550, 2018a.

Chen, N., Klushyn, A., Paraschos, A., Benbouzid, D., and van der Smagt, P. Active learning based on data uncertainty and model sensitivity. IEEE/RSJ IROS, 2018b.

Chen, N., Ferroni, F., Klushyn, A., Paraschos, A., Bayer, J., and van der Smagt, P.
Fast approximate geodesics for deep generative models. In ICANN, 2019.

Frenzel, M. F., Teleaga, B., and Ushio, A. Latent space cartography: Generalised metric-inspired measures and measure-based transformations for generative models. arXiv preprint arXiv:1902.02113, 2019.

Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. R. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pp. 513–520, 2005.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.

Grattarola, D., Zambon, D., Alippi, C., and Livi, L. Learning graph embeddings on constant-curvature manifolds for change detection in graph streams. arXiv preprint arXiv:1805.06299, 2018.

Hadjeres, G., Nielsen, F., and Pachet, F. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. In IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7, 2017.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. Beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

Hoffer, E. and Ailon, N. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.

Jakubovitz, D. and Giryes, R. Improving DNN robustness to adversarial attacks using Jacobian regularization. In ECCV, pp. 514–529, 2018.

Karaletsos, T., Belongie, S., and Rätsch, G. Bayesian representation learning with oracle constraints. ICLR, 2016.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. ICLR, 2014.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational inference with inverse autoregressive flow. NIPS, 2016.

Klushyn, A., Chen, N., Kurle, R., Cseke, B., and van der Smagt, P. Learning hierarchical priors in VAEs. NeurIPS, 2019.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.

Lee, J. M. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.

Li, T. and Ding, C. The relationships among various nonnegative matrix factorization methods for clustering. In International Conference on Data Mining, pp. 362–371. IEEE, 2006.

Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. Hierarchical representations with Poincaré variational auto-encoders. NeurIPS, 2019.

Milan, A., Leal-Taixé, L., Reid, I., Roth, S., and Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.

Nie, W. and Patel, A. Towards a better understanding and regularization of GAN training dynamics. In UAI, 2019.

Rezende, D. J. and Viola, F. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, volume 32, pp. 1278–1286, 2014.

Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pp. 2294–2302, 2011a.

Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. Higher order contractive auto-encoder. In ECML-PKDD, pp. 645–660. Springer, 2011b.
Ristani, E., Solera, F., Zou, R. S., Cucchiara, R., and Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. CoRR, abs/1609.01775, 2016.

Schölkopf, B., Smola, A., and Müller, K.-R. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Springer, 1997.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. NIPS, 2016.

Tomczak, J. M. and Welling, M. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223, 2018.

Tosi, A., Hauberg, S., Vellido, A., and Lawrence, N. D. Metrics for probabilistic geometries. In UAI, pp. 800–808, 2014.

Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Courville, A., Lopez-Paz, D., and Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. ICML, 2019.

Wang, J. M., Fleet, D. J., and Hertzmann, A. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007.

Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

Wojke, N. and Bewley, A. Deep cosine metric learning for person re-identification. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 748–756, 2018.

Wojke, N., Bewley, A., and Paulus, D. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing, pp. 3645–3649, 2017.

Xing, E. P., Jordan, M. I., Russell, S. J., and Ng, A. Y. Distance metric learning with application to clustering with side-information. In NIPS, pp. 521–528, 2003.

Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., and Yan, J. POI: Multiple object tracking with high performance detection and appearance feature. CoRR, abs/1610.06136, 2016.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. International Conference on Learning Representations, 2018.