# DISCO Nets: DISsimilarity COefficient Networks

Diane Bouchacourt, University of Oxford, diane@robots.ox.ac.uk
M. Pawan Kumar, University of Oxford, pawan@robots.ox.ac.uk
Sebastian Nowozin, Microsoft Research Cambridge, sebastian.nowozin@microsoft.com

**Abstract.** We present a new type of probabilistic model which we call DISsimilarity COefficient Networks (DISCO Nets). DISCO Nets allow us to efficiently sample from a posterior distribution parametrised by a neural network. During training, DISCO Nets are learned by minimising the dissimilarity coefficient between the true distribution and the estimated distribution. This allows us to tailor the training to the loss related to the task at hand. We empirically show that (i) by modeling uncertainty on the output value, DISCO Nets outperform equivalent non-probabilistic predictive networks and (ii) DISCO Nets accurately model the uncertainty of the output, outperforming existing probabilistic models based on deep neural networks.

## 1 Introduction

We are interested in the class of problems that require the prediction of a structured output $y \in \mathcal{Y}$ given an input $x \in \mathcal{X}$. Complex applications often have large uncertainty on the correct value of $y$. For example, consider the task of hand pose estimation from depth images, where one wants to accurately estimate the pose $y$ of a hand given a depth image $x$. The depth image often has occlusions and missing depth values, and this results in uncertainty on the pose of the hand. It is, therefore, natural to use probabilistic models that are capable of representing this uncertainty. Often, the capacity of the model is restricted and cannot represent the true distribution perfectly. In this case, the choice of the learning objective influences the final performance. Similar to Lacoste-Julien et al. [12], we argue that the learning objective should be tailored to the evaluation loss in order to obtain the best performance with respect to this loss. In detail, we denote by $\Delta_{\text{training}}$ the loss function employed during model training, and by $\Delta_{\text{task}}$ the loss employed to evaluate the model's performance.

We present a simple example to illustrate this point. We consider a data distribution that is a mixture of two bidimensional Gaussians, and two models to capture it. Each model is able to represent a single bidimensional Gaussian distribution with diagonal covariance, parametrised by $(\mu_1, \mu_2, \sigma_1, \sigma_2)$. Neither model can recover the true data distribution, since neither has the ability to represent a mixture of Gaussians. In other words, we cannot avoid model error, similarly to the real-data scenario. Each model uses its own training loss $\Delta_{\text{training}}$. Model A employs a loss that emphasises the first dimension of the data, specified for $x = (x_1, x_2),\; x' = (x'_1, x'_2) \in \mathbb{R}^2$ by

$$\Delta_A(x, x') = \left(10\,(x_1 - x'_1)^2 + 0.1\,(x_2 - x'_2)^2\right)^{1/2}.$$

Model B does the opposite and employs the loss

$$\Delta_B(x, x') = \left(0.1\,(x_1 - x'_1)^2 + 10\,(x_2 - x'_2)^2\right)^{1/2}.$$

Each model performs a grid search over the best parameter values for $(\mu_1, \mu_2, \sigma_1, \sigma_2)$. Figure 1 shows the contours of the mixture-of-Gaussians distribution of the data (in black), and the contour of the Gaussian fitted by each model (in red and green). The detailed setting of this example is available in the supplementary material.
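For concreteness, the following is a minimal sketch of this toy experiment, not the authors' exact protocol (which is in their supplementary material): it fits a diagonal Gaussian by grid search, scoring each candidate parameter set with a Monte Carlo estimate of the expected loss between data samples and model samples. The mixture parameters and the grids are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: mixture of two bidimensional Gaussians (means/scales are
# illustrative choices, not the ones used in the paper).
data = np.concatenate([
    rng.normal([-1.0, 0.0], 0.3, size=(500, 2)),
    rng.normal([+1.0, 0.0], 0.3, size=(500, 2)),
])

def loss(x, xp, w1, w2):
    """Weighted Euclidean loss; Delta_A has (w1, w2) = (10, 0.1)."""
    return np.sqrt(w1 * (x[:, 0] - xp[:, 0])**2 + w2 * (x[:, 1] - xp[:, 1])**2)

def fit_gaussian(w1, w2, mgrid=np.linspace(-2, 2, 9), sgrid=np.linspace(0.1, 1, 4)):
    """Grid search over (mu1, mu2, sigma1, sigma2), scoring each candidate by
    a Monte Carlo estimate of E[loss(data sample, model sample)]."""
    best, best_score = None, np.inf
    for mu1 in mgrid:
        for mu2 in mgrid:
            for s1 in sgrid:
                for s2 in sgrid:
                    samples = rng.normal([mu1, mu2], [s1, s2], size=data.shape)
                    score = loss(data, samples, w1, w2).mean()
                    if score < best_score:
                        best, best_score = (mu1, mu2, s1, s2), score
    return best

print("Model A (emphasises dim 1):", fit_gaussian(10.0, 0.1))
print("Model B (emphasises dim 2):", fit_gaussian(0.1, 10.0))
```

Under $\Delta_A$ the fit prioritises matching the first coordinate, and under $\Delta_B$ the second; this is the behaviour summarised in Table 1 below.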
**Table 1:** $\Delta_{\text{task}}$ ± SEM (standard error of the mean) with respect to the $\Delta_{\text{training}}$ employed. Evaluation is done on the test set.

| $\Delta_{\text{training}}$ \ $\Delta_{\text{task}}$ | $\Delta_A$ | $\Delta_B$ |
|---|---|---|
| $\Delta_A$ | 11.6 ± 0.287 | 13.7 ± 0.331 |
| $\Delta_B$ | 12.1 ± 0.305 | 11.0 ± 0.257 |

*Figure 1: Contour lines of the Gaussian distribution fitted by each model on the mixture-of-Gaussians data distribution. Best viewed in color.*

As expected, the fitted Gaussian distributions differ according to the $\Delta_{\text{training}}$ employed. Table 1 shows that the loss on the test set, evaluated with $\Delta_{\text{task}}$, is minimised when $\Delta_{\text{training}} = \Delta_{\text{task}}$. This simple example illustrates the advantage of being able to tailor the model's training objective so that $\Delta_{\text{training}} = \Delta_{\text{task}}$. This is in contrast to the commonly employed learning objectives presented in Section 2, which are agnostic to the evaluation loss.

In order to alleviate this deficiency of the state of the art, we introduce DISCO Nets, a new class of probabilistic model. DISCO Nets represent $P$, the true posterior distribution of the data, with a distribution $Q$ parametrised by a neural network. We design a learning objective based on a dissimilarity coefficient between $P$ and $Q$. The dissimilarity coefficient we employ was first introduced by Rao [23] and is defined for any non-negative symmetric loss function. Thus, any such loss can be incorporated in our setting, allowing the user to tailor DISCO Nets to his or her needs. Finally, contrary to the existing probabilistic models presented in Section 2, DISCO Nets do not require any specific architecture or training procedure, making them an efficient and easy-to-use class of model.

## 2 Related Work

Deep neural networks, and in particular Convolutional Neural Networks (CNNs), are comprised of several convolutional layers, followed by one or more fully connected (dense) layers, interleaved with non-linear function(s) and (optionally) pooling. Recent probabilistic models use CNNs to represent non-linear functions of the data. We observe that such models separate into two types.

The first type of model does not explicitly compute the probability distribution of interest. Rather, these models allow the user to sample from this distribution by feeding the CNN with some noise $z$. Among such models, the Generative Adversarial Networks (GAN) presented in Goodfellow et al. [7] are very popular and have been used in several computer vision applications, for example in Denton et al. [1], Radford et al. [22], Springenberg [25] and Yan et al. [28]. A GAN model consists of two networks, trained simultaneously in an adversarial manner. A generative model, referred to as the Generator $G$, is trained to replicate the data from noise, while an adversarial discriminative model, referred to as the Discriminator $D$, is trained to identify whether a sample comes from the true data or from $G$. The GAN training objective is based on a minimax game between the two networks and approximately optimises a Jensen-Shannon divergence. However, as mentioned in Goodfellow et al. [7] and Radford et al. [22], GAN models require very careful design of the network architecture, and their training procedure is tedious and tends to oscillate. GAN models have been generalised to conditional GAN (cGAN) in Mirza and Osindero [16], where some additional input information can be fed to the Generator and the Discriminator. For example, in Mirza and Osindero [16] a cGAN model generates tags corresponding to an image. Gauthier [4] applies cGAN to face generation.
Reed et al. [24] propose to generate images of flowers with a cGAN model, where the conditional information is a word description of the flower to generate.¹ While the application of cGAN is very promising, little quantitative evaluation has been done. Furthermore, cGAN models suffer from the same difficulties we mentioned for GAN.

¹ At the time of writing, we do not have access to the full paper of Reed et al. [24] and therefore cannot take advantage of this work in our experimental comparison.

Another line of work has developed towards the use of statistical hypothesis testing to learn probabilistic models. In Dziugaite et al. [2] and Li et al. [14], the authors propose to train generative deep networks with an objective function based on the Maximum Mean Discrepancy (MMD) criterion. The MMD method (see Gretton et al. [8, 9]) is a statistical hypothesis test assessing whether two probability distributions are similar. As mentioned in Dziugaite et al. [2], the MMD test can be seen as playing the role of an adversary.

The second type of model approximates intractable posterior distributions with the use of variational inference. The Variational Auto-Encoder (VAE) presented in Kingma and Welling [10] is composed of a probabilistic encoder and a probabilistic decoder. The probabilistic encoder is fed with the input $x \in \mathcal{X}$ and produces a posterior distribution $P(z|x)$ over the possible values of the noise $z$ that could have generated $x$. The probabilistic decoder learns to map the noise $z$ back to the data space $\mathcal{X}$. The training of VAE uses an objective function based on a Kullback-Leibler divergence. VAE and GAN models have been combined in Makhzani et al. [15], where the authors propose to regularise autoencoders with an adversarial network. The adversarial network ensures that the posterior distribution $P(z|x)$ matches an arbitrary prior $P(z)$.

In hand pose estimation, imagine the user wants to obtain accurate positions of the thumb and the index finger but does not need accurate locations of the other fingers. The task loss $\Delta_{\text{task}}$ might be based on a weighted L2-norm between the predicted and the ground-truth poses, with high weights on the thumb and the index. Existing probabilistic models cannot be tailored to such task-specific losses, and we propose DISsimilarity COefficient Networks (DISCO Nets) to alleviate this deficiency.

## 3 DISCO Nets

We begin the description of our model by specifying how it can be used to generate samples from the posterior distribution, and how the samples can in turn be employed to provide a pointwise estimate. In the subsequent subsection, we describe how to estimate the parameters of the model.

### 3.1 Prediction

**Sampling.** A DISCO Net consists of several convolutional and dense layers (interleaved with non-linear function(s) and possibly pooling) and takes as input a pair $(x, z) \in \mathcal{X} \times \mathcal{Z}$, where $x$ is input data and $z$ is some random noise. Given one pair $(x, z)$, the DISCO Net produces a value for the output $y$. In the example of hand pose estimation, the input depth image $x$ is fed to the convolutional layers. The output of the last convolutional layer is flattened and concatenated with a noise sample $z$. The resulting vector is fed to several dense layers, and the last dense layer outputs a pose $y$. From a single depth image $x$, by using different noise samples, the DISCO Net produces different pose candidates for the depth image. This process is illustrated in Figure 2. Importantly, DISCO Nets are flexible in the choice of the architecture: for example, the noise could be concatenated at any stage of the network, including at the start.
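To make the sampling process concrete, here is a minimal sketch (ours, not the authors' code) of a generator that maps a pair $(x, z)$ to a candidate output; repeated calls with fresh noise yield the candidate set. The layer sizes are placeholders, not the architecture used in the experiments of Section 4.

```python
import torch
import torch.nn as nn

class TinyDiscoNet(nn.Module):
    """Minimal DISCO Net-style generator: features of x are concatenated
    with noise z before the dense layers (placeholder sizes)."""
    def __init__(self, d_x=32, d_z=8, d_y=4):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64 + d_z, 64), nn.ReLU(),
                                  nn.Linear(64, d_y))

    def forward(self, x, z):
        return self.head(torch.cat([self.features(x), z], dim=-1))

net = TinyDiscoNet()
x = torch.randn(1, 32)                  # one input
zs = torch.rand(10, 8) * 2 - 1          # K = 10 noise samples in [-1, 1]
candidates = net(x.expand(10, -1), zs)  # 10 candidate outputs for the same x
```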
*Figure 2: For a single depth image $x$, using 3 different noise samples $(z_1, z_2, z_3)$, DISCO Nets output 3 different candidate poses $(y_1, y_2, y_3)$ (shown superimposed on the depth image). The depth image is from the NYU Hand Pose Dataset of Tompson et al. [27], preprocessed as in Oberweger et al. [17]. Best viewed in color.*

We denote by $Q$ the distribution that is parametrised by the DISCO Net's neural network. For a given input $x$, DISCO Nets provide the user with samples $y$ drawn from $Q(y|x)$ without requiring the expensive computation of the (often intractable) partition function. In the remainder of the paper we consider $x \in \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_y}$ and $z \in \mathbb{R}^{d_z}$.

**Pointwise Prediction.** In order to obtain a single prediction $y$ for a given input $x$, DISCO Nets use the principle of Maximum Expected Utility (MEU), similarly to Premachandran et al. [21]. The prediction $y^*_{\text{task}}$ maximises the expected utility, or equivalently minimises the expected task-specific loss $\Delta_{\text{task}}$, estimated using the sampled candidates. Formally, the prediction is made as follows:

$$y^*_{\text{task}} = \underset{k \in [1,K]}{\operatorname{argmax}} \; \mathrm{EU}(y_k) = \underset{k \in [1,K]}{\operatorname{argmin}} \sum_{k'=1}^{K} \Delta_{\text{task}}(y_k, y_{k'}) \qquad (1)$$

where $(y_1, \ldots, y_K)$ are the candidate outputs sampled for the single input $x$. Details on the MEU method are in the supplementary material.

### 3.2 Learning DISCO Nets

**Objective Function.** We want DISCO Nets to accurately model the true probability $P(y|x)$ via $Q(y|x)$. In other words, $Q(y|x)$ should be as similar as possible to $P(y|x)$. This similarity is evaluated with respect to the loss $\Delta$ specific to the task at hand. Given any non-negative symmetric loss function $\Delta(y, y')$ between two outputs $(y, y') \in \mathcal{Y} \times \mathcal{Y}$, we employ a diversity coefficient that is the expected loss between two samples drawn randomly from the two distributions. Formally, the diversity coefficient is defined as:

$$\mathrm{DIV}_\Delta(P, Q, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}(x)}\Big[\mathbb{E}_{y \sim P(y|x)}\big[\mathbb{E}_{y' \sim Q(y'|x)}\big[\Delta(y, y')\big]\big]\Big] \qquad (2)$$

Intuitively, we should minimise $\mathrm{DIV}_\Delta(P, Q, \mathcal{D})$ so that $Q(y|x)$ is as similar as possible to $P(y|x)$. However, there is uncertainty on the output $y$ to predict for a given $x$. In other words, $P(y|x)$ is diverse, and $Q(y|x)$ should be diverse as well. Thus we encourage $Q(y|x)$ to provide sample outputs, for a given $x$, that are diverse by minimising the following dissimilarity coefficient:

$$\mathrm{DISC}_\Delta(P, Q, \mathcal{D}) = \mathrm{DIV}_\Delta(P, Q, \mathcal{D}) - \gamma\,\mathrm{DIV}_\Delta(Q, Q, \mathcal{D}) - (1 - \gamma)\,\mathrm{DIV}_\Delta(P, P, \mathcal{D}) \qquad (3)$$

with $\gamma \in [0, 1]$. The dissimilarity $\mathrm{DISC}_\Delta(P, Q, \mathcal{D})$ is the difference between the diversity between $P$ and $Q$ and an affine combination of the diversity of each distribution, given $x \sim \mathcal{D}(x)$. These coefficients were introduced by Rao [23] with $\gamma = 1/2$ and used for latent variable models by Kumar et al. [11]. We do not need to consider the term $\mathrm{DIV}_\Delta(P, P, \mathcal{D})$ as it is a constant in our problem, and thus the DISCO Nets objective function is defined as follows:

$$F = \mathrm{DIV}_\Delta(P, Q, \mathcal{D}) - \gamma\,\mathrm{DIV}_\Delta(Q, Q, \mathcal{D}) \qquad (4)$$

When minimising $F$, the term $-\gamma\,\mathrm{DIV}_\Delta(Q, Q, \mathcal{D})$ encourages $Q(y|x)$ to be diverse. The value of $\gamma$ balances between the two goals of $Q(y|x)$, namely providing accurate outputs while being diverse.
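As a quick numerical illustration of why the diversity term matters (ours, not from the paper), consider a discrete toy case where $P(y|x)$ puts mass $1/2$ on two outputs and the loss is the absolute difference. Minimising $\mathrm{DIV}(P, Q)$ alone cannot distinguish a collapsed predictor from one that matches $P$, whereas the $\gamma$ term can:

```python
import itertools

# Two equally likely ground-truth outputs at 0 and 1; absolute-difference loss.
P = {0.0: 0.5, 1.0: 0.5}
delta = lambda a, b: abs(a - b)

def div(p, q):
    """DIV(p, q) = E_{y~p, y'~q}[delta(y, y')] for discrete distributions."""
    return sum(wp * wq * delta(a, b)
               for (a, wp), (b, wq) in itertools.product(p.items(), q.items()))

for q in [{0.5: 1.0},               # deterministic "mean" prediction
          {0.0: 0.5, 1.0: 0.5}]:    # matches P
    for gamma in (0.0, 0.5):
        print(q, "gamma =", gamma, "F =", round(div(P, q) - gamma * div(q, q), 3))
```

With $\gamma = 0$ both predictors score $F = 0.5$, whereas with $\gamma = 0.5$ the distribution matching $P$ scores $F = 0.25$ and is strictly preferred; this is exactly the behaviour the diversity term is designed to induce.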
**Optimisation.** Let us consider a training dataset composed of $N$ input-output pairs $\mathcal{D} = \{(x_n, y_n),\; n = 1..N\}$. In order to train DISCO Nets, we need to compute the objective function of equation (4). We do not have knowledge of the true probability distributions $P(y, x)$ and $P(x)$. To overcome this deficiency, we construct estimators of each diversity term, $\mathrm{DIV}_\Delta(P, Q)$ and $\mathrm{DIV}_\Delta(Q, Q)$. First, we take an empirical distribution of the data, that is, the ground-truth pairs $(x_n, y_n)$. We then estimate each distribution $Q(y|x_n)$ by sampling $K$ outputs from our model for each $x_n$. This gives us an unbiased estimate of each diversity term, defined as:

$$\widehat{\mathrm{DIV}}_\Delta(P, Q, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{K} \sum_{k=1}^{K} \Delta\big(y_n,\, G(z_k, x_n; \theta)\big)$$

$$\widehat{\mathrm{DIV}}_\Delta(Q, Q, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{K(K-1)} \sum_{k=1}^{K} \sum_{\substack{k'=1 \\ k' \neq k}}^{K} \Delta\big(G(z_k, x_n; \theta),\, G(z_{k'}, x_n; \theta)\big) \qquad (5)$$

We thus have an unbiased estimate of the DISCO Nets objective function of equation (4):

$$\hat{F}(\Delta, \theta) = \widehat{\mathrm{DIV}}_\Delta(P, Q, \mathcal{D}) - \gamma\,\widehat{\mathrm{DIV}}_\Delta(Q, Q, \mathcal{D}) \qquad (6)$$

where $y_k = G(z_k, x_n; \theta)$ is a candidate output sampled from the DISCO Net for $(x_n, z_k)$, and $\theta$ are the parameters of the DISCO Net. It is important to note that the second term of equation (6) sums over $k$ and $k' \neq k$ in order to remain unbiased; we therefore compute the loss between pairs of different samples $G(z_k, x_n; \theta)$ and $G(z_{k'}, x_n; \theta)$. The parameters $\theta$ are learned by gradient descent. Algorithm 1 shows the training of DISCO Nets. In steps 4 and 5 of Algorithm 1, we draw $K$ random noise vectors $(z_{n,1}, \ldots, z_{n,K})$ per input example $x_n$, and generate $K$ candidate outputs $G(z_{n,k}, x_n; \theta)$ per input. This allows us to compute an unbiased estimate of the gradient in step 7. For clarity, in the remainder of the paper we do not explicitly write the parameters $\theta$ and write $G(z_k, x_n)$.

**Algorithm 1:** DISCO Nets training algorithm.
1. for $t = 1 \ldots T$ epochs do
2. Sample a minibatch of $b$ training example pairs $\{(x_1, y_1), \ldots, (x_b, y_b)\}$.
3. for $n = 1 \ldots b$ do
4. Sample $K$ random noise vectors $(z_{n,1}, \ldots, z_{n,K})$ for training example $x_n$.
5. Generate $K$ candidate outputs $G(z_{n,k}, x_n; \theta)$, $k = 1..K$, for training example $x_n$.
6. end
7. Update parameters $\theta_t \leftarrow \theta_{t-1}$ by descending the gradient of equation (6): $\nabla_\theta \hat{F}(\Delta, \theta)$.

### 3.3 Strictly Proper Scoring Rules

**Scoring Rule for Learning.** A scoring rule $S(Q, P)$, as defined in Gneiting and Raftery [5], evaluates the quality of a predictive distribution $Q$ with respect to a true distribution $P$. When using a scoring rule one should ensure that it is proper, meaning that it is maximised when $P = Q$. A scoring rule is said to be strictly proper if $P = Q$ is the unique maximiser of $S$. Hence maximising a proper scoring rule ensures that the model aims at predicting relevant forecasts. Gneiting and Raftery [5] define the score divergence corresponding to a proper scoring rule $S$:

$$d(Q, P) = S(P, P) - S(Q, P) \qquad (7)$$

If $S$ is proper, $d$ is a valid non-negative divergence function, with value $0$ if (and, in the strictly proper case, only if) $Q = P$. The MMD criterion (see Gretton et al. [8, 9]) mentioned in Section 2 is an example of this type of divergence. In our case, any loss expressed with a universal kernel makes the DISCO Nets objective function such a divergence (see Zawadzki and Lahaie [29]). For example, by Theorem 5 of Gneiting and Raftery [5], if we take as loss function

$$\Delta_\beta(y, y') = \|y - y'\|_2^\beta = \Big(\sum_{i=1}^{d_y} |y_i - y'_i|^2\Big)^{\beta/2}, \quad \beta \in (0, 2),$$

our training objective is (the negative of) a strictly proper scoring rule, that is:

$$\hat{F}(\Delta, \theta) = \frac{1}{N} \sum_{n=1}^{N} \Bigg[ \frac{1}{K} \sum_{k=1}^{K} \|y_n - G(z_k, x_n)\|_2^\beta \;-\; \frac{1}{2}\,\frac{1}{K(K-1)} \sum_{k' \neq k} \|G(z_{k'}, x_n) - G(z_k, x_n)\|_2^\beta \Bigg] \qquad (8)$$

This score has been referred to in the literature as the Energy Score (Gneiting and Raftery [5], Gneiting et al. [6], Pinson and Tastu [19]). By employing a (strictly) proper scoring rule we ensure that our objective function is (only) minimised at the true distribution $P$, and we expect DISCO Nets to generalise better on unseen data. We show below that strictly proper scoring rules are also relevant to assess the quality of the distribution $Q$ captured by the model.
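To make the training objective concrete, here is a minimal PyTorch-style sketch (ours) of the minibatch estimate of equation (6) with the energy-score loss of equation (8). `net` stands for any generator $G(z, x; \theta)$ taking $(x, z)$, such as the sketch in Section 3.1; the small epsilon inside the norm is our addition to keep gradients finite at coincident samples.

```python
import torch

def disco_objective(net, x, y, K=10, d_z=8, beta=1.0, gamma=0.5):
    """Unbiased minibatch estimate of equation (6) with the energy-score
    loss of equation (8). x: (b, d_x) inputs, y: (b, d_y) ground truths."""
    b = x.shape[0]
    z = torch.rand(b, K, d_z) * 2 - 1                    # noise in [-1, 1]
    x_rep = x.unsqueeze(1).expand(-1, K, -1)             # (b, K, d_x)
    cand = net(x_rep.reshape(b * K, -1),
               z.reshape(b * K, -1)).reshape(b, K, -1)   # (b, K, d_y)

    def norm(d, eps=1e-12):                              # stabilised Euclidean norm
        return (d.pow(2).sum(dim=-1) + eps).sqrt()

    # DIV(P, Q): loss between the ground truth and each candidate.
    div_pq = norm(cand - y.unsqueeze(1)).pow(beta).mean()

    # DIV(Q, Q): loss between pairs of *different* candidates (k != k').
    pair = norm(cand.unsqueeze(2) - cand.unsqueeze(1)).pow(beta)  # (b, K, K)
    pair = pair * (1.0 - torch.eye(K))                   # zero out k == k' terms
    div_qq = pair.sum() / (b * K * (K - 1))

    return div_pq - gamma * div_qq
```

Calling `disco_objective(net, x, y).backward()` inside a standard SGD loop would implement steps 2 to 7 of Algorithm 1.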
**Discriminative power of proper scoring rules.** As observed in Fukumizu et al. [3], kernel density estimation (KDE) fails in high-dimensional output spaces. Our goal is to compare the quality of the distribution captured by two models, $Q_1$ and $Q_2$. In our setting, $Q_1$ models $P$ better than $Q_2$, according to the scoring rule $S$ and its associated divergence $d$, if $d(Q_1, P) < d(Q_2, P)$. As noted in Pinson and Tastu [19], $S$ being proper does not ensure $d(Q_1, y) < d(Q_2, y)$ for all observations $y$ drawn from $P$. However, if the scoring rule is strictly proper, this property should be ensured in the neighbourhood of the true distribution.

## 4 Experiments: Hand Pose Estimation

Given a depth image $x$, which often contains occlusions and missing values, we wish to predict the hand pose $y$. We use the NYU Hand Pose dataset of Tompson et al. [27] to evaluate the efficiency of DISCO Nets for this task.

### 4.1 Experimental Setup

**NYU Hand Pose Dataset.** The NYU Hand Pose dataset of Tompson et al. [27] contains 8,252 testing and 72,757 training frames of captured RGBD data with ground-truth hand pose information. The training set is composed of images of one person, whereas the testing set gathers samples from two persons. For each frame, the RGBD data from 3 Kinects is provided: a frontal view and 2 side views. In our experiments we use only the depth data from the frontal view. While the ground truth contains $J = 36$ annotated joints, we follow the evaluation protocol of Oberweger et al. [17, 18] and use the same subset of $J = 14$ joints. We also perform the same data preprocessing as in Oberweger et al. [17, 18], and extract a fixed-size metric cube around the hand from the depth image. We resize the depth values within the cube to a $128 \times 128$ patch and normalise them to $[-1, 1]$. Pixels deeper than the back of the cube and missing depth values are both set to a depth of $1$.

**Methods.** We employ loss functions between two outputs of the form of the Energy Score (8), that is, $\Delta_{\text{training}} = \Delta_\beta(y, y') = \|y - y'\|_2^\beta$. Our first goal is to assess the advantages of DISCO Nets with respect to non-probabilistic deep networks. One model, referred to as DISCO$_{\beta,\gamma}$, is a DISCO Nets probabilistic model with $\gamma \neq 0$ in the dissimilarity coefficient of equation (6). When taking $\gamma = 0$, noise is still injected and the model capacity is the same as DISCO$_{\beta,\gamma \neq 0}$. The model BASE$_\beta$ is a non-probabilistic model, obtained by taking $\gamma = 0$ in the objective function of equation (6) and concatenating no noise. This corresponds to a classic deep network which, for a given input $x$, generates a single output $y = G(x)$. Note that we write $G(x)$ and not $G(z, x)$ since no noise is concatenated.

**Evaluation Metrics.** We report the classic non-probabilistic metrics for hand pose estimation employed in Oberweger et al. [17, 18] and Taylor et al. [26]: the Mean Joint Euclidean Error (MeJEE), the Max Joint Euclidean Error (MaJEE) and the Fraction of Frames within distance $d$ (FF). We refer the reader to the supplementary material for the detailed expression of these metrics. These metrics use the Euclidean distance between the prediction and the ground truth and require a single pointwise prediction, which is chosen with the MEU method among the $K$ candidates. We also add a probabilistic metric, ProbLoss. ProbLoss is defined as in equation (8) with the Euclidean norm and is the divergence associated with a strictly proper scoring rule. Thus, ProbLoss ranks the ability of the models to represent the true distribution. ProbLoss is computed using $K$ candidate poses for a given depth image. For the non-probabilistic model BASE$_\beta$, only a single pointwise predicted output $y$ is available. We construct the $K$ candidates by adding Gaussian random noise of mean $0$ and diagonal covariance $\Sigma = \sigma \mathbf{1}$, with $\sigma \in \{1\text{mm}, 5\text{mm}, 10\text{mm}\}$, and refer to the model as BASE$_{\beta,\sigma}$.²

² We also evaluate the non-probabilistic model BASE$_\beta$ using its pointwise prediction rather than the MEU method. Results are consistent and detailed in the supplementary material.
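For concreteness, the MEU selection of equation (1) used for the pointwise metrics can be sketched as follows (our sketch; `candidates` is a $(K, d_y)$ tensor of sampled outputs for one input, and we use the Euclidean norm as the task loss, matching the metrics above):

```python
import torch

def meu_prediction(candidates):
    """Pick the candidate minimising the summed task loss to all other
    candidates (equation (1)), with the Euclidean norm as task loss."""
    dists = torch.cdist(candidates, candidates)      # pairwise losses, (K, K)
    return candidates[dists.sum(dim=1).argmin()]     # row sums ~ expected loss
```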
**Loss functions.** As we employ standard evaluation metrics based on the Euclidean norm, we train with the Euclidean norm, that is, $\Delta_{\text{training}}(y, y') = \|y - y'\|_2^\beta$ with $\beta = 1$. When $\gamma = \frac{1}{2}$, our objective function coincides with ProbLoss.

**Architecture.** The novelty of DISCO Nets resides in their objective function: they do not require the use of a specific network architecture. This allows us to design a simple network architecture inspired by Oberweger et al. [18], shown in Figure 2 (a code sketch is given at the end of this subsection). The input depth image $x$ is fed to 2 convolutional layers, each having 8 filters with kernels of size $5 \times 5$ and stride 1, followed by Rectified Linear Units (ReLUs) and Max Pooling layers of kernel size $3 \times 3$. A third and last convolutional layer has 8 filters with kernels of size $5 \times 5$ and stride 1, followed by a ReLU. The output of the convolution is concatenated with the random noise vector $z$ of size $d_z = 200$, drawn from a uniform distribution on $[-1, 1]$. The result of the concatenation is fed to 2 dense layers of output size 1024, with ReLUs, and a third dense layer that outputs the candidate pose $y \in \mathbb{R}^{3J}$. For the non-probabilistic BASE$_{\beta,\sigma}$ model, no noise is concatenated, since only a pointwise estimate is produced.

**Training.** We use 10,000 examples from the 72,757 training frames to construct a validation dataset, and train only on the remaining 62,757 examples. Back-propagation is used with Stochastic Gradient Descent with a batch size of 256. The learning rate is fixed to $\lambda = 0.01$ and we use a momentum of $m = 0.9$ (see Polyak [20]). We also add L2-regularisation controlled by the parameter $C$. We use $C \in \{0.0001, 0.001, 0.01\}$, which is a relevant range as the comparative model BASE$_\beta$ performs best with $C = 0.001$. Note that DISCO Nets report consistent performance across the different values of $C$, contrary to BASE$_\beta$. We use 3 different random seeds to initialise the network parameters of each model, and report the performance of each model with its best cross-validated seed and $C$. We train all models for 400 epochs, as this results in a change of less than 3% in the value of the loss on the validation dataset for BASE$_\beta$. We refer the reader to the supplementary material for details of the setting.
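The architecture paragraph above translates roughly into the following sketch (ours, for illustration; the text does not specify padding, pooling stride or the flattened feature size, so those are assumptions, with the flattened size inferred lazily):

```python
import torch
import torch.nn as nn

class HandPoseDiscoNet(nn.Module):
    """Sketch of the Section 4 architecture. Padding and pooling stride are
    not given in the text; the choices below are assumptions."""
    def __init__(self, d_z=200, n_joints=14):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3),
            nn.Conv2d(8, 8, kernel_size=5, stride=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3),
            nn.Conv2d(8, 8, kernel_size=5, stride=1), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.LazyLinear(1024), nn.ReLU(),   # input size inferred at first call
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * n_joints),    # candidate pose y in R^{3J}
        )
        self.d_z = d_z

    def forward(self, depth, z=None):
        feat = self.conv(depth).flatten(start_dim=1)
        if z is None:                         # fresh uniform noise in [-1, 1]
            z = torch.rand(depth.shape[0], self.d_z) * 2 - 1
        return self.dense(torch.cat([feat, z], dim=1))

net = HandPoseDiscoNet()
poses = net(torch.randn(4, 1, 128, 128))      # 4 depth images -> 4 candidate poses
```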
**Table 2:** Metric values on the test set ± SEM. Best performance in bold.

| Model | ProbLoss (mm) | MeJEE (mm) | MaJEE (mm) | FF (80mm) |
|---|---|---|---|---|
| BASE$_{\beta=1,\sigma=1}$ | 103.8 ± 0.627 | 25.2 ± 0.152 | 52.7 ± 0.290 | 86.040 |
| BASE$_{\beta=1,\sigma=5}$ | 99.3 ± 0.620 | 25.5 ± 0.151 | 52.9 ± 0.289 | 85.773 |
| BASE$_{\beta=1,\sigma=10}$ | 96.3 ± 0.612 | 25.7 ± 0.149 | 53.2 ± 0.288 | 85.664 |
| DISCO$_{\beta=1,\gamma=0}$ | 92.9 ± 0.533 | 21.6 ± 0.128 | 46.0 ± 0.251 | 92.971 |
| DISCO$_{\beta=1,\gamma=0.25}$ | 89.9 ± 0.510 | 21.2 ± 0.122 | 46.4 ± 0.252 | 93.262 |
| DISCO$_{\beta=1,\gamma=0.5}$ | **83.8 ± 0.503** | **20.9 ± 0.124** | **45.1 ± 0.246** | **94.438** |

### 4.2 Results

**Quantitative Evaluation.** Table 2 reports performance on the test dataset, with parameters cross-validated on the validation set. All versions of the DISCO Nets model outperform the BASE$_\beta$ model. Among the different values of $\gamma$, we see that $\gamma = 0.5$ best captures the true distribution (lowest ProbLoss) while retaining accurate performance on the standard pointwise metrics. Interestingly, using an all-zero noise at test time gives similar performance on the pointwise metrics. We link this to the observation that both the MEAN and the MEU methods perform equivalently on these metrics (see supplementary material).

**Qualitative Evaluation.** In Figure 3 we show candidate poses generated by DISCO$_{\beta=1,\gamma=0.5}$ for 3 test examples. The left image shows the input depth image, and the right image shows the ground-truth pose (in grey) with 100 candidate outputs (superimposed in transparent red). The model predicts the joint locations and we interpolate the joints with edges. If an edge is thinner and more opaque, the different predictions overlap and the uncertainty on the location of the edge's joints is low. We can see that DISCO$_{\beta=1,\gamma=0.5}$ captures relevant information on the structure of the hand.

*Figure 3: Visualisation of DISCO$_{\beta=1,\gamma=0.5}$ predictions for 3 examples from the testing dataset. (a) When there are no occlusions, DISCO Nets model low uncertainty on all joints. (b) When the hand is half-fisted, DISCO Nets model the uncertainty on the location of the fingertips. (c) Here the fingertips of all fingers but the forefinger are occluded, and DISCO Nets model high uncertainty on them. The left image shows the input depth image, and the right image shows the ground-truth pose in grey with 100 candidate outputs superimposed in transparent red. Best viewed in color.*

Figure 4 shows the matrices of Pearson product-moment correlation coefficients between joints. We note that the DISCO Net with $\gamma = 0.5$ better captures the correlations between the joints of a finger and between the fingers.

*Figure 4: Pearson coefficient matrices of the joints: Palm (no value, as the empirical variance is null), Palm Right, Palm Left, Thumb Root, Thumb Mid, Thumb Tip, Index Mid, Index Tip, Middle Mid, Middle Tip, Ring Mid, Ring Tip, Pinky Mid, Pinky Tip.*

### 4.3 Comparison with existing probabilistic models

To the best of our knowledge, the conditional Generative Adversarial Network (cGAN) of Mirza and Osindero [16] has not been applied to pose estimation. In order to compare cGAN to DISCO Nets, several issues must be overcome. First, we must design a network architecture for the Discriminator. This is a first disadvantage of cGAN compared to DISCO Nets, which require no adversary. Second, as mentioned in Goodfellow et al. [7] and Radford et al. [22], GAN (and thus cGAN) require very careful design of the network architecture and training procedure. In order to make a fair comparison, we followed the work in Mirza and Osindero [16] and the practical advice for GAN presented in Larsen and Sønderby [13]. We try (i) cGAN, initialising all layers of $D$ and $G$ randomly, and (ii) cGAN$_{\text{init, fixed}}$, initialising the convolutional layers of $G$ and $D$ with those of the trained best-performing DISCO$_{\beta=1,\gamma=0.5}$ of Section 4.2 and keeping these layers fixed. That is, the convolutional parts of $G$ and $D$ are fixed feature extractors for the depth image. This is a setting similar to the one employed for tag annotation of images in Mirza and Osindero [16]. Details on the setting can be found in the supplementary material.
**Table 3:** Metric values on the test set ± SEM for cGAN.

| Model | ProbLoss (mm) | MeJEE (mm) | MaJEE (mm) | FF (80mm) |
|---|---|---|---|---|
| cGAN | 442.7 ± 0.513 | 109.8 ± 0.128 | 201.4 ± 0.320 | 0.000 |
| cGAN$_{\text{init, fixed}}$ | 128.9 ± 0.480 | 31.8 ± 0.117 | 64.3 ± 0.230 | 78.454 |

Table 3 shows that the cGAN model obtains relevant results only when the convolutional layers of $G$ and $D$ are initialised with our trained model and kept fixed, that is, cGAN$_{\text{init, fixed}}$. These results are still worse than the DISCO Nets performance. While there may be a better architecture for cGAN, our experiments demonstrate the difficulty of training cGAN compared to DISCO Nets.

### 4.4 Reference state-of-the-art values

We train the best-performing DISCO$_{\beta=1,\gamma=0.5}$ of Section 4.2 on the entire dataset, and compare its performance with state-of-the-art methods in Table 4 and Figure 5. These state-of-the-art methods are specifically designed for hand pose estimation. In Oberweger et al. [17], a constrained prior hand model, referred to as NYU-Prior, is refined on each hand joint position to increase accuracy, giving NYU-Prior-Refined. In Oberweger et al. [18], the input depth image is fed to a first network, NYU-Init, that outputs a pose used to synthesise an image with a second network. The synthesised image is used together with the input depth image to derive a pose update. We refer to the whole model as NYU-Feedback. By contrast, DISCO Nets use a single network whose architecture is similar to NYU-Prior (without constraining to a pose prior). By accurately modeling the distribution of the pose given the depth image, DISCO Nets obtain performance comparable to NYU-Prior and NYU-Prior-Refined. Without any extra effort, DISCO Nets could be embedded in the presented refinement and feedback methods, possibly boosting state-of-the-art performance.

**Table 4:** DISCO Nets compared to state-of-the-art performance ± SEM.

| Model | MeJEE (mm) | MaJEE (mm) | FF (80mm) |
|---|---|---|---|
| NYU-Prior | 20.7 ± 0.150 | 44.8 ± 0.289 | 91.190 |
| NYU-Prior-Refined | 19.7 ± 0.157 | 44.7 ± 0.327 | 88.148 |
| NYU-Init | 27.4 ± 0.152 | 55.4 ± 0.265 | 86.537 |
| NYU-Feedback | 16.0 ± 0.096 | 36.1 ± 0.208 | 97.334 |
| DISCO$_{\beta=1,\gamma=0.5}$ | 20.7 ± 0.121 | 45.1 ± 0.246 | 93.250 |

*Figure 5: Fractions of frames within distance $d$ in mm (by 5 mm). Best viewed in color.*

## 5 Discussion

We presented DISCO Nets, a new family of probabilistic models based on deep networks. DISCO Nets employ a prediction and training procedure based on the minimisation of a dissimilarity coefficient. Theoretically, this ensures that DISCO Nets accurately capture the uncertainty on the correct output to predict given an input. Experimental results on the task of hand pose estimation consistently support our theoretical hypothesis, as DISCO Nets outperform equivalent non-probabilistic models and existing probabilistic models. Furthermore, DISCO Nets can be tailored to the task to perform. This allows a possible user to train them to tackle different problems of interest. As their novelty resides mainly in their objective function, DISCO Nets do not require any specific architecture and can easily be applied to new problems. We contemplate several directions for future work. First, we will apply DISCO Nets to other prediction problems where there is uncertainty on the output. Second, we would like to extend DISCO Nets to latent variable models, allowing us to apply DISCO Nets to diverse datasets where ground-truth annotations are missing or incomplete.

## 6 Acknowledgements

This work is funded by the Microsoft Research PhD Scholarship Programme. We would like to thank Pankaj Pansari, Leonard Berrada and Ondra Miksik for their useful discussions and insights.

## References

[1] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.

[2] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
[3] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR, 2013.

[4] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, 2014.

[5] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 2007.

[6] T. Gneiting, L. I. Stanberry, E. P. Grimit, L. Held, and N. A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 2008.

[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

[8] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample problem. In NIPS, 2007.

[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. JMLR, 2012.

[10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[11] M. P. Kumar, B. Packer, and D. Koller. Modeling latent variable uncertainty for loss-based learning. In ICML, 2012.

[12] S. Lacoste-Julien, F. Huszár, and Z. Ghahramani. Approximate inference for the loss-calibrated Bayesian. In AISTATS, 2011.

[13] A. B. L. Larsen and S. K. Sønderby. URL http://torch.ch/blog/2015/11/13/gan.html.

[14] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.

[15] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow. Adversarial autoencoders. ICLR Workshop, 2015.

[16] M. Mirza and S. Osindero. Conditional generative adversarial nets. In NIPS Deep Learning Workshop, 2014.

[17] M. Oberweger, P. Wohlhart, and V. Lepetit. Hands deep in deep learning for hand pose estimation. In Computer Vision Winter Workshop, 2015.

[18] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In ICCV, 2015.

[19] P. Pinson and J. Tastu. Discrimination ability of the Energy score. 2013.

[20] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. 1964.

[21] V. Premachandran, D. Tarlow, and D. Batra. Empirical minimum Bayes risk prediction: How to extract an extra few % performance from vision models with just three more parameters. In CVPR, 2014.

[22] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2015.

[23] C. R. Rao. Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21(1):24-43, 1982.

[24] S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiele. Generative adversarial text to image synthesis. In ICML, 2016.

[25] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2016.

[26] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In CVPR, 2012.

[27] J. Tompson, M. Stein, Y. LeCun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 2014.

[28] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. 2016.
[29] E. Zawadzki and S. Lahaie. Nonparametric scoring rules. In AAAI, 2015.