# Unsupervised Transformation Learning via Convex Relaxations

Tatsunori B. Hashimoto, John C. Duchi, Percy Liang
Stanford University, Stanford, CA 94305
{thashim,jduchi,pliang}@cs.stanford.edu

Our goal is to extract meaningful transformations from raw images, such as varying the thickness of lines in handwriting or the lighting in a portrait. We propose an unsupervised approach to learn such transformations by attempting to reconstruct an image from a linear combination of transformations of its nearest neighbors. On handwritten digits and celebrity portraits, we show that even with linear transformations, our method generates visually high-quality modified images. Moreover, since our method is semiparametric and does not model the data distribution, the learned transformations extrapolate off the training data and can be applied to new types of images.

## 1 Introduction

Transformations (e.g., rotating or varying the thickness of a handwritten digit) capture important invariances in data, which can be useful for dimensionality reduction [7], improving generative models through data augmentation [2], and removing nuisance variables in discriminative tasks [3]. However, current methods for learning transformations have two limitations. First, they rely on explicit transformation pairs as supervision, for example, pairs of image patches undergoing rotation [12]. Second, improvements in transformation learning have focused on problems with known transformation classes, such as orthogonal or rotational groups [3, 4], while algorithms for general transformations require solving a difficult, nonconvex objective [12].

To tackle these challenges, we propose a semiparametric approach for unsupervised transformation learning. Specifically, given data points $x_1, \ldots, x_n$, we find $K$ linear transformations $A_1, \ldots, A_K$ such that the vector from each $x_i$ to its nearest neighbor lies near the span of $A_1 x_i, \ldots, A_K x_i$. The idea of using nearest neighbors for unsupervised learning has been explored in manifold learning [1, 7], but unlike these approaches and more recent work on representation learning [2, 13], we do not seek to model the full data distribution. Thus, even with relatively few parameters, the transformations we learn naturally extrapolate off the training distribution and can be applied to novel types of points (e.g., new types of images).

Our contribution is to express transformation matrices as a sum of rank-one matrices based on samples of the data. This new objective is convex, thus avoiding local minima (which we show to be a problem in practice); it scales to real-world problems beyond the $10 \times 10$ image patches considered in past work; and it allows us to derive disentangled transformations through a trace norm penalty. Empirically, we show our method is fast and effective at recovering known disentangled transformations, improving on past baseline methods based on gradient descent and expectation maximization [11]. On the handwritten digits (MNIST) and celebrity faces (CelebA) datasets, our method finds interpretable and disentangled transformations: for handwritten digits, the thickness of lines and the size of loops in digits such as 0 and 9; for celebrity faces, the degree of a smile.

## 2 Problem statement

Given a data point $x \in \mathbb{R}^d$ (e.g., an image) and a strength scalar $t \in \mathbb{R}$, a transformation is a smooth function $f : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$.
For example, $f(x, t)$ may be the image $x$ rotated by angle $t$. For a collection $\{f_k\}_{k=1}^K$ of transformations, we consider entangled transformations, defined for a vector of strengths $t \in \mathbb{R}^K$ by $f(x, t) := \sum_{k=1}^K f_k(x, t_k)$. We consider the problem of estimating a collection of transformations $f^* := \sum_{k=1}^K f_k^*$ given random observations as follows: let $p_X$ be a distribution on points $\bar{x}$ and $p_T$ a distribution on transformation strength vectors $t \in \mathbb{R}^K$, where the components $t_k$ are independent under $p_T$. Then for $\bar{x}_i \overset{\text{iid}}{\sim} p_X$ and $t_i \overset{\text{iid}}{\sim} p_T$, $i = 1, \ldots, n$, we observe the transformed points $x_i = f^*(\bar{x}_i, t_i)$, while $\bar{x}_i$ and $t_i$ are unobserved. Our goal is to estimate the $K$ functions $f_1^*, \ldots, f_K^*$.

### 2.1 Learning transformations based on matrix Lie groups

In this paper, we consider the subset of generic transformations defined via matrix Lie groups. These are natural as they map $\mathbb{R}^d \to \mathbb{R}^d$ and form a family of invertible transformations that we can parameterize by an exponential map. We begin with a simple example (rotation of points in two dimensions) to establish the idea of the exponential map and its linear approximation, and then use these linear approximations for transformation learning.

A matrix Lie group is a set of invertible matrices closed under multiplication and inversion. In the example of rotation in two dimensions, the set of all rotations is parameterized by the angle $\theta$, and any rotation by $\theta$ has representation
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$
The rotation matrices form a Lie group, as $R_\theta R_{-\theta} = I$ and the rotations are closed under composition.

**Linear approximation.** In our context, the important property of matrix Lie groups is that transformations near the identity have local linear approximations (tangent spaces, the associated Lie algebra), and these local linearizations map back into the Lie group via the exponential map [9]. As a simple example, consider the rotation $R_\theta$, which satisfies $R_\theta = I + \theta A + O(\theta^2)$, where
$$A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix},$$
and $R_\theta = \exp(\theta A)$ for all $\theta$ (here $\exp$ is the matrix exponential). The infinitesimal structure of Lie groups means that such relationships hold more generally through the exponential map: for any matrix Lie group $G \subset \mathbb{R}^{d \times d}$, there exists $\varepsilon > 0$ such that for all $R \in G$ with $\|R - I\| \le \varepsilon$, there is an $A \in \mathbb{R}^{d \times d}$ such that $R = \exp(A) = I + \sum_{m \ge 1} A^m / m!$. In the case that $G$ is a one-dimensional Lie group, we have more: for each $R$ near $I$, there is a $t \in \mathbb{R}$ satisfying $R = \exp(t A)$. The matrix $t A = \log R$ in the exponential map is the derivative of our transformation (as $A \approx (R - I)/t$ for $\|R - I\|$ small) and is analogous to locally linear neighborhoods in manifold learning [10]. The exponential map states that for transformations close to the identity, a linear approximation is accurate.

For any matrix $A$, we can also generate a collection of associated one-dimensional manifolds as follows: letting $x \in \mathbb{R}^d$, the set $M_x = \{\exp(t A) x \mid t \in \mathbb{R}\}$ is a manifold containing $x$. Given two nearby points $x_t = \exp(t A) x$ and $x_s = \exp(s A) x$, the local linearity of the exponential map shows that
$$x_t = \exp((t - s) A) x_s = x_s + (t - s) A x_s + O((t - s)^2) \approx x_s + (t - s) A x_s. \quad (1)$$

**Single transformation learning.** The approximation (1) suggests a learning algorithm for finding a transformation from points on a one-dimensional manifold $M$: given points $x_1, \ldots, x_n$ sampled from $M$, pair each point $x_i$ with its nearest neighbor $\tilde{x}_i$. Then we attempt to learn a transformation matrix $A$ satisfying $\tilde{x}_i \approx x_i + t_i A x_i$ for some small $t_i$ for each of these nearest neighbor pairs.
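As a concrete illustration of the approximation (1), the following minimal sketch (assuming numpy and scipy are available; it is illustrative and not part of the released code) compares $\exp(tA)x$ with its first-order approximation for the 2-D rotation generator:

```python
# Numerical check of the linear approximation (1) for the rotation generator A.
# Assumes numpy and scipy; illustrative only.
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],
              [1.0,  0.0]])       # generator of 2-D rotations (Lie algebra element)
x = np.array([2.0, 1.0])          # an arbitrary point in R^2

for t in [0.5, 0.1, 0.01]:
    exact = expm(t * A) @ x       # point moved along the manifold M_x = {exp(tA)x}
    linear = x + t * A @ x        # first-order approximation from Eq. (1)
    print(t, np.linalg.norm(exact - linear))   # error shrinks like O(t^2)
```

The quadratic decay of the error is what makes fitting nearest-neighbor pairs reasonable: such pairs correspond to small $t$ by construction.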
As nearest neighbor distances $\|\tilde{x}_i - x_i\| \to 0$ when $n \to \infty$ [6], the linear approximation (1) eventually holds. For a one-dimensional manifold and transformation, we could then solve the problem
$$\underset{\{t_i\},\, A}{\text{minimize}} \; \sum_{i=1}^n \left\| t_i A x_i - (\tilde{x}_i - x_i) \right\|_2^2. \quad (2)$$
If, instead of using nearest neighbors, the pairs $(x_i, \tilde{x}_i)$ were given directly as supervision, this objective would be a form of first-order matrix Lie group learning [12].

**Sampling and extrapolation.** The learning problem (2) is semiparametric: our goal is to learn a transformation matrix $A$ while treating the density of points $x$ as a nonparametric nuisance variable. By modeling only the differences between nearby $(x, \tilde{x})$ pairs, we avoid having to specify the density of $x$, which yields two advantages: first, the parametric nature of the model means that the transformation $A$ is defined beyond the support of the training data; and second, by not modeling the full density of $x$, we can learn the transformation $A$ even when the data comes from highly non-smooth distributions with arbitrary cluster structure.

## 3 Convex learning of transformations

The problem (2) makes sense only for one-dimensional manifolds without superposition of transformations, so we now extend these ideas (the exponential map and its linear approximation) to a full matrix Lie group learning problem. We derive a natural objective function for this problem and provide a few theoretical results about it.

### 3.1 Problem setup

As real-world data contains multiple degrees of freedom, we learn several one-dimensional transformations, giving us the following multiple Lie group learning problem:

**Definition 3.1.** Given data $x_1, \ldots, x_n \in \mathbb{R}^d$ with $\tilde{x}_i \in \mathbb{R}^d$ the nearest neighbor of $x_i$, the nonconvex transformation learning objective is
$$\underset{t \in \mathbb{R}^{n \times K},\; A_1, \ldots, A_K \in \mathbb{R}^{d \times d}}{\text{minimize}} \; \sum_{i=1}^n \Big\| \sum_{k=1}^K t_{ik} A_k x_i - (\tilde{x}_i - x_i) \Big\|_2^2. \quad (3)$$

This problem is nonconvex, and prior authors have commented on the difficulty of optimizing similar objectives [11, 14]. To avoid this difficulty, we construct a convex relaxation. Define a matrix $Z \in \mathbb{R}^{n \times d^2}$, where row $Z_i$ is an unrolling of the transformation that approximately takes $x_i$ to $\tilde{x}_i$. Then Eq. (3) can be written as
$$\min_{\operatorname{rank}(Z) = K} \; \sum_{i=1}^n \left\| \operatorname{mat}(Z_i) x_i - (\tilde{x}_i - x_i) \right\|_2^2, \quad (4)$$
where $\operatorname{mat} : \mathbb{R}^{d^2} \to \mathbb{R}^{d \times d}$ is the matricization operator. Note that the rank of $Z$ is at most $K$, the number of transformations. We then relax the rank constraint to a trace norm penalty:
$$\sum_{i=1}^n \left\| \operatorname{mat}(Z_i) x_i - (\tilde{x}_i - x_i) \right\|_2^2 + \lambda \|Z\|_*. \quad (5)$$

However, the matrix $Z \in \mathbb{R}^{n \times d^2}$ is too large to handle for real-world problems. Therefore, we approximate the objective by modeling the transformation matrices as weighted sums of observed transformation pairs. This idea of using sampled pairs is similar to a kernel method: we will show that the true transformation matrices $A_k^*$ can be written as a linear combination of rank-one matrices $(\tilde{x}_i - x_i) x_i^\top$. (Section 9 of the supplemental material introduces a kernelized version that extends this idea to general manifolds.)

As intuition, assume that we are given a single point $x_i \in \mathbb{R}^d$ and $\tilde{x}_i = t_i A^* x_i + x_i$, where $t_i \in \mathbb{R}$ is unobserved. If we approximate $A^*$ via the rank-one matrix $A = (\tilde{x}_i - x_i) x_i^\top / \|x_i\|_2^2$, then $A x_i + x_i = \tilde{x}_i$. This shows that $A$ captures the behavior of $A^*$ on the single point $x_i$. By sampling sufficiently many examples and appropriately weighting each one, we can construct an accurate approximation over all points.
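The rank-one intuition above is easy to verify numerically; a minimal sketch (assuming numpy; the variable names are illustrative):

```python
# Check of the rank-one intuition: A = (x_tilde - x) x^T / ||x||_2^2 reproduces the
# action of the unknown A* on the single point x. Assumes numpy; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A_star = rng.normal(size=(d, d))      # unknown true transformation
x = rng.normal(size=d)
t = 0.05                              # unobserved strength
x_tilde = t * A_star @ x + x          # observed neighbor

A_hat = np.outer(x_tilde - x, x) / np.dot(x, x)   # rank-one approximation
print(np.allclose(A_hat @ x + x, x_tilde))        # True: exact on this single point
```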
Let us subsample $x_1, \ldots, x_r$ (WLOG, these are the first $r$ points). Given these samples, we write a transformation $A$ as a weighted sum of $r$ rank-one matrices $(\tilde{x}_j - x_j) x_j^\top$ with weights $\alpha \in \mathbb{R}^{n \times r}$, and optimize these weights:
$$\underset{\alpha \in \mathbb{R}^{n \times r}}{\text{minimize}} \; \sum_{i=1}^n \Big\| \sum_{j=1}^r \alpha_{ij} (\tilde{x}_j - x_j) x_j^\top x_i - (\tilde{x}_i - x_i) \Big\|_2^2 + \lambda \|\alpha\|_*. \quad (6)$$
Next we show that with high probability, the weighted sum of $O(K^2 d)$ samples is close in operator norm to the true transformation matrix $A^*$ (Lemma 3.2 and Theorem 3.3).

### 3.2 Learning one transformation via subsampling

We begin with the intuition behind the sampling-based objective in the one-transformation case. The correctness of rank-one reconstruction is obvious in the special case where the number of samples is $r = d$ and for each $i$ we have $x_i = e_i$, the $i$-th canonical basis vector. In this case $\tilde{x}_i = t_i A^* e_i + e_i$ for some unknown $t_i \in \mathbb{R}$, so we can reconstruct $A^*$ with a weighted combination of rank-one samples as $A^* = \sum_i A^* e_i e_i^\top = \sum_i \alpha_i (\tilde{x}_i - x_i) x_i^\top$ with $\alpha_i = t_i^{-1}$.

In the general case, we observe the effects of $A^*$ on a non-orthogonal set of vectors $x_1, \ldots, x_r$ as $\tilde{x}_i - x_i = t_i A^* x_i$. A similar argument follows by changing basis so that $t_i x_i$ becomes the $i$-th canonical basis vector and reconstructing $A^*$ in this new basis. The change-of-basis matrix is $\Sigma^{-1/2}$, where $\Sigma = \sum_{i=1}^r x_i x_i^\top / r$.

The lemma below makes this intuition precise: given $r \ge d$ samples, there exist weights $\alpha \in \mathbb{R}^r$ such that $A^* = \sum_i \alpha_i (\tilde{x}_i - x_i) x_i^\top \Sigma^{-1}$, where $\Sigma$ is the inner product matrix above. This justifies the objective in Eq. (6), since we can whiten $x$ to ensure $\Sigma = I$, and there exist weights $\alpha_{ij}$ which minimize the objective by reconstructing $A^*$.

**Lemma 3.2.** Let $x_1, \ldots, x_r$ be drawn i.i.d. from a density with full-rank covariance, and let the neighboring points $\tilde{x}_1, \ldots, \tilde{x}_r$ be defined by $\tilde{x}_i = t_i A^* x_i + x_i$ for some unknown $t_i \neq 0$ and $A^* \in \mathbb{R}^{d \times d}$. If $r \ge d$, then there exist weights $\alpha \in \mathbb{R}^r$ which recover the unknown $A^*$ as
$$A^* = \sum_{i=1}^r \alpha_i (\tilde{x}_i - x_i) x_i^\top \Sigma^{-1}, \quad \text{where } \alpha_i = \frac{1}{r t_i} \text{ and } \Sigma = \frac{1}{r} \sum_{i=1}^r x_i x_i^\top.$$

*Proof.* The identity $\tilde{x}_i = t_i A^* x_i + x_i$ implies $t_i (\Sigma^{-1/2} A^* \Sigma^{1/2}) \Sigma^{-1/2} x_i = \Sigma^{-1/2} (\tilde{x}_i - x_i)$. Summing both sides with weights $\alpha_i$ and multiplying on the right by $x_i^\top \Sigma^{-1/2}$ yields
$$\sum_{i=1}^r \alpha_i \Sigma^{-1/2} (\tilde{x}_i - x_i) x_i^\top \Sigma^{-1/2} = \sum_{i=1}^r \alpha_i t_i (\Sigma^{-1/2} A^* \Sigma^{1/2}) \Sigma^{-1/2} x_i x_i^\top \Sigma^{-1/2} = \Sigma^{-1/2} A^* \Sigma^{1/2} \Big( \sum_{i=1}^r \alpha_i t_i \Sigma^{-1/2} x_i x_i^\top \Sigma^{-1/2} \Big).$$
By construction of $\Sigma^{-1/2}$ and $\alpha_i = 1/(t_i r)$, we have $\sum_{i=1}^r \alpha_i t_i \Sigma^{-1/2} x_i x_i^\top \Sigma^{-1/2} = \Sigma^{-1/2} \big( \sum_{i=1}^r x_i x_i^\top / r \big) \Sigma^{-1/2} = I$. Therefore $\sum_{i=1}^r \alpha_i \Sigma^{-1/2} (\tilde{x}_i - x_i) x_i^\top \Sigma^{-1/2} = \Sigma^{-1/2} A^* \Sigma^{1/2}$. When $x$ spans $\mathbb{R}^d$, $\Sigma^{-1/2}$ is both invertible and symmetric, and multiplying on the left by $\Sigma^{1/2}$ and on the right by $\Sigma^{-1/2}$ gives the lemma statement.

### 3.3 Learning multiple transformations

With multiple transformations, recovering any single transformation matrix $A_k^*$ is ambiguous: given transformations $A_1^*$ and $A_2^*$, the matrices $A_1^* + A_2^*$ and $A_1^* - A_2^*$ both locally generate the same family of transformations. We refer to transformations $A \in \mathbb{R}^{K \times d \times d}$ and strengths $t \in \mathbb{R}^{n \times K}$ as disentangled if $t^\top t / r = \sigma^2 I$ for a scalar $\sigma^2 > 0$. This criterion implies that the activation strengths are uncorrelated across the observed data. We will later show in Section 3.4 that this definition of disentangling captures our intuition, has a closed-form estimate, and is closely connected to our optimization problem.

We show a result analogous to the one-transformation case (Lemma 3.2): given $r > K^2$ samples, we can find weights $\alpha \in \mathbb{R}^{r \times K}$ which reconstruct any of the $K$ disentangled transformation matrices $A_k^*$ as $A_k^* \approx A_k = \sum_{i=1}^r \alpha_{ik} (\tilde{x}_i - x_i) x_i^\top$. This implies that minimization over $\alpha$ leads to estimates of $A^*$. In contrast to Lemma 3.2, the multiple-transformation recovery guarantee is probabilistic and inexact, because each summand $(\tilde{x}_i - x_i) x_i^\top$ contains effects from all $K$ transformations, and no weighting scheme exactly isolates the effects of a single transformation $A_k^*$. Instead, we utilize the randomness in $t$ to estimate $A_k^*$ by approximately canceling the contributions from the $K - 1$ other transformations.
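Before stating the multiple-transformation guarantee, Lemma 3.2 itself is easy to check numerically on synthetic data; a minimal sketch (assuming numpy; the weights below use the generated strengths $t_i$, which are of course unobserved in the actual learning problem):

```python
# Numerical check of Lemma 3.2: with alpha_i = 1/(r t_i) and Sigma = sum_i x_i x_i^T / r,
# the weighted rank-one sum times Sigma^{-1} recovers A* exactly. Assumes numpy.
import numpy as np

rng = np.random.default_rng(1)
d, r = 4, 50
A_star = rng.normal(size=(d, d))
X = rng.normal(size=(r, d))                 # rows are x_1, ..., x_r
t = rng.uniform(0.01, 0.1, size=r)          # nonzero strengths (unknown in practice)
X_tilde = X + t[:, None] * (X @ A_star.T)   # x_tilde_i = t_i A* x_i + x_i

Sigma = X.T @ X / r
alpha = 1.0 / (r * t)
A_hat = sum(alpha[i] * np.outer(X_tilde[i] - X[i], X[i]) for i in range(r))
A_hat = A_hat @ np.linalg.inv(Sigma)
print(np.allclose(A_hat, A_star))           # True
```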
**Theorem 3.3.** Let $x_1, \ldots, x_r \in \mathbb{R}^d$ be i.i.d. isotropic random variables, and for each $k \in [K]$ let $t_{1,k}, \ldots, t_{r,k} \in \mathbb{R}$ be i.i.d. draws from a symmetric random variable with $t^\top t / r = \sigma^2 I$, $|t_{ik}| < C_1$, and $\|x_i\|_2 < C_2$ with probability one. Given $x_1, \ldots, x_r$ and neighbors $\tilde{x}_1, \ldots, \tilde{x}_r$ defined as $\tilde{x}_i = \sum_{k=1}^K t_{ik} A_k^* x_i + x_i$ for some $A_k^* \in \mathbb{R}^{d \times d}$, there exists $\alpha \in \mathbb{R}^{r \times K}$ such that for all $k \in [K]$,
$$P\left( \Big\| \sum_{i=1}^r \alpha_{ik} (\tilde{x}_i - x_i) x_i^\top - A_k^* \Big\|_2 \ge \varepsilon \sup_k \|A_k^*\|_2 \right) \le 2K^2 \exp\!\left( \frac{-\,r \varepsilon^2}{2 C_1^2 C_2^2 \big(1 + K^{-1} \sup_k \|A_k^*\|_1 \, \varepsilon\big)} \right).$$

*Proof.* We give a proof sketch and defer the details to the supplement (Section 7). We claim that for any $k$, $\alpha_{ik} = t_{ik} / (\sigma^2 r)$ satisfies the theorem statement. Following the one-dimensional case, we can expand the outer product in terms of the transformations $A^*$ as
$$\sum_{i=1}^r \alpha_{ik} (\tilde{x}_i - x_i) x_i^\top = \sum_{k'=1}^K A_{k'}^* \sum_{i=1}^r \alpha_{ik} t_{ik'} \, x_i x_i^\top.$$
As before, we must control the inner terms $Z_{kk'} = \sum_{i=1}^r \alpha_{ik} t_{ik'} \, x_i x_i^\top$: we want $Z_{kk'}$ to be close to the identity when $k' = k$ and near zero when $k' \neq k$. Our choice of $\alpha_{ik} = t_{ik} / (\sigma^2 r)$ does this: if $k' \neq k$, the $\alpha_{ik} t_{ik'}$ are zero mean with random sign, so Rademacher concentration bounds place $Z_{kk'}$ near zero; if $k' = k$, Bernstein bounds show that $Z_{kk} \approx I$, since $\mathbb{E}[Z_{kk}] = I$.

### 3.4 Disentangling transformations

Given $K$ estimated transformations $A_1, \ldots, A_K \in \mathbb{R}^{d \times d}$ and strengths $t \in \mathbb{R}^{n \times K}$, any invertible matrix $W \in \mathbb{R}^{K \times K}$ can be used to form an equivalent family of transformations $\hat{A}_i = \sum_k W_{ik} A_k$ and $\hat{t}_{ik} = \sum_j W^{-1}_{kj} t_{ij}$. Despite this unidentifiability, there is a choice of $\hat{A}_1, \ldots, \hat{A}_K$ and $\hat{t}$ which is equivalent to $A_1, \ldots, A_K$ but disentangled, meaning that across the observed transformation pairs $\{(x_i, \tilde{x}_i)\}_{i=1}^n$, the strengths of any two transformations are uncorrelated: $\hat{t}^\top \hat{t} / n = I$. This is a necessary condition to capture the intuition that two disentangled transformations will have independent strength distributions. For example, given a set of images generated by changing lighting conditions and sharpness, we expect the sharpness of an image to be uncorrelated with the lighting condition.

Formally, we define a set of $\hat{A}$ such that $\hat{t}_j$ and $\hat{t}_i$ are uncorrelated over the observed data, and any pair of transformations $\hat{A}_i x$ and $\hat{A}_j x$ generate decorrelated outputs. In contrast to mutual-information-based approaches to finding disentangled representations, our approach only seeks to control second moments, but it enforces decorrelation both in the latent space ($t_{ik}$) and in the observed space ($\hat{A}_i x$).

**Theorem 3.4.** Given $A_k \in \mathbb{R}^{d \times d}$ and $t \in \mathbb{R}^{n \times K}$ with $\sum_i t_{ik} = 0$, define $Z = U S V^\top \in \mathbb{R}^{n \times d^2}$ as the SVD of the matrix $Z$ whose rows are $Z_i = \sum_{k=1}^K t_{ik} \operatorname{vec}(A_k)$. The transformations $\hat{A}_k = S_{k,k} \operatorname{mat}(V_k)$ and strengths $\hat{t}_{ik} = U_{ik}$ fulfill the following properties:

- $\sum_k \hat{t}_{ik} \hat{A}_k x_i = \sum_k t_{ik} A_k x_i$ (correct behavior),
- $\hat{t}^\top \hat{t} = I$ (uncorrelated in latent space),
- $\mathbb{E}[\langle \hat{A}_i X, \hat{A}_j X \rangle] = 0$ for any $i \neq j$ and random variable $X$ with $\mathbb{E}[X X^\top] = I$ (uncorrelated in observed space).

*Proof.* The first property follows since $Z$ is rank-$K$ by construction, and the rank-$K$ SVD preserves $\sum_k t_{ik} A_k$ exactly. The second property follows from the SVD, since $U^\top U = I$. The last property follows from $V^\top V = I$, which implies $\operatorname{tr}(\hat{A}_i^\top \hat{A}_j) = 0$ for $i \neq j$; by linearity of the trace, $\mathbb{E}[\langle \hat{A}_i X, \hat{A}_j X \rangle] = S_{i,i} S_{j,j} \operatorname{tr}(\operatorname{mat}(V_i) \operatorname{mat}(V_j)^\top) = 0$.

Interestingly, this SVD appears in both the convex and subsampling algorithms (Eq. 6) as part of the proximal step for the trace norm optimization. Thus the rank sparsity induced by the trace norm naturally favors a small number of disentangled transformations.
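A minimal sketch of the SVD disentangling step of Theorem 3.4 (assuming numpy; function and variable names are illustrative, not the released code):

```python
# SVD-based disentangling (Theorem 3.4), sketched with numpy. Given estimated
# transformations A[k] (d x d) and strengths t (n x K), return an equivalent
# family (A_hat, t_hat) whose strengths are uncorrelated across the data.
import numpy as np

def disentangle(A, t):
    K, d, _ = A.shape
    Z = t @ A.reshape(K, d * d)                       # row i is sum_k t_ik vec(A_k)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    U, S, Vt = U[:, :K], S[:K], Vt[:K]                # rank-K SVD (Z has rank <= K)
    A_hat = S[:, None, None] * Vt.reshape(K, d, d)    # A_hat_k = S_kk mat(V_k)
    t_hat = U                                         # t_hat_ik = U_ik
    return A_hat, t_hat

# Check the "correct behavior" and "uncorrelated in latent space" properties.
rng = np.random.default_rng(2)
K, d, n = 3, 4, 100
A = rng.normal(size=(K, d, d))
t = rng.normal(size=(n, K))
t -= t.mean(axis=0)                                   # theorem assumes centered strengths
A_hat, t_hat = disentangle(A, t)
same = np.allclose(np.einsum('ik,kab->iab', t_hat, A_hat),
                   np.einsum('ik,kab->iab', t, A))
print(same, np.allclose(t_hat.T @ t_hat, np.eye(K)))  # True True
```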
## 4 Experiments

We evaluate the effectiveness of our sampling-based convex relaxation for learning transformations in two ways. In Section 4.1, we check whether we can recover a known set of rotation and translation transformations applied to a downsampled celebrity face image dataset. In Section 4.2, we perform a qualitative evaluation of learning transformations over raw celebrity faces (CelebA) and MNIST digits, following recent evaluations of disentangling in adversarial networks [2].

### 4.1 Recovering known transformations

We validate our convex relaxation and sampling procedure by recovering known synthetic transformations, and we compare against existing approaches for learning linear transformations. Our experiment consists of recovering synthetic transformations applied to 50-image subsets of a downsampled (18 × 18) version of CelebA. The resolution and dataset size were restricted by the runtime of the baseline methods.

We compare two versions of our matrix Lie group learning algorithm against two baselines. For our method, we implement and compare convex relaxation with sampling (Eq. 6), and convex relaxation with sampling followed by gradient descent. The second method ensures that we obtain exactly the desired number of transformations K, since trace norm regularization cannot guarantee a fixed rank constraint. The full convex relaxation (Eq. 5) is not included here, since it is too slow to run on even the smallest of our experiments. As baselines, we compare to gradient descent with restarts on the nonconvex objective (Eq. 3) and the EM algorithm of Miao and Rao [11], run for 20 iterations and augmented with the SVD-based disentangling method (Theorem 3.4). These two methods represent the two classes of existing approaches to estimating general linear transformations from pairwise data [11].

Optimization for our methods and the gradient descent baseline uses minibatch proximal gradient descent with Adagrad [8], where the proximal step for trace norm penalties uses subsampling down to five thousand points and randomized SVD. All learned transformations were disentangled using the SVD method (Theorem 3.4) unless otherwise noted.
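The trace norm proximal step referenced above is singular value soft-thresholding; a dense minimal sketch (assuming numpy, and omitting the subsampling and randomized SVD used at scale; `grad_of_squared_loss` in the usage comment is a hypothetical helper):

```python
# Proximal operator of lam * ||.||_* (trace norm): soft-threshold the singular values.
# Dense version for illustration; the experiments subsample rows and use randomized SVD.
import numpy as np

def prox_trace_norm(M, lam):
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(S - lam, 0.0)) @ Vt        # shrink singular values toward zero

# One proximal-gradient step on the weights alpha of Eq. (6) would then look like
#   alpha = prox_trace_norm(alpha - step * grad_of_squared_loss(alpha), step * lam)
# where grad_of_squared_loss is the gradient of the squared reconstruction term.
```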
Figures 1a and 1b show the results of recovering a single horizontal translation transformation, with error measured in operator norm. Convex relaxation plus gradient descent (Convex+Gradient) achieves the same low error across all sampled 50-image subsets. Without the gradient descent, convex relaxation alone does not achieve low error, since the trace norm penalty does not produce exactly rank-one results. Gradient descent, on the other hand, gets stuck in local minima even with stepsize tuning and restarts, as indicated by the wide variance in error across runs. All methods outperform EM while using substantially less time.

Next, we test disentangling and multiple-transformation recovery for random rotations, horizontal translations, and vertical translations (Figure 1c). In this experiment, we apply the three types of transformations to the downsampled CelebA images, and evaluate the outputs by measuring the minimum-cost matching for the operator norm error between learned transformation matrices and the ground truth. Minimizing this metric requires recovering the true transformations up to label permutation. We find results consistent with the one-transform recovery case: convex relaxation with gradient descent outperforms the baselines. We additionally find SVD-based disentangling to be critical to recovering multiple transformations; removing SVD from the nonconvex gradient descent baseline leads to substantially worse results (Figure 1c).

Figure 1: Sampled convex relaxation with gradient descent achieves lower error on recovering a single known transformation (panel a: operator norm error for a single translation transform), runs faster than baselines (panel b), and recovers multiple disentangled transformations accurately (panel c: recovery using SVD-based disentangling).

### 4.2 Qualitative outputs

We now test convex relaxation with sampling on MNIST and celebrity faces. We show a subset of learned transformations here and include the full set in the supplemental Jupyter notebook.

On MNIST digits we trained a five-dimensional linear transformation model over a 20,000-example subset of the data, which took 10 minutes. The components extracted by our approach include coherent stylistic features identified by earlier work using neural networks [2], such as thickness and rotation, as well as some new transformations, such as loop size and blur. Examples of images generated from these learned transformations are shown in Figure 2: the center column is the original image, and all other images are generated by repeatedly applying the learned transformation matrices. We also found that the transformations can sometimes extrapolate to other handwritten symbols, such as Kannada handwriting [5] (last row, Figure 2).

Figure 2 (panels include thickness and loop size): Matrix transformations learned on MNIST (top rows) and extrapolating to Kannada handwriting (bottom row). The center column is the original digit; flanking columns are generated by applying the transformation matrix.

Finally, we visualize the learned transformations by summing the estimated transformation strength for each transformation across the minimum spanning tree on the observed data (see Section 9 of the supplement for details). This visualization demonstrates that the learned representation of the data captures the style of the digit, such as thickness and loop size, and ignores digit identity (Figure 3). This is a highly desirable trait, as it means we can extract continuous factors of variation such as digit thickness without explicitly specifying and removing cluster structure in the data.

Figure 3: Embedding of MNIST digits based on two transformations: thickness and loop size. The learned transformations extract continuous, stylistic features which apply across multiple clusters despite being given no cluster information.

In contrast to our method, many baseline methods inadvertently capture digit identity as part of the learned transformation. For example, the first component of PCA simply adds a zero to every image (Figure 4), while the first component of InfoGAN has higher fidelity in exchange for training instability, which often results in mixing digit identity and multiple transformations (Figure 4).

Figure 4 (panel b: InfoGAN): Baselines applied to the same MNIST data often entangle digit identity and style.
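The qualitative panels above are produced by repeatedly taking small steps along a learned transformation; a minimal sketch of that generation step (assuming numpy; the step size and count are illustrative):

```python
# Generate an image sequence by repeatedly applying a learned transformation A_k
# in small steps, x <- (I + s * A_k) x, approximating exp(m * s * A_k) x.
# A_k is one learned d x d matrix and x a flattened image; assumes numpy.
import numpy as np

def transform_sequence(A_k, x, step=0.1, n_steps=5):
    frames = [x.copy()]
    for _ in range(n_steps):
        x = x + step * (A_k @ x)      # first-order step along the one-parameter group
        frames.append(x.copy())
    return np.stack(frames)           # (n_steps + 1, d): original plus transformed images
```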
Finally, we apply our method to the celebrity faces dataset and find that we are able to extract high-level transformations using only linear models. We trained our model on a 1000-dimensional PCA projection of CelebA, constructed from the original 116,412 dimensions, with K = 20, and found both global scene transformations such as sharpness and contrast (Figure 5a) and higher-level transformations such as adding a smile (Figure 5b).

Figure 5 (a: contrast / sharpness; b: smiling / skin tone): Learned transformations for celebrity faces capture both simple (sharpness) and high-level (smiling) transformations. For each panel, the center column is the original image, and columns to the left and right were generated by repeatedly applying the learned transformation.

## 5 Related Work and Discussion

Learning transformation matrices, also known as Lie group learning, has a long history, with the closest work to ours being Miao and Rao [11] and Rao and Ruderman [12]. These earlier methods use a Taylor approximation to learn a set of small (under 10 × 10) transformation matrices given pairs of image patches undergoing a small transformation. In contrast, our work does not require supervision in the form of transformation pairs and provides a scalable new convex objective function.

There have been improvements to Rao and Ruderman [12] focused on removing the Taylor approximation in order to learn transformations from distant examples: Cohen and Welling [3, 4] learn commutative and 3D-rotation Lie groups under a strong assumption of uniform density over rotations, and Sohl-Dickstein et al. [14] learn commutative transformations generated by normal matrices using eigendecompositions and supervision in the form of successive 17 × 17 image patches in a video. Our work differs because we seek to learn multiple, general transformation matrices from large, high-dimensional datasets. Because of this difference, our algorithm focuses on scalability and avoiding local minima, at the expense of a less accurate first-order Taylor approximation. This approximation is reasonable, since we fit our model to nearest neighbor pairs, which are by definition close to each other. Empirically, we find that these approximations result in a scalable algorithm for unsupervised recovery of transformations.

Learning to transform between neighbors on a nonlinear manifold has been explored by Dollár et al. [7] and Bengio and Monperrus [1]. Both works model a manifold by predicting the linear neighborhoods around points using nonlinear functions (radial basis functions in Dollár et al. [7] and a one-layer neural net in Bengio and Monperrus [1]). In contrast to these methods, which begin with the goal of learning all manifolds, we focus on a class of linear transformations and treat the general manifold problem as a special kernelization. This has three benefits: first, we avoid the high model complexity necessary for general manifold learning; second, extrapolation beyond the training data follows explicitly from the linear parametric form of our model (e.g., from digits to Kannada); and finally, linearity leads to a definition of disentangling based on correlations and an SVD-based method for recovering disentangled representations.

In summary, we have presented an unsupervised approach for learning disentangled representations via linear Lie groups. We demonstrated that for image data, even a linear model is surprisingly effective at learning semantically meaningful transformations. Our results suggest that these semiparametric transformation models are promising for identifying semantically meaningful low-dimensional continuous structures in high-dimensional real-world data.
**Acknowledgements.** We thank Arun Chaganty for helpful discussions and comments. This work was supported by NSF-CAREER award 1553086, DARPA (Grant N66001-14-2-4055), and the DAIS ITA program (W911NF-16-3-0001).

**Reproducibility.** Code, data, and experiments can be found on CodaLab Worksheets (http://bit.ly/2Aj5tti).