# Unlabeled Principal Component Analysis

Yunzhen Yao, Liangzu Peng, Manolis C. Tsakiris
School of Information Science and Technology, ShanghaiTech University
{yaoyzh, penglz, mtsakiris}@shanghaitech.edu.cn

Abstract

We introduce robust principal component analysis from a data matrix in which the entries of its columns have been corrupted by permutations, termed Unlabeled Principal Component Analysis (UPCA). Using algebraic geometry, we establish that UPCA is a well-defined algebraic problem in the sense that the only matrices of minimal rank that agree with the given data are row-permutations of the ground-truth matrix, arising as the unique solutions of a polynomial system of equations. Further, we propose an efficient two-stage algorithmic pipeline for UPCA suitable for the practically relevant case where only a fraction of the data have been permuted. Stage-I employs outlier-robust PCA methods to estimate the ground-truth column-space. Equipped with the column-space, Stage-II applies recent methods for unlabeled sensing to restore the permuted data. Experiments on synthetic data, face images, educational and medical records reveal the potential of UPCA for applications such as data privatization and record linkage.

1 Introduction

1.1 Motivation

In principal component analysis, a cornerstone of machine learning and data science, one is given a data matrix $\tilde{X}$, assumed to be a corrupted version of a ground-truth data matrix $X^* = [x_1^* \cdots x_n^*] \in \mathbb{R}^{m \times n}$, typically but not necessarily assumed to have low rank, and the objective is to estimate $X^*$ or the column-space $\mathcal{S}^* \subseteq \mathbb{R}^m$ of $X^*$. The most common types of corruptions that have attracted interest in modern studies are additive sparse perturbations [6, 47], outlier data points that lie away from $\mathcal{S}^*$ [43, 40], and missing entries, the latter also known as low-rank matrix completion [7, 4].

Recently, permutations have been emerging as another type of data corruption, typically set in the context of linear regression, where the correspondences between the input and the output data have been partially distorted or even are entirely unavailable [38, 39, 22, 17, 30, 31, 46, 19, 41, 36]. There, one is given a point $x^*$ of a linear subspace $\mathcal{S}^*$, but only up to a permutation of its coordinates, say $\tilde{x} = \Pi^* x^*$ with $\Pi^*$ an unknown permutation, and the goal is to find $x^*$ from the data $\tilde{x}, \mathcal{S}^*$. This Unlabeled Sensing [38, 39] problem has many potential machine learning applications, e.g., record linkage [30, 31], visual [26, 27] or textual [5, 28, 29] permutation learning, and matching problems in neuroscience [21] and biology [1, 42].

While methods for unlabeled sensing rely on knowledge of the source subspace $\mathcal{S}^*$, this is not always known in practice. On the other hand, data of the form $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n] \in \mathbb{R}^{m \times n}$ with $\tilde{x}_j = \Pi_j^* x_j^*$ an unknown permutation of an unknown point $x_j^* \in \mathcal{S}^*$ are often available, thus raising the question of whether $\mathcal{S}^*$ can be estimated from $\tilde{X}$. An important example of this situation is record linkage [13, 20, 3], where the objective is to integrate data from independent sources, $\tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{R}^m$, for subsequent data analysis. Since the entries of different records $\tilde{x}_i$'s are collected separately, the data matrix $\tilde{X}$ is unlabeled in the sense that the entries of its $i$-th row do not necessarily correspond to the same entity.
Such unlabeled data $\tilde{X}$ also arise in the context of data privatization, where the data provider anonymizes the original data $X^*$ by permuting each column of $X^*$ prior to release [11, 16]. Data re-identification is a concern since companies with privacy policies, health care providers, and financial institutions may release the collected data after anonymization. Understanding the fundamental limits of re-identifying the original data $X^*$ from the released data $\tilde{X}$ is essential for striking a balance between data privacy and data preservation, the subject of a plenary talk at the 2019 International Conference on Machine Learning [2].

1.2 Contributions

In this paper we consider the recovery of $X^*$ from its unlabeled version $\tilde{X}$, which we term Unlabeled Principal Component Analysis (UPCA). We make the following contributions:

- We establish that as long as $r := \operatorname{rank}(X^*) < \min\{m, n\}$ and $X^*$ is generic (see Definition 1), then, up to a permutation of its rows, $X^*$ is the only matrix of rank less than or equal to $r$ that is compatible with $\tilde{X}$. This asserts that UPCA is a well-posed problem, since the inherent ambiguity of whether $\tilde{X}$ comes from $X^*$ or a row-permuted version of $X^*$ is in most cases practically harmless.

- We establish that in this basic formulation UPCA is a purely algebraic problem, by exhibiting a polynomial system of equations parametrized by $\tilde{X}$, whose solutions are all the row-permutations of $X^*$; solving the UPCA problem amounts to obtaining one such solution.

- Inasmuch as solving this polynomial system is in principle NP-hard, we introduce an efficient algorithmic pipeline for the practically relevant case where a significant part of the data have undergone the same permutation, while the rest of the points have been permuted arbitrarily; in the case of record linkage this would correspond to one of the records having much larger size than the others. The first stage of the pipeline employs PCA methods with robustness to outliers [43, 33, 25, 45, 37, 48, 18] to produce an estimate $\hat{\mathcal{S}}$ of $\mathcal{S}^*$ from $\tilde{X}$; the second stage of the pipeline uses unlabeled sensing methods [30, 32, 36, 23] to furnish an estimate $\hat{X}$ of $X^*$ from $\hat{\mathcal{S}}$ and $\tilde{X}$.

- We introduce a simple but efficient algorithm for unlabeled sensing based on least-squares with recursive filtration (Algorithm 2).

- We assess our algorithmic pipeline on synthetic data, face images, educational and medical records, with encouraging results.

1.3 Related work

Unlabeled sensing. There is a large literature on application-specific problems that involve lack of correspondences, e.g. in computer vision or statistics; here we only review four recent methods for unlabeled sensing that will be used in this paper. In unlabeled sensing one is given a subspace $\mathcal{S}^* \subseteq \mathbb{R}^m$ of dimension $r$ and a point $\tilde{x}$ which is some permuted version of a point $x^* \in \mathcal{S}^*$, and the goal is to recover $x^*$ from $\mathcal{S}^*$ and $\tilde{x}$. A critical distinction among methods in the literature is the sparsity level $\alpha$ of the permutation, that is, the ratio of coordinates that are moved by the permutation. The case of dense permutations ($\alpha = 1$) is extremely challenging, with existing methods only able to handle small ranks $r$. We consider two methods known to perform best in this regime. The first one is the algebraic-geometric AIEM method of [36]. This has linear complexity in $m$ and instead concentrates its effort on solving a polynomial system of $r$ equations in $r$ variables to produce an initialization for an expectation-maximization algorithm. Currently, this method is efficient for $r \le 5$ and intractable otherwise.
A very different method is CCV-Min of [23], which proceeds via branch-&-bound together with concave minimization and can handle ranks $r \le 8$. For sparse permutations (small $\alpha$), dealing with higher ranks becomes possible [30, 32]. The $\ell_1$-RR algorithm of [30] applies an $\ell_1$ robust linear regression relaxation, and it works when $\alpha \le 0.5$. Another approach is the Pseudo-Likelihood method (PL) of [32], which fits a two-component mixture density for each entry of $\tilde{x}$, one component accounting for fixed data and the other for permuted data. The fitting is done via a combination of hypothesis testing, reweighted least-squares, and alternating minimization. Empirically, PL can tolerate up to $\alpha = 0.7$, but it is sensitive to the particular basis of $\mathcal{S}^*$ used to generate $\tilde{x}$.

Robust PCA with outliers. PCA methods with robustness to outliers will also play a role in this paper. Among a large literature, we review four modern methods. In that context, it is assumed that $\tilde{X}$ can be partitioned into inlier points that lie in an unknown $r$-dimensional subspace $\mathcal{S}^* \subseteq \mathbb{R}^m$ and outlier points that lie away from $\mathcal{S}^*$, and the goal is to recover $\mathcal{S}^*$ from $\tilde{X}$. The Outlier Pursuit (OP) method of [43] decomposes $\tilde{X}$ via convex optimization into the sum of low-rank and column-sparse parts. The convex method of [45], a successor of [33], referred to as Self-Expr, solves a self-expressive elastic net problem so that each $\tilde{x}_j$ is expressed as an $\ell_2$-regularized sparse linear combination of the other points. The self-expressive coefficients define transition probabilities of a random walk on a graph, and the average of the $t$-step transition probability distributions is used as a score for inliers vs. outliers, with higher scores expected for the former. The Coherence Pursuit (CoP) method of [25] is based on the following simple but effective principle: with $\tilde{X}_{-j}$ the matrix $\tilde{X}$ with column $j$ removed, for each $\tilde{x}_j$ one computes its coherence $\tilde{X}_{-j}^\top \tilde{x}_j$ with the rest of the points. As it turns out, inliers tend to have coherences of higher $\ell_2$-norm than outliers, and the top-$r$ $\tilde{x}_j$'s are taken to span $\hat{\mathcal{S}}$. Finally, the Dual Principal Component Pursuit (DPCP) method of [37] solves an $\ell_1$ non-smooth problem on the Grassmannian via iteratively reweighted least-squares (IRLS), to compute an orthonormal basis for the orthogonal complement of the subspace. A dual formulation with a randomized singular value decomposition incorporated in the IRLS procedure is known as Fast Median Subspace [18]. In sharp contrast to other convex robust-PCA methods, DPCP has been shown in [48, 9] to tolerate as many outliers as the square of the number of inliers.

2 Theoretical Foundations

2.1 Problem Formulation

Let us denote by $\mathcal{P}_m$ the set of all permutations of coordinates of $\mathbb{R}^m$. We let $X^* = [x_1^* \cdots x_n^*] \in \mathbb{R}^{m \times n}$ be our ground-truth data matrix with rank $r < \min\{m, n\}$ and column-space $\mathcal{S}^* = \mathcal{C}(X^*)$, and we suppose that the available data are
$$\tilde{X} = [\tilde{x}_1 \cdots \tilde{x}_n] = [\Pi_1^* x_1^* \cdots \Pi_n^* x_n^*] \in \mathbb{R}^{m \times n}, \qquad (1)$$
where each $\Pi_j^* \in \mathcal{P}_m$ is an unknown permutation. Let $\mathcal{P}_m^n = \prod_{j \in [n]} \mathcal{P}_m$ be $n$ ordered copies of $\mathcal{P}_m$, where $[n] = \{1, \ldots, n\}$. For $\pi = (\Pi_1, \ldots, \Pi_n) \in \mathcal{P}_m^n$ we set $\pi(\tilde{X}) = [\Pi_1 \tilde{x}_1 \cdots \Pi_n \tilde{x}_n]$. We pose Unlabeled Principal Component Analysis (UPCA) as the following rank minimization problem:
$$\min_{\pi \in \mathcal{P}_m^n} \operatorname{rank} \pi(\tilde{X}). \qquad (2)$$
First, note that (2) never has a unique solution, because if $\pi = (\Pi_1, \ldots, \Pi_n)$ is a solution, then so is $\pi' = (\Pi \Pi_1, \ldots, \Pi \Pi_n)$, where $\Pi \in \mathcal{P}_m$ is any permutation. This reveals an inherent ambiguity of UPCA: it is only possible to recover $X^*$ from $\tilde{X}$ up to a permutation $\Pi X^*$ of its rows.
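For intuition, formulation (2) and its row-permutation ambiguity can be checked by brute force on a tiny instance. The following NumPy sketch (all sizes and variable names are illustrative, and brute force is of course only feasible for very small $m$ and $n$) verifies that the minimum of (2) equals $r$ and that every minimizer is a row-permutation of $X^*$:

```python
import numpy as np
from itertools import permutations, product

rng = np.random.default_rng(0)
m, n, r = 4, 3, 1                                      # tiny, so brute force is feasible

# generic rank-r ground truth X* and its per-column permuted version X~
X_star = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
X_tilde = np.stack([rng.permutation(X_star[:, j]) for j in range(n)], axis=1)

def num_rank(M, tol=1e-9):
    """Numerical rank via singular values."""
    return int(np.sum(np.linalg.svd(M, compute_uv=False) > tol))

perms = list(permutations(range(m)))                   # the m! coordinate permutations

# enumerate P_m^n, i.e. one permutation per column, and record rank(pi(X~))
candidates = []
for pi in product(perms, repeat=n):
    M = np.stack([X_tilde[list(p), j] for j, p in enumerate(pi)], axis=1)
    candidates.append((num_rank(M), M))

best = min(rk for rk, _ in candidates)
minimizers = [M for rk, M in candidates if rk == best]
row_perms = [np.eye(m)[list(p)] @ X_star for p in perms]

print("minimum rank over all per-column permutations:", best)          # equals r
print("every minimizer is a row-permutation of X*:",
      all(any(np.allclose(M, P) for P in row_perms) for M in minimizers))
```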
On the other hand, this is rather harmless in many situations, since $\Pi X^*$ is the same dataset as $X^*$ except that the row-features now appear in a different order. Thus, our hope in formulating (2) is that the only solutions are of the form $\pi = (\Pi \Pi_1^{*-1}, \ldots, \Pi \Pi_n^{*-1})$ with $\Pi$ ranging across $\mathcal{P}_m$ and the $\Pi_j^*$'s as in (1). However, without any other assumptions on the data $X^*$, there could in principle be additional undesired permutations that also attain rank equal to $\operatorname{rank} X^*$, or even worse, the minimum rank in (2) could be lower than $r = \operatorname{rank} X^*$. Our results show that for generic enough data, such pathological situations do not occur, and the only solutions to (2) are the ones associated with row-permutations of $X^*$.

2.2 Elements of Algebraic Geometry

Before stating our results, we make the notion of generic precise using some basic algebraic geometry [8, 15]. Let $Z = (z_{ij})$ be an $m \times n$ matrix of variables $z_{ij}$ and $\mathbb{R}[Z] = \mathbb{R}[z_{ij} : i \in [m], j \in [n]]$ the ring of polynomials in the $z_{ij}$'s with real coefficients. An algebraic variety of $\mathbb{R}^{m \times n}$ is the set of solutions of a polynomial system of equations in $\mathbb{R}[Z]$. In particular, the $(r+1) \times (r+1)$ determinants of $Z$ are polynomials of degree $r+1$ in the $z_{ij}$'s and define the algebraic variety
$$\mathcal{M}_r = \{ X \in \mathbb{R}^{m \times n} \mid \operatorname{rank} X \le r \},$$
since a matrix $X \in \mathbb{R}^{m \times n}$ has rank at most $r$ if and only if all its $(r+1) \times (r+1)$ determinants are zero.

The algebraic variety $\mathcal{M}_r$ admits a topology, called the Zariski topology, which makes it convenient to work with. The closed sets in this topology are the algebraic subvarieties of $\mathcal{M}_r$. These are sets of matrices of rank at most $r$ which in addition satisfy certain other polynomial equations in $\mathbb{R}[Z]$. For example, the set of matrices of rank at most $r-1$ is a proper closed subset of $\mathcal{M}_r$, because in addition to the equations defining $\mathcal{M}_r$, it is further defined by requiring all $r \times r$ determinants to be zero. Open sets in $\mathcal{M}_r$ are defined as complements of closed sets, or equivalently they are defined by requiring that certain sets of polynomials do not all vanish simultaneously. For example, the set of matrices of rank exactly equal to $r$ is a proper open subset of $\mathcal{M}_r$ defined by the non-simultaneous vanishing of all $r \times r$ determinants of $Z$; a matrix has rank $r$ if and only if all $(r+1) \times (r+1)$ determinants are zero and at least one $r \times r$ determinant is non-zero.

Now, the algebraic variety $\mathcal{M}_r$ is irreducible in the sense that it cannot be described as the union of two proper algebraic subvarieties of it. A consequence of this is that non-empty open sets of $\mathcal{M}_r$ have the very important property of being topologically dense. This means that given a non-empty open set $U \subseteq \mathcal{M}_r$ and a point $X \in \mathcal{M}_r$, every neighborhood of $X$ intersects $U$. It follows that under any non-degenerate continuous probability measure on $\mathcal{M}_r$, a non-empty Zariski-open set of $\mathcal{M}_r$ has measure 1. For example, the set of matrices in $\mathcal{M}_r$ of rank $r$ is non-empty and open, and thus it is dense. Hence a randomly sampled matrix in $\mathcal{M}_r$ under a continuous probability measure will have rank $r$ with probability 1. We refer to such a fact by saying that a generic matrix in $\mathcal{M}_r$ has rank $r$. More generally:

Definition 1. We say that a generic matrix in $\mathcal{M}_r$ satisfies a property, if the property is true for every matrix in a non-empty open subset of $\mathcal{M}_r$.

2.3 UPCA is a Well-Posed Problem

Our main theoretical result is the following; the proofs of Theorems 1-3 can be found in [44].

Theorem 1. For $X^*$ a generic matrix in $\mathcal{M}_r$, we have that $\operatorname{rank} \pi(\tilde{X}) \ge r$ for any $\pi \in \mathcal{P}_m^n$, with equality if and only if $\pi(\tilde{X}) = \Pi X^*$ for some $\Pi \in \mathcal{P}_m$.
Theorem 1 says that for $X^* \in \mathcal{M}_r$ generic, and up to a permutation of the coordinates of $\mathbb{R}^m$, $\mathcal{S}^*$ is the unique $r$-dimensional subspace that explains the data $\tilde{X}$ in the UPCA sense, and $r = \operatorname{rank} X^*$ is the minimum objective value in (2).

2.4 UPCA is an Algebraic Problem

How can one go about solving the discrete optimization problem (2)? In general, brute-force selection of the $\Pi_j$'s has complexity $\mathcal{O}((m!)^n)$, which is out of the question. On the other hand, problem (2) has a rich algebraic structure, which we harness by showing that $X^*$, up to a permutation of its rows, arises as the unique solution to a polynomial system of equations.

To begin with, for each $j \in [n]$ and each $\ell \in [m]$, we define the following column-wise symmetric polynomials of $\mathbb{R}[Z]$:
$$p_{\ell,j}(Z) := \sum_{i \in [m]} z_{ij}^{\ell}, \qquad \bar{p}_{\ell,j}(Z) := p_{\ell,j}(Z) - p_{\ell,j}(\tilde{X}).$$
Note that $p_{\ell,j}(\pi(Z)) = p_{\ell,j}(Z)$ for any $\pi \in \mathcal{P}_m^n$, and thus $p_{\ell,j}(\tilde{X}) = p_{\ell,j}(X^*)$.

Now let us think of $X^* \in \mathcal{M}_r$ as a product of two matrices of size $m \times r$ and $r \times n$, and let us define another polynomial ring with variables associated to these two factors. For $i = r+1, \ldots, m$, $k \in [r]$ and $j \in [n]$, we let $b_{ik}, c_{kj}$ be a new set of variables over $\mathbb{R}$. Organize the $b_{ik}$'s to occupy the $(m-r) \times r$ bottom block of an $m \times r$ matrix $B$ whose top $r \times r$ block is the identity matrix of size $r$, and the $c_{kj}$'s into an $r \times n$ matrix $C = (c_{kj})$. For $i \in [m]$, we write $b_i^\top$ for the $i$-th row of $B$; for $j \in [n]$, we write $c_j$ for the $j$-th column of $C$. With $\tilde{x}_{ij}, x_{ij}^*$ the $i$-th coordinate of $\tilde{x}_j, x_j^*$ respectively, we obtain polynomials $q_{\ell,j}$ for $\ell \in [m]$, $j \in [n]$ of $\mathbb{R}[B, C]$ by substituting $z_{ij} \mapsto b_i^\top c_j$ in the $\bar{p}_{\ell,j}(Z)$'s above:
$$q_{\ell,j}(B, C) := p_{\ell,j}(BC) - p_{\ell,j}(\tilde{X}) = \sum_{i \in [m]} (b_i^\top c_j)^{\ell} - \sum_{i \in [m]} \tilde{x}_{ij}^{\ell} = \sum_{i \in [m]} (b_i^\top c_j)^{\ell} - \sum_{i \in [m]} (x_{ij}^*)^{\ell}.$$
The set of common roots of all the $q_{\ell,j}$'s defines an algebraic variety that depends only on $X^*$, given by
$$\mathcal{Y}_{X^*} := \left\{ (B', C') \in \mathbb{R}^{m \times r} \times \mathbb{R}^{r \times n} \mid q_{\ell,j}(B', C') = 0, \ \ell \in [m], \ j \in [n]; \ B'_{[r],[r]} = I_r \right\},$$
where $B'_{[r],[r]} = I_r$ signifies that the top $r \times r$ block of $B' \in \mathbb{R}^{m \times r}$ is the identity matrix.

Let us get a feeling for the points of $\mathcal{Y}_{X^*}$. With $\Pi \in \mathcal{P}_m$, if the column-space $\mathcal{C}(\Pi X^*)$ of $\Pi X^*$ does not drop dimension upon projection onto the first $r$ coordinates, then there exists a unique basis $B_\Pi^*$ of $\mathcal{C}(\Pi X^*)$ with the identity matrix occurring at the top $r \times r$ block. In that case, there is a unique factorization $\Pi X^* = B_\Pi^* C_\Pi^*$, and the point $(B_\Pi^*, C_\Pi^*)$ lies in the variety $\mathcal{Y}_{X^*}$ because
$$q_{\ell,j}(B_\Pi^*, C_\Pi^*) = p_{\ell,j}(B_\Pi^* C_\Pi^*) - p_{\ell,j}(\tilde{X}) = p_{\ell,j}(\Pi X^*) - p_{\ell,j}(X^*) = p_{\ell,j}(X^*) - p_{\ell,j}(X^*) = 0.$$
Our second result says that if $X^*$ is generic, then all points of $\mathcal{Y}_{X^*}$ are of this type. That is, they correspond to factorizations $B_\Pi^* C_\Pi^*$ of $\Pi X^*$ as $\Pi$ varies across all permutations:

Theorem 2. For a generic matrix $X^*$ in $\mathcal{M}_r$ we have
$$\mathcal{Y}_{X^*} = \left\{ (B_\Pi^*, C_\Pi^*) \in \mathbb{R}^{m \times r} \times \mathbb{R}^{r \times n} \mid \Pi \in \mathcal{P}_m; \ B^*_{\Pi,[r],[r]} = I_r; \ \Pi X^* = B_\Pi^* C_\Pi^* \right\}.$$

Thanks to Theorem 2 we have the following important conceptual finding. Assuming $X^*$ is generic, to obtain $X^*$ up to some permutation of its rows from $\tilde{X}$, one needs to compute an arbitrary root $(B', C')$ of the polynomial system of equations
$$q_{\ell,j}(B, C) = 0, \quad \ell \in [m], \ j \in [n], \qquad (3)$$
and multiply its factors to get $B'C'$. Developing a polynomial system solver for UPCA would involve two main challenges: attaining robustness to noise and scalability, with the former typically easier to deal with than the latter. We leave such an endeavor to future research.
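As a sanity check of the algebra above (not a solver), the following NumPy sketch verifies numerically, on a small random instance, that the power sums $p_{\ell,j}$ are invariant under per-column permutations and that the factorization $(B_\Pi^*, C_\Pi^*)$ of a row-permuted $\Pi X^*$ is indeed a root of the system (3); all sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 10, 3

# generic rank-r ground truth X* and its per-column permuted version X~
X_star = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
X_tilde = np.stack([rng.permutation(X_star[:, j]) for j in range(n)], axis=1)

def power_sums(X, max_degree):
    """p_{l,j}(X) = sum_i X[i, j]**l, returned as a (max_degree, n) array."""
    return np.stack([np.sum(X ** l, axis=0) for l in range(1, max_degree + 1)])

# invariance: p_{l,j}(X~) = p_{l,j}(X*) for every l in [m], j in [n]
assert np.allclose(power_sums(X_tilde, m), power_sums(X_star, m))

# factor a row permutation Pi X* as B C with the top r x r block of B equal to I_r
Pi = np.eye(m)[rng.permutation(m)]
PX = Pi @ X_star
C = PX[:r, :]                       # generically, the top r rows span the row space
B = PX @ np.linalg.pinv(C)          # then B[:r, :r] = I_r and B C = Pi X*
assert np.allclose(B[:r, :r], np.eye(r)) and np.allclose(B @ C, PX)

# q_{l,j}(B, C) = p_{l,j}(B C) - p_{l,j}(X~) vanishes at this point
q = power_sums(B @ C, m) - power_sums(X_tilde, m)
print("max |q_{l,j}| at (B, C):", np.abs(q).max())   # ~ 0 up to round-off
```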
2.5 UPCA with Dominant Permutations

We close this section with a special case of interest, where part of the data have undergone the same dominant permutation; in fact, given the inherent ambiguity of UPCA discussed above, we may as well take this dominant permutation to be the identity matrix $I_m$ of size $m \times m$. To make this precise, we define the multiplicity $\mu(\Pi)$ of a permutation $\Pi \in \mathcal{P}_m$ to be the number of times that $\Pi$ appears as $\Pi = \Pi_j^*$ in (1) with $j$ ranging in $[n]$. We have:

Theorem 3. Suppose that $\mu(I_m) \ge r + 1$ while $\mu(\Pi) < r$ for any other $\Pi \neq I_m$. Then for a generic $X^* \in \mathcal{M}_r$, we have that $\mathcal{S}^*$ is the unique solution to the following consensus maximization problem
$$\max_{\dim \mathcal{S} = r} \ \#\{ \tilde{x}_j \mid \tilde{x}_j \in \mathcal{S}; \ j \in [n] \}, \qquad (4)$$
where $\#$ denotes the cardinality of a set, and the maximization is taken over all subspaces $\mathcal{S} \subseteq \mathbb{R}^m$ of dimension $r$.

Theorem 3 says that for sufficiently generic ground-truth data $X^*$, the given data $\tilde{X}$ admit a natural partition into a set of inliers and a set of outliers with respect to the linear subspace $\mathcal{S}^*$:
$$\tilde{\mathcal{X}}_{\mathrm{in}} := \{ \tilde{x}_j \mid \tilde{x}_j \in \mathcal{S}^* \}, \qquad \tilde{\mathcal{X}}_{\mathrm{out}} := \{ \tilde{x}_j \mid \tilde{x}_j \notin \mathcal{S}^* \}.$$
Of course we do not know what the partition into inliers and outliers is, because we do not know what $\mathcal{S}^*$ is. But the presence of this geometric structure is enough for PCA methods with robustness to outliers to operate on $\tilde{X}$ in order to estimate $\mathcal{S}^*$. The rest of the paper proceeds algorithmically building on this insight.

3 Algorithms

3.1 Two-stage algorithmic pipeline for UPCA

We saw in the previous section that the UPCA problem (2) is well-defined (Theorem 1) and in principle solvable by a polynomial system of equations (Theorem 2). However, this polynomial system is at the moment intractable to solve even for moderate dimensions. On the other hand, Theorem 3 suggests the following algorithmic pipeline for the case where there is a dominant permutation, which we can take to be the identity as in the theorem. At Stage-I of the pipeline, a PCA method with robustness to outliers [43, 33, 25, 45, 37, 48, 18] is employed to produce an estimate $\hat{\mathcal{S}}$ of $\mathcal{S}^*$ from $\tilde{X}$. At Stage-II of the pipeline, one feeds $\hat{\mathcal{S}}$ and $\tilde{X}$ to an unlabeled sensing method [30, 32, 36, 23] which operates point by point, returning for every $\tilde{x}_j$ an estimate $\hat{x}_j$ of $x_j^*$. Here one may choose to threshold the $\tilde{x}_j$'s based on their distance to $\hat{\mathcal{S}}$ and apply unlabeled sensing on the outliers only. Alternatively, if extra computational power is available for dispensing with choosing a threshold, one may apply unlabeled sensing on every $\tilde{x}_j$; it is this approach that we follow in the rest of the paper. This procedure is summarized in Algorithm 1.

Algorithm 1 Two-stage Algorithmic Pipeline for UPCA
1: Input: observed data matrix $\tilde{X}$, rank $r$
2: estimate $\hat{\mathcal{S}}$ of $\mathcal{S}^*$ $\leftarrow$ outlier-robust PCA on $\tilde{X}$  (Stage-I)
3: for $j = 1, \ldots, n$ do  (Stage-II)
4:   estimate $\hat{x}_j$ of $x_j^*$ $\leftarrow$ unlabeled sensing on $(\tilde{x}_j, \hat{\mathcal{S}})$
5: end for
6: return $\hat{X} = [\hat{x}_1, \ldots, \hat{x}_n]$, estimate of $X^*$

3.2 A new method for unlabeled sensing: Least-Squares with Recursive Filtration (LSRF)

Inasmuch as there are very few scalable unlabeled sensing methods, we here propose a simple but comparatively efficient alternative, Algorithm 2. This method is parameter-free. It alternates between ordinary least-squares and a dimensionality reduction step that removes the coordinate of the ambient space on which the residual error attains its maximal value, until $r$ coordinates are left. The number of iterations is fixed at $m - r$, so the overall complexity of Algorithm 2 is $\mathcal{O}(m^2 r^2)$.
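To fix ideas, here is a minimal NumPy sketch of this recursive filtration procedure (the formal pseudocode is given as Algorithm 2 below). The variable names are illustrative, and we assume the final least-squares fit on the $r$ surviving coordinates is mapped back through the full basis $B$ to produce the ambient-space estimate:

```python
import numpy as np

def lsrf(x_tilde, B):
    """Sketch of least-squares with recursive filtration.

    x_tilde : (m,) observation whose entries have been (partially) permuted
    B       : (m, r) basis of the (estimated) subspace
    Returns an (m,) estimate of the original, un-permuted point.
    """
    m, r = B.shape
    v, A = x_tilde.astype(float).copy(), B.copy()
    for _ in range(m - r):                               # remove m - r coordinates
        c, *_ = np.linalg.lstsq(A, v, rcond=None)        # ordinary least-squares fit
        worst = np.argmax(np.abs(v - A @ c))             # coordinate with largest residual
        v = np.delete(v, worst)
        A = np.delete(A, worst, axis=0)
    c, *_ = np.linalg.lstsq(A, v, rcond=None)            # fit on the r surviving coordinates
    return B @ c                                         # reconstruct in the ambient space

# toy usage: sparsely permute a subspace point and try to restore it
rng = np.random.default_rng(1)
m, r, alpha = 50, 5, 0.1
B = np.linalg.qr(rng.standard_normal((m, r)))[0]
x_star = B @ rng.standard_normal(r)
x_tilde = x_star.copy()
moved = rng.choice(m, size=int(alpha * m), replace=False)
x_tilde[moved] = x_tilde[rng.permutation(moved)]         # permute a fraction alpha of entries
print(np.linalg.norm(lsrf(x_tilde, B) - x_star) / np.linalg.norm(x_star))
```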
We here also mention the time complexity of other unlabeled sensing methods for comparison: AIEM has complexity approximately $\mathcal{O}(m + r^r)$, PL has complexity at least $\mathcal{O}(kmr^3)$ via second-order optimization, and $\ell_1$-RR has complexity at least $\mathcal{O}(mr^3 + k(m + r^2))$ via sub-gradient descent, where $k$ is the number of iterations.

Algorithm 2 Unlabeled Sensing via Least-Squares with Recursive Filtration (LSRF)
1: Input: permuted point $\tilde{x}_j$, basis $B$ of subspace $\hat{\mathcal{S}}$
2: $v^{(0)} \leftarrow \tilde{x}_j$, $A^{(0)} \leftarrow B$
3: for $k = 1, \ldots, m - r$ do
4:   $c \leftarrow (A^{(k-1)})^{\dagger} v^{(k-1)}$
5:   $i^{\star} \leftarrow \arg\max_i \, |v_i^{(k-1)} - A_{i,:}^{(k-1)} c|$
6:   remove the $i^{\star}$-th entry of $v^{(k-1)}$ to get $v^{(k)}$
7:   remove the $i^{\star}$-th row of $A^{(k-1)}$ to get $A^{(k)}$
8: end for
9: return $\hat{x}_j = B \, (A^{(m-r)})^{\dagger} v^{(m-r)}$, estimate for $x_j^*$

4 Experimental Evaluation

4.1 Robust PCA with permutation-induced outliers

We begin by assessing the performance of Stage-I of the pipeline. This entails understanding how different PCA methods with robustness to outliers behave when the outliers are induced by permutations, as in Theorem 3. We consider Self-Expr [45, 33], CoP [25], OP [43], and DPCP [37]; see the related work in Section 1. We fix $m = 50$, $n = 500$. With $\dim \mathcal{S}^* = r = 1 : 1 : 49$, we sample $\mathcal{S}^*$ at random from the Grassmannian $\operatorname{Gr}(r, m)$. Then $n$ points $x_j^*$ are sampled at random from the intersection of $\mathcal{S}^*$ with the unit sphere of $\mathbb{R}^m$ to yield $X^*$. Let $n_{\mathrm{in}}$ be the number of inliers $\tilde{\mathcal{X}}_{\mathrm{in}}$ and $n_{\mathrm{out}}$ the number of outliers $\tilde{\mathcal{X}}_{\mathrm{out}}$, with $n_{\mathrm{in}} + n_{\mathrm{out}} = n$. We consider outlier ratios $n_{\mathrm{out}}/n = 0.1 : 0.1 : 0.9$. With fixed $n_{\mathrm{out}}/n$, we set $\Pi_j^*$ to the identity for $j \in [n_{\mathrm{in}}]$ and set the $\Pi_j^*$'s for $j > n_{\mathrm{in}}$ as follows. We consider both dense ($\alpha = 1$) and sparse ($\alpha = 0.1$) permutations. For a fixed $\alpha$, we obtain $\Pi_j^*$ by randomly choosing $\alpha m$ coordinates for each $x_j^*$ and applying a random permutation on those coordinates. As evaluation metric we use the largest among all $r$ principal angles between $\mathcal{S}^*$ and $\hat{\mathcal{S}}$, denoted by $\theta_{\max}(\mathcal{S}^*, \hat{\mathcal{S}})$. Recall that $\theta_{\max}(\mathcal{S}^*, \hat{\mathcal{S}}) = 0$ if and only if $\mathcal{S}^* = \hat{\mathcal{S}}$.

Figure 1: $\theta_{\max}(\mathcal{S}^*, \hat{\mathcal{S}})$ in Stage-I: outlier ratio vs. rank phase transitions for various PCA methods with robustness to outliers (columns: Self-Expr [45, 33], CoP [25], OP [43], DPCP [37]; top row: random outliers).

Figure 1 depicts the outlier-ratio versus rank phase transitions, where, to calibrate the analysis with what we know about these methods from prior work, we have included in the top row of the figure the phase transitions for outliers randomly chosen from the unit sphere. Reading that top row we recall: i) DPCP has overall the best performance across all ranks and all outlier ratios, ii) OP correctly identifies $\mathcal{S}^*$ only in the low-rank, low-outlier-ratio regime, as expected from its conceptual formulation, and iii) CoP and Self-Expr, even though low-rank methods in spirit, have accuracy similar to each other and considerably better than OP. We also note that CoP is the fastest method, requiring 0.51sec for the computation of a single phase transition plot, Self-Expr is the slowest with 752sec, and DPCP and OP take 1.31sec and 5.62sec, respectively (experiments are run on an Intel(R) i7-8700K, 3.7 GHz, 16GB machine). Now let us look at what happens for permutation-induced outliers. For $\alpha = 1$, where the permutations move all the coordinates of the points they are corrupting, we see that the phase transition plots are practically the same as for random outliers.
In other words, obtaining the outliers by randomly permuting all coordinates of inlier points, with different permutations for different outliers, seems, even for low subspace dimensions, to yield an outlier set that is as generic for the task of subspace learning as outliers sampled randomly from the unit sphere. A second interesting phenomenon is observed when the permutation sparsity level is decreased to $\alpha = 0.1$. In that regime the methods exhibit two very different trends. On one hand, CoP and Self-Expr appear to break down, which is expected, because as the permutations become more sparse, the outlier points become more coherent with the rest of the data set. On the other hand, the accuracy of DPCP and OP improves for sparser permutations; a justification for this is that both methods are initialized via the SVD of $\tilde{X}$, which yields a subspace closer to $\mathcal{S}^*$ for smaller $\alpha$. An interesting research direction is to analyze the theoretical guarantees of these methods for this specific type of outliers.

4.2 UPCA on synthetic data

Next, we evaluate the UPCA pipeline of Algorithm 1 on synthetic data. We keep $m = 50$ as before, and add spherical noise corresponding to a fixed SNR of 40dB. We get the estimate $\hat{\mathcal{S}}$ of $\mathcal{S}^*$ via DPCP [37] in Stage-I and apply the unlabeled sensing methods [36, 23, 30, 32] and Algorithm 2 in Stage-II to get $\hat{X}$ from $\hat{\mathcal{S}}$ and $\tilde{X}$. We distinguish between dense and sparse permutations.

UPCA with dense permutations ($\alpha = 1$). Figure 2 depicts the relative estimation error of $\hat{X}$ for different outlier ratios from 75% (25 inliers) to 94% (6 inliers) and ranks $r = 3, 4, 5$. To assess the overall effect of the quality of $\hat{\mathcal{S}}$, we use two versions of AIEM and CCV-Min. The first, denoted by AIEM($\hat{\mathcal{S}}$) and CCV-Min($\hat{\mathcal{S}}$), uses as input the estimated subspace $\hat{\mathcal{S}}$, while the second version, AIEM($\mathcal{S}^*$) and CCV-Min($\mathcal{S}^*$), uses the ground-truth subspace $\mathcal{S}^*$. Note that the estimation error of AIEM($\mathcal{S}^*$)/CCV-Min($\mathcal{S}^*$) is independent of the outlier ratio. On the other hand, the estimation error of AIEM($\hat{\mathcal{S}}$)/CCV-Min($\hat{\mathcal{S}}$) depends on the outlier ratio through the computation of $\hat{\mathcal{S}}$. Indeed, $\hat{\mathcal{S}}$ is expected to be closer to $\mathcal{S}^*$ for smaller outlier ratios, as we already know from Figure 1. In particular, for up to 75% outliers the estimation error of AIEM($\hat{\mathcal{S}}$)/CCV-Min($\hat{\mathcal{S}}$) coincides with that of AIEM($\mathcal{S}^*$)/CCV-Min($\mathcal{S}^*$), indicating an accurate estimation of $\mathcal{S}^*$. At the other extreme, for 94% outliers both AIEM($\hat{\mathcal{S}}$) and CCV-Min($\hat{\mathcal{S}}$) break down, indicating that the estimation of $\mathcal{S}^*$ failed. Finally, note that CCV-Min has at least half an order of magnitude smaller estimation error than AIEM. This is due to our specific choice of the branch-&-bound CCV-Min parameters, which control the trade-off between accuracy and running time; for example, for $r = 3$ and 75% outliers, AIEM runs in 0.042sec with 1% error, while CCV-Min needs about 15sec to bound $\hat{X}$ to within 0.42% of $X^*$.

Figure 2: UPCA (Algorithm 1) for dense permutations ($\alpha = 1$) with $\hat{\mathcal{S}}$ produced by DPCP [37] at Stage-I and $\hat{X}$ produced by AIEM [36] or CCV-Min [23] at Stage-II (x-axis: outlier ratio from 75% to 94%).

UPCA with sparse permutations ($\alpha \le 0.6$). Figures 3b-3d show the relative estimation error of UPCA for $\alpha = 0.1 : 0.1 : 0.6$, rank $r = 1 : 1 : 25$, and outlier ratio fixed to 90%, with $\hat{\mathcal{S}}$ computed in Stage-I by DPCP [37] and $\hat{X}$ computed via $\ell_1$-RR [30], PL [32], or Algorithm 2 from $\tilde{X}$ and $\hat{\mathcal{S}}$ in Stage-II.
It is important to note that $r = 25 = m/2$ is the largest rank for which unique recovery of $X^*$ is theoretically possible [38, 39, 34, 35, 10, 24]. Figure 3a shows that $\theta_{\max}(\mathcal{S}^*, \hat{\mathcal{S}})$ always stays below $2^\circ$, indicating the success of DPCP. Now, $\ell_1$-RR and Algorithm 2 have similar accuracy, but Algorithm 2 is more efficient than $\ell_1$-RR, considering that computing $\hat{X}$ takes 0.3sec for Algorithm 2 and 1.5min for $\ell_1$-RR. Even though PL delivers $\hat{X}$ in 1sec, it does not perform as well, which we attribute to its sensitivity to the particular basis of $\mathcal{S}^*$ that is used to generate the data; this basis is not available here, since DPCP returns the specific basis of dual principal components.

Figure 3: Estimation error $\| \hat{X} - X^* \|_F / \| X^* \|_F$ of UPCA (Algorithm 1) for sparse permutations ($\alpha \le 0.6$) and outlier ratio 90%, with $\hat{\mathcal{S}}$ computed by DPCP [37] in Stage-I and $\hat{X}$ computed by $\ell_1$-RR [30], PL [32], or Algorithm 2 in Stage-II (panels: (a) $\theta_{\max}(\mathcal{S}^*, \hat{\mathcal{S}})$, (b) $\ell_1$-RR [30], (c) PL [32], (d) Algorithm 2; x-axis: $\alpha$).

4.3 UPCA on face images

In this section we offer a flavor of how the ideas discussed so far apply in a high-dimensional example with real data. We use the well-known database Extended Yale B [14], which contains fixed-pose face images of distinct individuals, with 64 images per individual under different illumination conditions. It is well-established that the images of each individual approximately span a low-dimensional subspace. It turns out that for our purpose the value $r = \dim \mathcal{S}^* = 4$ is good enough. Since each image has size $192 \times 168$, the images of each individual can be approximately seen as $n = 64$ points $x_j^*$, $j \in [64]$, of a 4-dimensional linear subspace $\mathcal{S}^*$, embedded in an ambient space of dimension $m = 32256$. In what follows we only deal with the images of a fixed individual. We consider four permutation types corresponding to fully or partially ($\alpha = 0.4$) permuting image patches of size $16 \times 24$ or $48 \times 42$, as shown in the second column of Figure 4. To generate a fixed number of $n_{\mathrm{out}} = 16$ outliers, only one out of the four permutation types is used for each trial. The original images (inliers) together with the ones that have undergone patch-permutation (outliers) are given without any inlier/outlier labels, and the task is to restore all corrupted images. This is a special case of visual permutation learning, recently considered using deep networks [26, 27].

Figure 4: UPCA on the face dataset Extended Yale B (columns: original, outlier, AIEM [36], $\ell_1$-RR [30], PL [32], Algorithm 2).

The first column of Figure 4 shows an original image, and the second column shows the corresponding outlier obtained by applying a sample permutation for each of the four different permutation types. Columns three to six give the corresponding point in the output of Algorithm 1 for different unlabeled sensing methods and $\hat{\mathcal{S}}$ computed by DPCP [37] (CCV-Min [23] is not included, as branch-&-bound becomes prohibitively expensive for such large $m$). Notably, AIEM [36] rather satisfactorily restores the original image regardless of permutation type. The performance of the other three methods is shown only for their operational regime of sparse permutations, and Algorithm 2 most accurately captures the illumination of the original image. Overall, we find these results encouraging, especially if one takes into consideration that the methods are very efficient, requiring only 0.2sec (AIEM), 7sec ($\ell_1$-RR), 0.2sec (PL) and 10sec (Algorithm 2), discounting the DPCP step, which costs 0.1sec, regardless of permutation type.
This is in contrast with existing deep network architectures for visual permutation learning, such as [27], which are based on branch-&-bound and thus have in principle an exponential complexity in the number of permuted patches.

4.4 UPCA on data re-identification

Finally, we evaluate the UPCA Algorithm 1 on the task of data re-identification (see Section 1) using real educational and medical records and simulated permutations for various sparsity levels $\alpha$, thus emulating a privacy protection scenario. Both of the datasets that we use contain no personally identifiable information. DPCP [37] computes $\hat{\mathcal{S}}$ in Stage-I, and $\ell_1$-RR [30], PL [32], or Algorithm 2 produces $\hat{X}$ in Stage-II.

The first dataset consists of the test scores of $m = 707$ high-school students on 6 subjects during two different periods, together with the sum of the test scores for each period, thus $n = 14$. For 7 out of the 14 tests, we apply random permutations of the student indices and thus have 50% outliers. With $r = 3$, the relative estimation errors on the score records are shown in Figure 5a. The black dashed line depicts the relative difference between the observed data $\tilde{X}$ and the original data $X^*$, which as expected increases for higher $\alpha$'s. Since $\ell_1$-RR, PL, and Algorithm 2 are by design suitable for sparse permutations, their performance naturally degrades for large $\alpha$. But up to $\alpha = 0.7$ the results align with our earlier findings, in that Algorithm 2 tends to have superior performance on average, and in that regime it also has the smallest variance.

Figure 5: Relative estimation error on real data in de-anonymization (panels: (a) high-school scores, (b) breast tumor features; x-axis: $\alpha$).

The second dataset consists of all the benign cases in Breast Cancer Wisconsin (Diagnostic) [12]. It has $m = 357$ patients and $n = 30$ features of a digitized image of a breast mass for each patient. We randomly permute the patient indices for 15 of the features, thus having 50% outliers, and set $r = 4$. Figure 5b shows the relative estimation error of $\hat{X}$ for various permutation sparsity levels $\alpha$, with the unlabeled sensing methods exhibiting the same trend as before. Remarkably, for $\alpha = 0.7$, the UPCA Algorithm 1 incorporating Algorithm 2 in Stage-II reduces the original error of the data $\tilde{X}$ from 32.24% to 6.35% in 0.5sec, as opposed to 15.90% and 19.57% when $\ell_1$-RR [30] or PL [32] are incorporated, respectively.

5 Discussion

Some interesting conclusions can be drawn from this work regarding privacy protection. First, it appears not to be secure for data providers to only partially permute the data, since, as we have seen, permuting only a subset of the features enables re-identification. Re-identification is also possible if a database is fully permuted but another uncorrupted database with the same subjects is accessible or present in an assembled dataset. On the other hand, arbitrary dense permutations are extremely difficult to recover, as one would need to solve an intractable polynomial system. A related interesting direction for future research is the case where, on top of permutations, one has missing entries, a problem that we term unlabeled matrix completion. The missing entries may arise naturally, since for example not all subjects are common across databases, but they could also serve as part of a privacy protection protocol. [44] discusses the algebraic structure of unlabeled matrix completion and gives results regarding finite recovery.
Acknowledgement

This work was funded by ShanghaiTech start-up grant 2017F0203-000-16.

References

[1] A. Abid, J. Zou. A stochastic expectation-maximization approach to shuffled linear regression. In Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018.
[2] J. M. Abowd. Stepping-up: The Census Bureau tries to be a good data steward in the 21st century. Plenary talk, International Conference on Machine Learning, 2019.
[3] M. Antoni, R. Schnell. The past, present and future of the German Record Linkage Center. Jahrbücher für Nationalökonomie und Statistik, 239(2):319–331, 2019.
[4] D. Bertsimas, M. L. Li. Fast exact matrix completion: A unified optimization framework for matrix completion. Journal of Machine Learning Research, 21(231):1–43, 2020.
[5] P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, P. S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, 1990.
[6] E. J. Candès, X. Li, Y. Ma, J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.
[7] E. J. Candès, B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[8] D. Cox, J. Little, D. O'Shea. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer Science & Business Media, 2013.
[9] T. Ding, Z. Zhu, R. Vidal, D. P. Robinson. Dual principal component pursuit for robust subspace learning: Theory and algorithms for a holistic approach. In International Conference on Machine Learning, 2021.
[10] I. Dokmanic. Permutations unlabeled beyond sampling unknown. IEEE Signal Processing Letters, 26(6):823–827, 2019.
[11] J. Domingo-Ferrer, K. Muralidhar. New directions in anonymization: Permutation paradigm, verifiability by subjects and intruders, transparency to users. Information Sciences, 337–338:11–24, 2016.
[12] D. Dua, C. Graff. UCI machine learning repository, 2017.
[13] I. P. Fellegi, A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[14] A. S. Georghiades, P. N. Belhumeur, D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
[15] J. Harris. Algebraic Geometry: A First Course, volume 133. Springer Science & Business Media, 1992.
[16] X. He, Y. Xiao, Y. Li, Q. Wang, W. Wang, B. Shi. Permutation anonymization: Improving anatomy for privacy preservation in data publication. In PAKDD Workshops, 2011.
[17] D. Hsu, K. Shi, X. Sun. Linear regression without correspondence. In Advances in Neural Information Processing Systems, 2017.
[18] G. Lerman, T. Maunu. Fast, robust and non-convex subspace recovery. Information and Inference: A Journal of the IMA, 7(2):277–336, 2018.
[19] S. Marano, P. Willett. Making decisions by unlabeled bits. IEEE Transactions on Signal Processing, 68:2935–2947, 2020.
[20] K. Muralidhar. Record re-identification of swapped numerical microdata. Journal of Information Privacy and Security, 13(1):34–45, 2017.
[21] A. Nejatbakhsh, E. Varol. Neuron matching in C. elegans with robust approximate linear regression without correspondence. In IEEE/CVF Winter Conference on Applications of Computer Vision, 2021.
[22] A. Pananjady, M. J. Wainwright, T. A. Courtade. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286–3300, 2017.
[23] L. Peng, M. C. Tsakiris. Linear regression without correspondences via concave minimization. IEEE Signal Processing Letters, 27:1580–1584, 2020.
[24] L. Peng, M. C. Tsakiris. Homomorphic sensing of subspace arrangements. Applied and Computational Harmonic Analysis, 55:466–485, 2021.
[25] M. Rahmani, G. K. Atia. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23):6260–6275, 2017.
[26] R. Santa Cruz, B. Fernando, A. Cherian, S. Gould. DeepPermNet: Visual permutation learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[27] R. Santa Cruz, B. Fernando, A. Cherian, S. Gould. Visual permutation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):3100–3114, 2019.
[28] A. Schmaltz, A. M. Rush, S. Shieber. Word ordering without syntax. In Conference on Empirical Methods in Natural Language Processing, 2016.
[29] T. Shen, T. Lei, R. Barzilay, T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, 2017.
[30] M. Slawski, E. Ben-David. Linear regression with sparsely permuted data. Electronic Journal of Statistics, 13(1):1–36, 2019.
[31] M. Slawski, E. Ben-David, P. Li. Two-stage approach to multivariate linear regression with sparsely mismatched data. Journal of Machine Learning Research, 21(204):1–42, 2020.
[32] M. Slawski, G. Diao, E. Ben-David. A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 0(0):1–31, 2021.
[33] M. Soltanolkotabi, E. J. Candès. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195–2238, 2012.
[34] M. C. Tsakiris. Determinantal conditions for homomorphic sensing. arXiv preprint arXiv:1812.07966v6, 2018.
[35] M. C. Tsakiris, L. Peng. Homomorphic sensing. In International Conference on Machine Learning, 2019.
[36] M. C. Tsakiris, L. Peng, A. Conca, L. Kneip, Y. Shi, H. Choi. An algebraic-geometric approach for linear regression without correspondences. IEEE Transactions on Information Theory, 66(8):5130–5144, 2020.
[37] M. C. Tsakiris, R. Vidal. Dual principal component pursuit. The Journal of Machine Learning Research, 19(1):684–732, 2018.
[38] J. Unnikrishnan, S. Haghighatshoar, M. Vetterli. Unlabeled sensing: Solving a linear system with unordered measurements. In Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2015.
[39] J. Unnikrishnan, S. Haghighatshoar, M. Vetterli. Unlabeled sensing with random linear measurements. IEEE Transactions on Information Theory, 64(5):3237–3253, 2018.
[40] N. Vaswani, T. Bouwmans, S. Javed, P. Narayanamurthy. Robust subspace learning: Robust PCA, robust subspace tracking, and robust subspace recovery. IEEE Signal Processing Magazine, 35(4):32–55, 2018.
[41] G. Wang, S. Marano, J. Zhu, Z. Xu. Target localization by unlabeled range measurements. IEEE Transactions on Signal Processing, 68:6607–6620, 2020.
[42] Y. Xie, Y. Mao, S. Zuo, H. Xu, X. Ye, T. Zhao, H. Zha. A hypergradient approach to robust regression without correspondence. In International Conference on Learning Representations, 2021.
[43] H. Xu, C. Caramanis, S. Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
[44] Y. Yao, L. Peng, M. C. Tsakiris. Unlabeled principal component analysis. arXiv preprint arXiv:2101.09446, 2021.
[45] C. You, D. P. Robinson, R. Vidal. Provable self-representation based outlier detection in a union of subspaces. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[46] H. Zhang, P. Li. Optimal estimator for unlabeled linear regression. In International Conference on Machine Learning, 2020.
[47] T. Zhang, Y. Yang. Robust PCA by manifold optimization. Journal of Machine Learning Research, 19(1):3101–3139, 2018.
[48] Z. Zhu, Y. Wang, D. P. Robinson, D. Naiman, R. Vidal, M. C. Tsakiris. Dual principal component pursuit: Improved analysis and efficient algorithms. In Advances in Neural Information Processing Systems, 2018.