# Noisy Dual Principal Component Pursuit

Tianyu Ding\*¹, Zhihui Zhu\*², Tianjiao Ding³, Yunchen Yang³, René Vidal², Manolis C. Tsakiris³, Daniel P. Robinson¹

Dual Principal Component Pursuit (DPCP) is a recently proposed non-convex optimization based method for learning subspaces of high relative dimension from noiseless datasets contaminated by as many outliers as the square of the number of inliers. Experimentally, DPCP has proved to be robust to noise and to outperform the popular RANSAC on 3D vision tasks such as road plane detection and relative pose estimation from three views. This paper extends the global optimality and convergence theory of DPCP to the case of data corrupted by noise, and further demonstrates its robustness using synthetic and real data.

1. Introduction

Dual Principal Component Pursuit (DPCP) is a recently proposed method for learning a linear subspace $\mathcal{S} \subset \mathbb{R}^D$ from a dataset $\tilde{\mathcal{X}} \in \mathbb{R}^{D \times L}$ contaminated by outliers (Tsakiris & Vidal, 2015; 2017; 2018a; Zhu et al., 2018a;b). Specifically, DPCP minimizes an $\ell_1$ co-sparse objective on the sphere:

$$\min_{b} \ \|\tilde{\mathcal{X}}^\top b\|_1 \quad \text{s.t.} \quad \|b\|_2 = 1. \tag{1}$$

The aim is to estimate a basis for the orthogonal complement of the subspace, hence the attribute *dual*. As such, DPCP is ideally suited for subspaces of high relative dimension, i.e., subspaces with dimension $d$ such that $d/D$ is close to one. A typical example is the case of hyperplanes ($d = D - 1$), which very commonly appear in 3D computer vision applications such as detecting planar structures in 3D point clouds (Geiger et al., 2013; Silberman et al., 2012) or estimating relative poses in multiple-view geometry (Hartley & Zisserman, 2000).

\*Equal contribution. ¹Department of Applied Mathematics & Statistics, Johns Hopkins University, USA. ²Mathematical Institute for Data Science, Johns Hopkins University, USA. ³School of Information Science and Technology, ShanghaiTech University, China. Correspondence to: Tianyu Ding, Zhihui Zhu. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

In the high relative dimension regime, state-of-the-art convex optimization based methods relying on sparse and low-rank representations (Xu et al., 2010; Soltanolkotabi & Candès, 2012; Rahmani & Atia, 2017; You et al., 2017) typically exhibit a significant decrease in performance. On the other hand, since its inception almost 40 years ago, the Random Sample Consensus (RANSAC) algorithm (Fischler & Bolles, 1981) has been one of the most popular methods in computer vision for the high relative dimension setting. RANSAC alternates between fitting a subspace to a randomly sampled minimal number of points ($D - 1$ in the case of a hyperplane) and using the number of data points close to the candidate subspace as a measure of its quality. The interplay between four factors governs when RANSAC is successful: the ambient dimension, the outlier ratio, the thresholding parameter that determines when points are considered close to a subspace, and the allocated time budget. In particular, RANSAC can be extremely effective when the probability of sampling an outlier-free minimal sample within the allocated time budget is large.
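To make the RANSAC baseline concrete, the following is a minimal sketch of RANSAC for a hyperplane through the origin. The threshold, iteration budget, and function names are illustrative and not taken from any particular implementation discussed in this paper.

```python
import numpy as np

def ransac_hyperplane(X, thresh=0.05, budget_iters=100, seed=None):
    """Minimal RANSAC for a hyperplane through the origin in R^D.

    X: (D, L) data matrix with unit-norm columns. Returns the unit normal
    vector whose hyperplane has the largest consensus set.
    """
    rng = np.random.default_rng(seed)
    D, L = X.shape
    best_b, best_inliers = None, -1
    for _ in range(budget_iters):
        idx = rng.choice(L, size=D - 1, replace=False)      # minimal sample
        # Candidate normal: the direction orthogonal to the sampled points,
        # i.e. the least right singular vector of the (D-1) x D sample matrix.
        _, _, Vt = np.linalg.svd(X[:, idx].T, full_matrices=True)
        b = Vt[-1]
        inliers = np.sum(np.abs(b @ X) < thresh)             # consensus size
        if inliers > best_inliers:
            best_b, best_inliers = b, inliers
    return best_b, best_inliers
```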
Recently, Tsakiris & Vidal (2018a) showed that an Iteratively Reweighted Least Squares algorithm (DPCP-IRLS) for solving the non-convex DPCP problem (1) can successfully handle 30%–50% outliers in the three-view geometry problem, while state-of-the-art RANSAC variants fail when given the running time of DPCP-IRLS as a time budget. Even more recently, Zhu et al. (2018a) demonstrated that a certain projected subgradient method (DPCP-PSGM) solves (1) to global optimality using only matrix-vector multiplications, and correctly performs road plane detection from a 3D cloud of approximately $10^5$ points with 50% outliers in just a few hundred milliseconds, a time window in which RANSAC can only perform a few iterations and thus fails. These results highlight the significance of DPCP as a potential alternative to RANSAC.

In the terminology of the review paper of Lerman & Maunu (2018), DPCP is in effect a least absolute deviations subspace learning method. Such methods compute the subspace by minimizing the sum of the distances between all points in the dataset and the subspace; this is precisely the formulation (1) when the subspace is a hyperplane. For example, REAPER (Lerman et al., 2015) applies a convex relaxation that is solved via an IRLS scheme (Zhang & Lerman, 2014; Zhang, 2016). Although REAPER is known to perform competitively, the regime in which its theoretical guarantees hold excludes the high relative dimension setting, which we conjecture is a consequence of using a convex relaxation. The work closest to DPCP is that of Maunu et al. (2019), which studies a Geodesic Gradient Descent (GGD) method for solving the least absolute deviations problem without any relaxation. GGD is shown to converge to the global optimum at a sublinear rate and to handle $M = O(N)$ outliers with $N$ inliers; the latter property is common to many robust PCA methods (Lerman & Maunu, 2018). In contrast, Zhu et al. (2018a) showed that under a noiseless spherical statistical model, any global minimizer of (1) is a normal vector to the subspace as long as $M = O(N^2)$. Moreover, the DPCP-PSGM algorithm mentioned above provably converges to the global optimum of (1) at a piecewise linear rate provided its step size is tuned in a piecewise geometrically diminishing fashion.

Although the theoretical and algorithmic features of DPCP are appealing, they have only been established for the idealized case when the inliers lie perfectly in the subspace. Yet DPCP has proved to be competitive on noisy real datasets, so it is reasonable to ask whether similar theoretical guarantees hold when there is noise in the data. This work bridges that gap by making the following contributions.

- We provide a geometric analysis of global optimality for DPCP which reveals that global minimizers of (1) are perturbed away from the orthogonal complement $\mathcal{S}^\perp$ of the inlier subspace by an amount proportional to the noise level, while still tolerating $M = O(N^2)$ outliers.
- We prove that the DPCP-PSGM method, even in the presence of noise, converges to a neighborhood of $\mathcal{S}^\perp$ at a piecewise linear rate, if tuned properly.
- Connections are drawn to the least absolute deviations subspace learning literature; in particular, we establish the equivalence between DPCP-PSGM and the GGD method of Maunu et al. (2019).
- An experiment on road plane detection with real 3D data further strengthens the view that DPCP is superior to RANSAC in the high relative dimension setting.

2. Global Optimality for Noisy DPCP

2.1. Background and motivation

Consider a unit $\ell_2$-norm dataset $\tilde{\mathcal{X}}_E = [X + E \;\; O]\,\Gamma$, where $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ are inlier points spanning a single $d$-dimensional underlying subspace $\mathcal{S}$ of $\mathbb{R}^D$, $E = [\epsilon_1, \dots, \epsilon_N] \in \mathbb{R}^{D \times N}$ consists of additive noise on the inlier points, $O = [o_1, \dots, o_M] \in \mathbb{R}^{D \times M}$ are outlier points, and $\Gamma$ is an unknown permutation. Our goal is to estimate the underlying subspace $\mathcal{S}$ from $\tilde{\mathcal{X}}_E$. When there is no noise (i.e., $E = 0$) and the points are in general position, the vectors $b$ that make $\tilde{\mathcal{X}}_E^\top b$ as sparse as possible are precisely those satisfying $b \in \mathcal{S}^\perp$; this is the motivation for (1). Therefore, in the noisy case we expect $\tilde{\mathcal{X}}_E^\top b$ to be close to a sparse vector $y$ in the Euclidean sense whenever $b$ is close to a normal vector of $\mathcal{S}$. This motivates the following optimization problem (Tsakiris & Vidal, 2018a)¹:

$$\min_{b \in \mathbb{S}^{D-1},\, y \in \mathbb{R}^{N+M}} \ \tau \|y\|_1 + \tfrac{1}{2}\|y - \tilde{\mathcal{X}}_E^\top b\|_2^2, \tag{2}$$

for some $\tau > 0$, where $\mathbb{S}^{D-1} := \{b \in \mathbb{R}^D : \|b\|_2 = 1\}$. As expected, the performance of (2) depends crucially on the parameter $\tau$. An alternative is to directly adopt (1):

$$\min_{b \in \mathbb{S}^{D-1}} \ \|\tilde{\mathcal{X}}_E^\top b\|_1. \tag{3}$$

We use synthetic experiments to compare (2) and (3), with data generated according to the following random model.

Definition 1 (Random spherical model). Consider a random spherical model in which the columns of $O$ are drawn independently and uniformly from the sphere $\mathbb{S}^{D-1}$, and the columns of the noisy inliers $X + E$ are obtained by drawing inliers uniformly from $\mathcal{S} \cap \mathbb{S}^{D-1}$, adding i.i.d. $\mathcal{N}(0, \tfrac{\sigma^2}{D} I_D)$-distributed noise, and normalizing the resulting points to have unit $\ell_2$-norm. Here $d = \dim \mathcal{S}$, $P_{\mathcal{S}}$ is the orthoprojector onto $\mathcal{S}$, and $\sigma$ is the standard deviation of the noise; under this model the SNR is $\mathbb{E}[\|X\|_F]/\mathbb{E}[\|E\|_F] = 1/\sigma$.

Fig. 1 shows the numerical comparison between (2) and (3). We solve (2) by alternating minimization, which empirically converges fast even though it has no known convergence theory. We use DPCP-PSGM (Algorithm 1), whose convergence analysis is discussed in Section 3, to solve (3). Fig. 1a shows that even when $\tau$ is chosen optimally, (2) and (3) perform comparably. Fig. 1b shows that the formulation (3), which does not depend on any hyperparameter, is robust to noise, whereas the solution of (2) is sensitive to $\tau$. We have observed similar phenomena for different $D$, $d$, $M$, $N$, and $\sigma$. Based on this evidence, in this paper we focus on (3).

Algorithm 1 (DPCP-PSGM for (3)):
1. Input: data $\tilde{\mathcal{X}}_E \in \mathbb{R}^{D \times (N+M)}$ and initial step size $\mu_0$.
2. Initialization: set $\hat{b}_0 \leftarrow \arg\min_{b \in \mathbb{S}^{D-1}} \|\tilde{\mathcal{X}}_E^\top b\|_2$.
3. For $k = 1, 2, \dots$:
4. Compute a subgradient: $g_{k-1} \leftarrow \tilde{\mathcal{X}}_E\, \mathrm{sign}(\tilde{\mathcal{X}}_E^\top \hat{b}_{k-1})$.
5. Update the step size $\mu_k$ according to a certain rule.
6. Compute the next iterate: $b_k \leftarrow \hat{b}_{k-1} - \mu_k g_{k-1}$.
7. Project $b_k$ onto the unit sphere: $\hat{b}_k \leftarrow b_k / \|b_k\|_2$.
8. End for.

¹Problem (2) has also appeared in the context of dictionary learning; see Qu et al. (2014).

Figure 1. Comparison between computed solutions of (2) (the "denoised" formulation) and (3) in terms of their principal angle $\theta$ from $\mathcal{S}^\perp$: (a) fixed noise level $\sigma = 0.05$; (b) varying $\sigma \in \{0.005, 0.015, 0.03\}$. Here we fix $D = 30$, $d = 29$, $N = 500$, and outlier ratio $M/(M+N) = 0.7$.
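As a point of reference for these comparisons, the following is a minimal NumPy sketch of Algorithm 1 together with data drawn roughly as in Definition 1. The geometric step-size decay used here is only a simple placeholder for the "certain rule" in line 5 (the schedule actually analyzed in Section 3 is the piecewise geometrically diminishing one in (22)), and the function names, default parameters, and noise convention are illustrative.

```python
import numpy as np

def dpcp_psgm(X_tilde, mu0=1e-2, beta=0.9, iters=200):
    """Projected subgradient method for min_b ||X_tilde^T b||_1 s.t. ||b||_2 = 1."""
    # Spectral initialization: b0 = argmin_{||b||=1} ||X_tilde^T b||_2, i.e. the
    # left singular vector of X_tilde with the smallest singular value.
    U, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    b = U[:, -1]
    mu = mu0
    for _ in range(iters):
        g = X_tilde @ np.sign(X_tilde.T @ b)   # subgradient of the l1 objective
        b = b - mu * g
        b = b / np.linalg.norm(b)              # project back onto the sphere
        mu = beta * mu                         # simple diminishing step size
    return b

def spherical_data(D=30, d=29, N=500, M=1000, sigma=0.05, seed=0):
    """Data generated roughly as in Definition 1 (illustrative noise convention)."""
    rng = np.random.default_rng(seed)
    basis = np.linalg.qr(rng.standard_normal((D, d)))[0]   # orthonormal basis of S
    X = basis @ rng.standard_normal((d, N))                # inliers in S
    X /= np.linalg.norm(X, axis=0)
    E = sigma / np.sqrt(D) * rng.standard_normal((D, N))   # additive noise
    XE = (X + E) / np.linalg.norm(X + E, axis=0)           # re-normalize inliers
    O = rng.standard_normal((D, M))
    O /= np.linalg.norm(O, axis=0)                         # outliers on the sphere
    return np.hstack([XE, O]), basis

X_tilde, basis = spherical_data()
b_hat = dpcp_psgm(X_tilde)
# Principal angle of b_hat from S^perp: sin(theta) = ||P_S b_hat||_2.
theta = np.arcsin(np.linalg.norm(basis.T @ b_hat))
```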
2.2. Geometric quantities and their concentrations

We aim to provide a global optimality analysis for (3). Since the objective in (3) is not continuously differentiable, we need to work with its subdifferential. Denote the sign function by $\mathrm{sign}(a) = a/|a|$ when $a \neq 0$ and $\mathrm{sign}(0) = 0$, and denote the subdifferential of the absolute value function $|a|$ by $\mathrm{Sgn}(a) = \{\mathrm{sign}(a)\}$ when $a \neq 0$ and $\mathrm{Sgn}(0) = [-1, 1]$. We apply $\mathrm{sign}$ and $\mathrm{Sgn}$ element-wise to vectors.

To analyze (3), first note that any global solution $b^*$ of (3) must be a critical point, i.e., there exists $v \in \partial \|\tilde{\mathcal{X}}_E^\top b^*\|_1$ such that $(I - b^* b^{*\top}) v = 0$, where

$$\partial \|\tilde{\mathcal{X}}_E^\top b\|_1 = (X + E)\, \mathrm{Sgn}\big((X + E)^\top b\big) + O\, \mathrm{Sgn}(O^\top b). \tag{4}$$

When noise is not present (i.e., $E = 0$), the term $\mathrm{Sgn}\big((X + E)^\top b\big) = \mathrm{Sgn}(X^\top b)$ is simple since it only involves the inliers. In the noisy case, however, this term is much harder to handle. For example, since the function $\mathrm{sign}$ is discontinuous, $\mathrm{Sgn}\big((X + E)^\top b\big)$ cannot easily be separated into two parts, one involving only the inliers and the other only the noise. As a consequence, a significantly more technical analysis is required to account for the effect of noise.

We now introduce several helpful geometric quantities. We first characterize the maximum norm of a Riemannian subgradient of $\tfrac{1}{M}\|O^\top b\|_1$:

$$\eta_O := \frac{1}{M} \max_{b \in \mathbb{S}^{D-1}} \big\|(I - b b^\top)\, O\, \mathrm{sign}(O^\top b)\big\|_2. \tag{5}$$

As it turns out, $\eta_O$ characterizes how well the outliers are distributed in the ambient space: more uniformly distributed outliers lead to a smaller value of $\eta_O$ (Zhu et al., 2018a). To facilitate the analysis, we decompose the noise as $E = E_s + E_n$, where $E_s$ is the projection of the noise onto $\mathcal{S}$ and $E_n$ is the projection of the noise onto $\mathcal{S}^\perp$. Denote $\hat{X} := X + E_s$ and $\hat{E} := E_n$, so that the columns of $\hat{X}$ lie in $\mathcal{S}$ and the columns of $\hat{E}$ lie in $\mathcal{S}^\perp$. $\hat{X}$ can be viewed as effective inliers since its columns lie in $\mathcal{S}$, whereas $\hat{E}$ can be interpreted as effective noise because it perturbs $\hat{X}$ away from $\mathcal{S}$. Define the permeance statistic (Lerman et al., 2015)

$$c_{\hat{X},\min} := \frac{1}{N} \min_{b \in \mathcal{S} \cap \mathbb{S}^{D-1}} \|\hat{X}^\top b\|_1, \tag{6}$$

which attains larger values for better distributed inliers. We capture the effective noise $\hat{E}$ via the quantity

$$c_{\hat{E},\max} := \frac{1}{N} \max_{b \in \mathcal{S}^\perp \cap \mathbb{S}^{D-1}} \|\hat{E}^\top b\|_1. \tag{7}$$

This is closely related to the total inlier residual $\mathcal{R}(\mathcal{S}) := \tfrac{1}{N} \sum_{j=1}^N \|\hat{\epsilon}_j\|_2$ used by Lerman et al. (2015) to measure the level of the effective noise. By the Cauchy–Schwarz inequality $|\hat{\epsilon}_j^\top b| \leq \|\hat{\epsilon}_j\|_2 \|b\|_2$ and $\|b\|_2 = 1$, it is clear that $c_{\hat{E},\max} \leq \mathcal{R}(\mathcal{S})$. Indeed, $\mathcal{R}(\mathcal{S})$ only depends on the energy of $\hat{E}$, whereas $c_{\hat{E},\max}$ also depends on the distribution of $\hat{E}$: the more uniformly distributed $\hat{E}$ is in $\mathcal{S}^\perp$, the smaller $c_{\hat{E},\max}$ is. Thus, $c_{\hat{E},\max}$ leads to a tighter result in our analysis than using $\mathcal{R}(\mathcal{S})$ would. Note that $c_{\hat{X},\min}$ involves a mixture of the inliers and the components of the noise projected onto $\mathcal{S}$. This particular integration of inliers and noise leads to tighter bounds in the deterministic phase of our analysis, which in turn is advantageous in the subsequent probabilistic analysis (Lerman et al., 2015; Tsakiris & Vidal, 2018b).

We recall the probabilistic result for $\eta_O$ from Zhu et al. (2018a;b) and provide new bounds for $c_{\hat{X},\min}$ and $c_{\hat{E},\max}$. Define $\delta : [0, 1) \to \mathbb{R}$ and $\rho : [0, 1) \to \mathbb{R}$ as

$$\delta(\sigma) := \sigma + \sqrt{(1 - \sigma)\, F_{d, D-d}(\sigma)}, \tag{8}$$

$$\rho(\sigma) := (1 - \sigma)\, F_{D-d, d}(1/\sigma), \tag{9}$$

where $F_{d_1, d_2}(\cdot)$ is the cumulative distribution function (CDF) of the F-distribution, with $F_{d_1, d_2}(0) = 0$ and $F_{d_1, d_2}(\infty) = 1$. Expanding the CDFs, we have $\delta(\sigma) = O(\sigma^{d/4} + \sigma)$ and $\rho(\sigma) = 1 - O(\sigma + \sigma^{d/2})$.
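The quantities above are simple to probe numerically. The sketch below evaluates $\delta$ and $\rho$ exactly as written in (8)–(9) (assuming that argument order for the F-distribution CDF) and approximates (5)–(7) by sampling random unit vectors; since the true quantities are extrema over spheres, the sampled minimum over-estimates (6) and the sampled maxima under-estimate (5) and (7). All names are illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist

def delta_rho(sigma, D, d):
    """delta(sigma), rho(sigma) as in (8)-(9), for sigma in (0, 1)."""
    delta = sigma + np.sqrt((1 - sigma) * f_dist.cdf(sigma, d, D - d))
    rho = (1 - sigma) * f_dist.cdf(1.0 / sigma, D - d, d)
    return delta, rho

def mc_quantities(X_hat, E_hat, O, basis_S, basis_Sperp, samples=20000, seed=0):
    """Crude Monte Carlo estimates of eta_O (5), c_Xhat_min (6), c_Ehat_max (7)."""
    rng = np.random.default_rng(seed)
    N, M, D = X_hat.shape[1], O.shape[1], O.shape[0]

    # eta_O: max over unit b of the Riemannian subgradient norm of (1/M)||O^T b||_1.
    B = rng.standard_normal((D, samples))
    B /= np.linalg.norm(B, axis=0)
    G = O @ np.sign(O.T @ B)
    G -= B * np.sum(B * G, axis=0)                 # remove the component along b
    eta_O = np.max(np.linalg.norm(G, axis=0)) / M

    # c_Xhat_min: min over unit b in S; c_Ehat_max: max over unit b in S^perp.
    Bs = basis_S @ rng.standard_normal((basis_S.shape[1], samples))
    Bs /= np.linalg.norm(Bs, axis=0)
    c_Xhat_min = np.min(np.sum(np.abs(X_hat.T @ Bs), axis=0)) / N
    Bp = basis_Sperp @ rng.standard_normal((basis_Sperp.shape[1], samples))
    Bp /= np.linalg.norm(Bp, axis=0)
    c_Ehat_max = np.max(np.sum(np.abs(E_hat.T @ Bp), axis=0)) / N
    return eta_O, c_Xhat_min, c_Ehat_max
```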
We now state our first result.

Lemma 1. Consider the random spherical model in Definition 1 and let $\sigma < 1$. For any $t > 0$ there exists a universal constant $C$, independent of $M$, $N$, $D$, $d$, $t$, and $\sigma$, such that each of the following bounds holds with probability at least $1 - 2e^{-t^2}$:

$$\eta_O \leq C\big(\sqrt{D \log D} + t\big)/\sqrt{M},$$

$$c_{\hat{X},\min} \geq \sqrt{\tfrac{2}{\pi d}}\, \rho(\sigma) - (2 + t/2)/\sqrt{N},$$

$$c_{\hat{E},\max} \leq (1 + 2/\sqrt{N})\, \delta(\sigma) + \tfrac{t}{2\sqrt{N}}. \tag{10}$$

Although our result for $c_{\hat{X},\min}$ reduces to the one for $c_{X,\min}$ in Zhu et al. (2018a) when $E = 0$, the concentration argument for $c_{\hat{X},\min}$ is more involved, since in the noisy case and under the above spherical statistical model the columns of $\hat{X}$ lie inside the unit sphere rather than on it.

Figure 2. Plots of $R_{O/\hat{X}}$ and $R_{\hat{E}/\hat{X}}$ as a function of (a) $\sigma$ and (b) the outlier ratio. Here we fix $D = 30$, $d = 29$, $N = 1500$, and $M/(M+N) = 0.7$ in (a) and $\sigma = 0.05$ in (b).

Since $\delta(\sigma) = O(\sigma^{d/4} + \sigma)$, (10) essentially implies that $c_{\hat{E},\max} = O(\sigma^{d/4} + \sigma)$ with high probability.² Two more definitions are needed for our analysis:

$$R_{O/\hat{X}} := \frac{M \bar{\eta}_O}{N c_{\hat{X},\min}}, \qquad R_{\hat{E}/\hat{X}} := \frac{c_{\hat{E},\max}}{c_{\hat{X},\min}}, \tag{11}$$

where $\bar{\eta}_O := \eta_O + D/M$. The quantities $R_{O/\hat{X}}$ and $R_{\hat{E}/\hat{X}}$ can simply be viewed as outlier-to-inlier and noise-to-inlier ratios, respectively. When we fix the inliers and the outliers, $R_{\hat{E}/\hat{X}}$ is proportional to the noise level (see Fig. 2a). Similarly, when we fix the inliers and the noise level, $R_{O/\hat{X}}$ is proportional to the number of outliers (see Fig. 2b).

²We note that the term $\tfrac{t}{2\sqrt{N}}$ in (10) may be improved to a quantity proportional to $\sigma$ using a more sophisticated analysis.

2.3. Geometry of the critical points

For the rest of the analysis, let $\theta \in [0, \pi/2]$ be the principal angle of a vector $b \in \mathbb{S}^{D-1}$ from the orthogonal complement $\mathcal{S}^\perp$. Thus, $b$ is normal to $\mathcal{S}$ if and only if $\theta = 0$. Using $R_{O/\hat{X}}$ and $R_{\hat{E}/\hat{X}}$ defined in (11), we can now characterize the geometry of the critical points of (3).

Lemma 2. Assume $R_{O/\hat{X}} < 1$ and

$$\frac{32\, R_{\hat{E}/\hat{X}}}{\Big(\sqrt{R_{O/\hat{X}}^2 + 8} - 3 R_{O/\hat{X}}\Big)\Big(\sqrt{R_{O/\hat{X}}^2 + 8} + R_{O/\hat{X}}\Big)} < 1. \tag{12}$$

Then any critical point $b$ of problem (3) must have its principal angle $\theta$ from $\mathcal{S}^\perp$ satisfy

$$\theta \leq \sin^{-1}(t_1) \quad \text{or} \quad \theta \geq \sin^{-1}(t_2), \tag{13}$$

where $0 \leq t_1 \leq t_2 \leq 1$ are the two nonnegative roots of the quartic equation

$$t^4 + \big(R_{O/\hat{X}}^2 - 1\big)\, t^2 + 4 R_{O/\hat{X}} R_{\hat{E}/\hat{X}}\, t + 4 R_{\hat{E}/\hat{X}}^2 = 0. \tag{14}$$

First note that $R_{O/\hat{X}} < 1$ ensures that the denominator of the left-hand side of (12) is positive. Since the function $a \mapsto f(a) = \big(\sqrt{a^2 + 8} - 3a\big)\big(\sqrt{a^2 + 8} + a\big)$ is decreasing on $[0, 1]$ with $f(0) = 8$ and $f(1) = 0$, (12) implies that larger noise levels lead to smaller numbers of outliers that DPCP can tolerate. Under (12), it can be shown that (14) has two nonnegative roots $0 \leq t_1 \leq t_2 \leq 1$, and (13) implies that none of the critical points have principal angle in $(\sin^{-1}(t_1), \sin^{-1}(t_2))$.

Figure 3. Plot of (a) $t_1$ and (b) $t_2$ when varying $R_{O/\hat{X}}$ and $R_{\hat{E}/\hat{X}}$. In each plot, condition (12) holds only in the area below the curve, which corresponds to valid pairs $(R_{O/\hat{X}}, R_{\hat{E}/\hat{X}})$.

Fig. 3 displays $t_1$ and $t_2$ while varying $R_{O/\hat{X}}$ and $R_{\hat{E}/\hat{X}}$ under condition (12). One can observe that smaller percentages of outliers and smaller noise levels lead to $t_1$ being closer to 0 and $t_2$ being closer to 1, which means that critical points of (3) either lie in a neighborhood of $\mathcal{S}^\perp$ or very close to $\mathcal{S}$. The following bound helps in further interpreting Lemma 2:

$$t_1 \leq 25\, R_{\hat{E}/\hat{X}} \,/\, (1 - R_{O/\hat{X}})^2. \tag{15}$$

In particular, this means that $t_1$ is close to 0 when $R_{\hat{E}/\hat{X}}$ and $R_{O/\hat{X}}$ are small.
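Because $t_1$ and $t_2$ are defined only implicitly by (14), it is convenient to extract them numerically. The following sketch does so with `numpy.roots`, assuming (12) holds so that both roots in $[0, 1]$ exist; the example values are illustrative.

```python
import numpy as np

def quartic_roots_t1_t2(R_O, R_E):
    """Roots in [0, 1] of (14): t^4 + (R_O^2 - 1) t^2 + 4 R_O R_E t + 4 R_E^2 = 0.

    Assumes condition (12) holds, so that two such roots exist.
    """
    coeffs = [1.0, 0.0, R_O**2 - 1.0, 4.0 * R_O * R_E, 4.0 * R_E**2]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    valid = np.sort(real[(real >= -1e-12) & (real <= 1.0 + 1e-12)])
    return valid[0], valid[-1]          # t1 (smallest) and t2 (largest)

# Example: small noise-to-inlier ratio, moderate outlier-to-inlier ratio.
t1, t2 = quartic_roots_t1_t2(R_O=0.5, R_E=0.02)
```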
More generally, for fixed $O$ and $\hat{X}$, (15) guarantees that $t_1$ is perturbed away from 0 by at most an amount proportional to the effective noise level, which makes sense intuitively. When there is no noise ($E = 0$), Lemma 2 reduces to Lemma 1 of Zhu et al. (2018a): $R_{\hat{E}/\hat{X}} = 0$ and $R_{O/\hat{X}} = M\bar{\eta}_O/(N c_{X,\min})$, so that (12) always holds and (14) becomes $t^4 + (R_{O/\hat{X}}^2 - 1) t^2 = 0$, which implies $t_1 = 0$ and $t_2 = \sqrt{1 - R_{O/\hat{X}}^2}$. Nevertheless, we stress that the proof of Lemma 2 is far more involved than in the noiseless case, partly because of the need to deal with $\mathrm{Sgn}\big((X+E)^\top b\big)$, as per the discussion right after (4).

2.4. Global optimality

Before characterizing the global solutions of (3), we recall two outlier-based quantities,

$$c_{O,\min} := \frac{1}{M} \min_{b \in \mathbb{S}^{D-1}} \|O^\top b\|_1, \qquad c_{O,\max} := \frac{1}{M} \max_{b \in \mathbb{S}^{D-1}} \|O^\top b\|_1,$$

which are already used by Zhu et al. (2018a;b); their difference scales as $O(1/\sqrt{M})$, and better distributed outliers lead to smaller values of $c_{O,\max} - c_{O,\min}$. The next result gives a condition under which global solutions of (3) lie in a neighborhood of $\mathcal{S}^\perp$.

Theorem 1. If $R_{O/\hat{X}} < 1$, (12) holds, and

$$\frac{M\,(c_{O,\max} - c_{O,\min})}{N\, c_{\hat{X},\min}} < t_2 - 2 R_{\hat{E}/\hat{X}}, \tag{16}$$

then any global minimizer $b^*$ of (3) must have its principal angle $\theta^*$ from $\mathcal{S}^\perp$ satisfy

$$\theta^* \leq \sin^{-1}(t_1), \tag{17}$$

where $0 \leq t_1 \leq t_2 \leq 1$ are the nonnegative roots of (14).

Theorem 1 builds upon Lemma 2, with the intuition that critical points close to the subspace $\mathcal{S}$ (i.e., those with $\theta \geq \sin^{-1}(t_2)$) cannot be global minimizers, as they yield large objective values. As long as the data points are well distributed (small $c_{O,\max} - c_{O,\min}$, large $c_{\hat{X},\min}$, large $t_2$) and the effective noise is mild (small $c_{\hat{E},\max}$), (16) is satisfied and global minimizers must be close to $\mathcal{S}^\perp$. When $E = 0$, we have already remarked that $t_1 = 0$ and $t_2 = \sqrt{1 - R_{O/\hat{X}}^2}$, which together with (16) and (17) implies that global minimizers are orthogonal to $\mathcal{S}$ whenever

$$\frac{M\,(c_{O,\max} - c_{O,\min})}{N\, c_{X,\min}} < \sqrt{1 - R_{O/\hat{X}}^2},$$

which is precisely Theorem 1 of Zhu et al. (2018a). We further interpret the above global optimality result via the following probabilistic characterization.

Theorem 2. Consider the random spherical model of Definition 1 and let $\sigma < 1$. If $0 < t < 2\sqrt{\tfrac{2}{\pi d}}\,\rho(\sigma)\sqrt{N}$, then with probability at least $1 - 10 e^{-t^2/2}$, any global solution $b^*$ of (3) must have its principal angle $\theta^*$ from $\mathcal{S}^\perp$ satisfy

$$\sin(\theta^*) \leq \frac{C_1\, \delta(\sigma) + \tfrac{t}{2\sqrt{N}}}{2\sqrt{\tfrac{2}{\pi d}}\, \rho(\sigma)}, \tag{18}$$

provided that

$$C_2\, \frac{t\sqrt{D M \log D}}{N} + C_3\, \frac{\big(\sqrt{D \log D} + t\big)^2}{N} \leq \sqrt{\tfrac{2}{\pi d}}\, \rho(\sigma) - C_4\, \delta(\sigma) - \frac{4 + 3t}{\sqrt{N}}, \tag{19}$$

where $C_1, C_2, C_3, C_4$ are universal constants that are independent of $N$, $M$, $D$, $d$, $t$, and $\sigma$.

The effect of the noise in perturbing the global solution away from $\mathcal{S}^\perp$ is captured by (18), whose right-hand side (RHS) approaches 0 as $\sigma \to 0$, except for the small term $\tfrac{t}{2\sqrt{N}}$, which, as stated after Lemma 1, can be improved to a quantity proportional to $\sigma$. Moreover, (18) together with $\delta(\sigma) = O(\sigma^{d/4} + \sigma)$ and $\rho(\sigma) = 1 - O(\sigma + \sigma^{d/2})$ implies that $\sin(\theta^*) = O(\sqrt{\sigma} + \sigma^{d/4})$ when $\sigma$ is small. The condition (19) suggests that, unlike existing state-of-the-art $O(N)$ outlier bounds (Lerman et al., 2015; Maunu et al., 2019), DPCP can tolerate $O(N^2)$ outliers even for noisy data. Fig. 4 verifies this point by plotting $\sin(\theta^*)$, with $b^*$ computed via Algorithm 1.

Figure 4. Plot of $\sin(\theta^*)$, where $\theta^*$ is the principal angle between the computed solution $b^*$ of the DPCP problem (3) and $\mathcal{S}^\perp$, when varying $N$ and $M$ for noise level (a) $\sigma = 0$ and (b) $\sigma = 0.1$. Here $D = 30$ and $d = 29$.

3. Convergence Analysis of Noisy DPCP-PSGM

The convergence of DPCP-PSGM (Algorithm 1) has been analyzed by Zhu et al. (2018a;b) in the absence of noise.
Their main finding is that selecting the step size in a piecewise geometrically diminishing fashion guarantees piecewise linear convergence to a vector orthogonal to $\mathcal{S}$. In the noisy case, one can only expect Algorithm 1 to converge to a neighborhood of $\mathcal{S}^\perp$. A significantly more involved analysis yields the following convergence result.

Theorem 3 (Piecewise linear convergence). Suppose that

$$N c_{\hat{X},\min} - N \eta_{\hat{X},\hat{E}} - \tfrac{5}{2} N c_{\hat{E},\max} - 7 \sqrt{N}\, \|\hat{E}\|_2 > M \eta_O, \tag{20}$$

where $\eta_{\hat{X},\hat{E}} := \tfrac{1}{N} \sup_{(c, b, g) \in \mathcal{F}} \big\|(I - b b^\top)\, \hat{X}\, \mathrm{sign}(\hat{X}^\top b + c\, \hat{E}^\top g)\big\|_2$ with $\mathcal{F} := \{(c, b, g) : c \in [0, \infty),\ g \in \mathcal{S}^\perp \cap \mathbb{S}^{D-1},\ b \in \mathcal{S} \cap \mathbb{S}^{D-1}\}$.³ Let $\{\hat{b}_k\}$ be the sequence generated by Algorithm 1 with noisy data and an initialization $\hat{b}_0$ whose principal angle from the orthogonal complement $\mathcal{S}^\perp$ satisfies

$$\theta_0 < \tan^{-1}\!\big(\kappa_{\tilde{\mathcal{X}}_E} / \nu_{\tilde{\mathcal{X}}_E}\big), \tag{21}$$

where $\kappa_{\tilde{\mathcal{X}}_E} := N c_{\hat{X},\min} - N c_{\hat{E},\max} - \sqrt{N}\|\hat{E}\|_2$ and $\nu_{\tilde{\mathcal{X}}_E} := N \eta_{\hat{X},\hat{E}} + M \eta_O + \sqrt{N}\|\hat{E}\|_2$. Consider the piecewise geometrically diminishing step size

$$\mu_k = \begin{cases} \mu_0, & k < K_0, \\ \mu_0\, \beta^{\lfloor (k - K_0)/K \rfloor + 1}, & k \geq K_0, \end{cases} \tag{22}$$

where $\mu_0 \leq \mu^* := 1 / \big(16 \max\{N c_{\hat{X},\max},\, M c_{O,\max},\, \sqrt{N}\|\hat{X}\|_2\}\big)$, $\beta < 1$, $\lfloor \cdot \rfloor$ is the floor function, $K \geq 2 / \big(\beta \mu^* (N c_{\hat{X},\min} - \nu_{\tilde{\mathcal{X}}_E})\big)$, and $K_0 \in \mathbb{N}$ is chosen sufficiently large relative to $\tan(\theta_0)$ and the quantity $\min\big\{\kappa_{\tilde{\mathcal{X}}_E} - \tan(\theta_0)\, \nu_{\tilde{\mathcal{X}}_E},\ \tfrac{1}{6}(N c_{\hat{X},\min} - \nu_{\tilde{\mathcal{X}}_E})\big\} > 0$. Then the principal angle $\theta_k$ of $\hat{b}_k$ from $\mathcal{S}^\perp$ satisfies

$$\tan(\theta_k) \leq \frac{\mu_0}{\mu^*}\, \beta^{\lfloor (k - K_0)/K \rfloor} + \tan(\theta^*) \ \text{ for } k \geq K_0, \qquad \tan(\theta_k) \leq \max\Big\{\tan(\theta_0),\ \frac{\mu_0}{\mu^*} + \tan(\theta^*)\Big\} \ \text{ for } k < K_0, \tag{23}$$

with

$$\theta^* := \tan^{-1}\!\bigg(\frac{2\big(N c_{\hat{E},\max} + \sqrt{N}\|\hat{E}\|_2\big)}{N c_{\hat{X},\min} - \nu_{\tilde{\mathcal{X}}_E}}\bigg). \tag{24}$$

³$\eta_{\hat{X},\hat{E}} = O(1/\sqrt{N})$ for the random spherical model.

Figure 5. Convergence of DPCP-PSGM in (a) the noiseless case and (b) the noisy case with $\sigma = 10^{-6}$. Here $\theta_k$ is the principal angle of the iterate $\hat{b}_k$ from $\mathcal{S}^\perp$; the red dotted line represents the upper bound on $\tan(\theta_k)$ given by (23), while the green dashed line indicates the step size (22).

Under the random spherical model and for small noise levels, the main hypothesis (20) of Theorem 3 is satisfied as long as there are at most $M = O(N^2)$ outliers. In that regime, Theorem 3 guarantees that DPCP-PSGM converges to a neighborhood of $\mathcal{S}^\perp$ provided the data points are well distributed (small $\eta_{\hat{X},\hat{E}}$, small $\eta_O$, large $c_{\hat{X},\min}$) and the effective noise is mild (small $c_{\hat{E},\max}$ and $\|\hat{E}\|_2$): (23) implies that the principal angle $\theta_k$ decays at a piecewise linear rate until it reaches $\theta^*$, which reflects the noise effect; see Fig. 5b. Moreover, larger noise levels lead to larger $c_{\hat{E},\max}$ and $\|\hat{E}\|_2$, and thus to a larger $\theta^*$. If no noise is present, $\theta^* = 0$ and Theorem 3 is consistent with Zhu et al. (2018a); see Fig. 5a. We note that, for the sake of interpretability, the final angle in (24) has intentionally been made looser than the analytical bound in (17); this causes no harm since it can be shown that both angles scale as $\delta(\sigma)$. Finally, the spectral initialization in Algorithm 1 can be shown to satisfy (21) subject to some further mild conditions on the data.
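For reference, the step-size schedule (22) is straightforward to implement. The sketch below is a direct transcription of (22); it does not check the admissibility conditions on $\mu_0$, $\beta$, $K_0$, and $K$ required by Theorem 3, and the example values are illustrative only.

```python
import math

def piecewise_geometric_step(k, mu0, beta, K0, K):
    """Step size mu_k from (22): constant for k < K0, then shrunk by a
    factor beta once every K iterations."""
    if k < K0:
        return mu0
    return mu0 * beta ** (math.floor((k - K0) / K) + 1)

# Example usage with illustrative constants:
mus = [piecewise_geometric_step(k, mu0=1e-2, beta=0.5, K0=100, K=50) for k in range(300)]
```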
4. Comparison with state-of-the-art

As noted in Section 1, DPCP is very closely related to least absolute deviations subspace learning methods. Two important representatives of that class are REAPER (Lerman et al., 2015) and the Geodesic Gradient Descent (GGD) method of Maunu et al. (2019). For the case of a hyperplane, GGD attempts to solve the same optimization problem as DPCP, while REAPER solves a convex relaxation of it. In this section we compare the results of this paper to those known for REAPER and GGD. We show that the global optimality conditions for DPCP given in the present paper are much looser than those required for REAPER (Section 4.1). In fact, they are an improvement even over the conditions enabling the local stability characterization of the function landscape given by Maunu et al. (2019) (Section 4.2). Finally, we show that for a suitable tuning of the step size, GGD is equivalent to DPCP-PSGM (Section 4.3).

4.1. Comparison with REAPER (Lerman et al., 2015)

Theorem 2.1 of Lerman et al. (2015) asserts that any global minimizer of the REAPER problem must satisfy

$$\sin(\theta^*) \leq \frac{2 N \mathcal{R}(\mathcal{S})}{\Big[\tfrac{N}{4\sqrt{d}}\, c_{\hat{X},\min} - M \mathcal{A}(\mathcal{S}) - N \mathcal{R}(\mathcal{S})\Big]_+}, \tag{25}$$

where $\mathcal{R}(\mathcal{S}) := \tfrac{1}{N}\sum_{j=1}^N \|\hat{\epsilon}_j\|_2 \geq c_{\hat{E},\max}$ is the total inlier residual, $\mathcal{A}(\mathcal{S}) := \tfrac{1}{M}\|\overline{O}\|_2\, \|P_{\mathcal{S}} \overline{O}\|_2 \geq 0$ is an alignment statistic that measures the amount of linear structure in the outliers, and $[\alpha]_+ = \alpha$ if $\alpha > 0$ and $0$ otherwise. Here $P_{\mathcal{S}}$ is the orthoprojection onto $\mathcal{S}$ and the overline (spherization) operator normalizes the columns of a matrix. Note that (25) is meaningful only when

$$\frac{M \mathcal{A}(\mathcal{S})}{N c_{\hat{X},\min}} < \frac{1}{4\sqrt{d}} - R_{\hat{E}/\hat{X}}. \tag{26}$$

We compare the necessary condition (26) for REAPER to (12) and (16) for DPCP. When there are no outliers, (26) requires $R_{\hat{E}/\hat{X}} < \tfrac{1}{4\sqrt{d}}$. By contrast, (12) only requires $R_{\hat{E}/\hat{X}} < \tfrac{1}{4}$ (see Fig. 3). More generally, in the presence of outliers, $M \mathcal{A}(\mathcal{S})$ (the numerator in the LHS of (26)) scales as $O(M)$ in a random model (Lerman et al., 2015), whereas the quantity $M(c_{O,\max} - c_{O,\min})$ (the numerator in the LHS of (16)) scales as $O(\sqrt{M})$; see Zhu et al. (2018b). Numerically, this is captured in Fig. 6, in which we observe that (16) is satisfied for a much larger range of outlier ratios and noise levels.

Figure 6. Check of whether (a) condition (16) of this paper and (b) condition (26) of Lerman et al. (2015) are satisfied (white) or not (black) when varying the outlier ratio $M/(M+N)$ and $\sigma$. Here we fix $D = 30$, $d = 29$, and $N = 1000$.

Finally, note that $\mathcal{R}(\mathcal{S})$ appears both in the numerator and the denominator of the RHS of (25), which makes the entire upper bound blow up quickly as the noise level increases; see Fig. 7b. In contrast, according to Theorem 1 and (15), $\sin(\theta^*)$ is roughly proportional to the effective noise level (see Fig. 7a).

Figure 7. Evaluation of (a) $t_1$ (this paper) and (b) the RHS of (25) (Lerman et al., 2015) as functions of $\sigma$, for $d \in \{1, 29\}$ and various outlier ratios, with $D = 30$ and $N = 1500$. In (b), for $d = 29$ we only plot (25) for $M/(M+N) \in \{0, 0.01\}$, since (26) does not hold for even a mild outlier ratio.

4.2. Comparison with the local optimality conditions of Maunu et al. (2019)

Let $b$ be a critical point of (3). Then, given $0 < \eta < \gamma < \pi/2$ such that a certain stability condition holds, Theorem 2 of Maunu et al. (2019) asserts that

$$\text{either} \quad \theta < \eta \quad \text{or} \quad \theta > \gamma. \tag{27}$$

Note that a tighter analysis corresponds to a smaller $\eta$ (closer to 0) and a larger $\gamma$ (closer to $\pi/2$). To compare (27) and (13) numerically in a fair way, we manually set $\eta$ equal to $\arcsin(t_1)$ and compare $\arcsin(t_2)$ with $\gamma$. Fig. 8 shows the comparison between $\gamma$ and $\arcsin(t_2)$ under different percentages of outliers and noise levels. We observe that $\arcsin(t_2)$ is always larger than $\gamma$ by a significant amount, under the restriction that $\eta$ equals $\arcsin(t_1)$, suggesting that (13) is a tighter result than (27). Moreover, (27) is sensitive to the variation of the outliers, while (13) is rather stable (compare Fig. 8a to Fig. 8b).
Finally, we mention that the relationship between $\eta$ and $\gamma$ is not as clear as that between our $t_1$ and $t_2$, the latter being the two nonnegative roots of the univariate quartic in (14). In conclusion, we believe that Lemma 2 represents a theoretical and computational improvement over the important characterization of the critical points of (3) given previously by Maunu et al. (2019).

Figure 8. Comparison between the quantity $\gamma$ of Maunu et al. (2019) and $\arcsin(t_2)$ in the hyperplane case with outlier ratio (a) 0.1 and (b) 0.2. Here we fix $D = 30$, $d = 29$, and $N = 3000$.

4.3. Equivalence with GGD of Maunu et al. (2019)

We now relate DPCP-PSGM to the GGD of Maunu et al. (2019). Towards that end, we consider the hyperplane case and interpret GGD as finding a normal vector instead of a basis for the hyperplane. Rewriting the subgradient in Algorithm 1 as $g_{k-1} = \tilde{g}_{k-1} + \bar{g}_{k-1}$, where $\tilde{g}_{k-1} = (I - \hat{b}_{k-1}\hat{b}_{k-1}^\top)\, g_{k-1}$ is the Riemannian subgradient and $\bar{g}_{k-1} = \hat{b}_{k-1}\hat{b}_{k-1}^\top g_{k-1}$, GGD is the same as Algorithm 1 except that the iterate is updated by

$$\hat{b}_k' = \cos(\mu_k')\Big(\hat{b}_{k-1} - \tan(\mu_k')\, \tilde{g}_{k-1}/\|\tilde{g}_{k-1}\|_2\Big),$$

where $\mu_k'$ is the step size used in GGD. To relate $\hat{b}_k'$ to $\hat{b}_k$ in Algorithm 1, we first rewrite

$$b_k = \big(1 - \mu_k g_{k-1}^\top \hat{b}_{k-1}\big)\bigg(\hat{b}_{k-1} - \frac{\mu_k\, \tilde{g}_{k-1}}{1 - \mu_k g_{k-1}^\top \hat{b}_{k-1}}\bigg).$$

Noting that $\hat{b}_k$ is obtained by normalizing $b_k$, we have $\hat{b}_k' = \hat{b}_k$ if we set

$$\mu_k' = \tan^{-1}\!\bigg(\frac{\mu_k\, \|\tilde{g}_{k-1}\|_2}{1 - \mu_k g_{k-1}^\top \hat{b}_{k-1}}\bigg).$$

Proposition 1. For the hyperplane case, with a suitable choice of step size, GGD (Maunu et al., 2019) is equivalent to DPCP-PSGM.

Thus, the convergence guarantee in Theorem 3 can be directly applied to GGD under a suitable choice of step size. For the general case where the subspace has co-dimension larger than 1, we conjecture that the analysis in Theorem 3 will be helpful for the convergence analysis of GGD.

5. Road Plane Estimation Using Real 3D Data

We use the experimental setup of Zhu et al. (2018a) to further compare DPCP, RANSAC, and alternative methods on the task of 3D road plane detection. In this problem one is given a 3D point cloud of a road scene, and the goal is to learn an affine plane $\mathcal{A} = \mathcal{H} + t \subset \mathbb{R}^3$ as a model for the road; this is important in autonomous driving applications. Here $\mathcal{H}$ is a plane through the origin with normal vector $b$, and $t$ is its translation with respect to the origin, the latter being the center of the laser sensor. Hence the task is to estimate $b$ and $t$, which are taken to be co-linear in order to resolve the inherent ambiguity in estimating $t$. In turn, this can be converted to a linear subspace learning problem by working in homogeneous coordinates, i.e., by embedding $\mathcal{A}$ into the linear hyperplane $\tilde{\mathcal{H}} \subset \mathbb{R}^4$ with normal vector $\tilde{b} = [b^\top, -b^\top t]^\top$, through the mapping $x \mapsto [x^\top\ 1]^\top$.

We use the 3D point clouds from the KITTI dataset (Geiger et al., 2013). In addition to the 7 frames annotated in Zhu et al. (2018a), we further annotate 131 frames. Each point cloud contains around $10^5$ points with approximately 50% outliers. The data are homogenized and normalized to unit $\ell_2$-norm.
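In code, this embedding and the recovery of $(b, t)$ from an estimated homogeneous normal might look as follows. This is a sketch under the co-linearity convention $t = \alpha b$ described above (the estimated normal also carries a global sign ambiguity), not the exact post-processing used in the experiments below.

```python
import numpy as np

def homogenize(points):
    """Embed 3D points x (columns of a 3 x n array) via x -> [x; 1],
    then normalize each embedded point to unit l2-norm."""
    P = np.vstack([points, np.ones((1, points.shape[1]))])
    return P / np.linalg.norm(P, axis=0)

def plane_from_normal(b_tilde):
    """Recover (b, t) from an estimate of b_tilde ~ [b; -b^T t], with t = alpha*b.

    The recovered pair is only defined up to the overall sign of b_tilde.
    """
    v, s = b_tilde[:3], b_tilde[3]
    b = v / np.linalg.norm(v)
    alpha = -s / np.linalg.norm(v)     # signed distance of the plane from the origin
    return b, alpha * b
```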
We compare DPCP-PSGM (Algorithm 1), DPCP-IRLS, and DPCP-d (Tsakiris & Vidal, 2018a) with RANSAC, REAPER, and $\ell_{2,1}$-RPCA (Xu et al., 2010). We also include SVD, which computes $\tilde{b}$ as the bottom singular vector of the data. Since DPCP-PSGM and DPCP-d are among the fastest methods, with comparable running times, we let them run to convergence and set the running time of the slower of the two as the time budget for the remaining methods; for RANSAC, we also include versions with 10× and 100× that time budget. We update the step size in DPCP-PSGM via a modified backtracking line search, which is known to perform well in practice. We tune the parameters of the algorithms on a randomly selected training set of 13 frames and use the remaining frames for evaluation. Each method is tuned to achieve an optimal error and then re-tuned to be as fast as possible while staying within 5% of that error. The $\lambda$ of $\ell_{2,1}$-RPCA is set to $1.92/\sqrt{M}$, the $\tau$ of DPCP-d is set to $2.76/\sqrt{N+M}$, $\mu_{\min}$ for DPCP-PSGM is set to $10^{-9}$, and the relative convergence accuracy, wherever applicable, is set to $10^{-6}$.

Table 1 reports geometric, clustering, and algorithmic metrics for the various methods. Once a method has computed an estimated normal vector $\hat{\tilde{b}} \in \mathbb{R}^4$, we extract from it the estimates $\hat{b}$ and $\hat{t}$. We report the corresponding estimation errors, i.e., the angle $\hat{\tilde{\theta}}$ between $\tilde{b}^*$ and $\hat{\tilde{b}}$, the angle $\hat{\theta}$ between $b^*$ and $\hat{b}$, and $100\,\|t^* - \hat{t}\|_2 / \|t^*\|_2\,\%$, where $\tilde{b}^*$, $b^*$, $t^*$ are the ground-truth values. By varying a threshold on the distances of all points to the estimated affine plane, the area under the ROC curve is obtained,⁴ with higher values indicating better performance. Finally, the number of iterations executed by each method and its running time in milliseconds⁵ are also reported.

Table 1. 3D road plane estimation using 125 annotated frames of the KITTI dataset. Angles are in degrees; running time is in milliseconds.

| Method | ROC | $\hat{\tilde{\theta}}$ | $\hat{\theta}$ | $\hat{t}$ err. | iter. | time (ms) |
|---|---|---|---|---|---|---|
| SVD | 0.76 | 4.40 | 1.73 | 14% | N/A | 1 |
| RANSAC ×1 | 0.78 | 3.74 | 4.18 | 12% | 3.8 | 31 |
| RANSAC ×10 | 0.91 | 1.58 | 2.85 | 5% | 18.7 | 149 |
| RANSAC ×100 | 0.93 | 1.47 | 2.77 | 4% | 64.1 | 515 |
| $\ell_{2,1}$-RPCA | 0.77 | 4.35 | 1.72 | 14% | 2.8 | 30 |
| REAPER | 0.88 | 2.48 | 1.07 | 8% | 4.1 | 27 |
| DPCP-IRLS | 0.81 | 3.67 | 1.48 | 12% | 3.0 | 29 |
| DPCP-d | 0.92 | 1.51 | 0.82 | 5% | 6.5 | 16 |
| DPCP-PSGM | 0.92 | 1.59 | 0.76 | 5% | 37.3 | 24 |

⁴For RANSAC this threshold is also its internal thresholding parameter.

⁵Experiments were run on a laptop with an Intel i7-6700HQ @ 2.6 GHz CPU and 16 GB of 2133 MHz DDR4 memory.

Notably, not only do DPCP-PSGM and DPCP-d outperform RANSAC ×1 and RANSAC ×10, but their performance is comparable with that of RANSAC ×100, which they further surpass, e.g., in estimating the orientation of the normal vector $b^*$: RANSAC ×100 is off by 2.77° on average, while DPCP-PSGM and DPCP-d are off by only 0.76° and 0.82°, respectively; see also Fig. 9. On the other hand, DPCP-IRLS and REAPER make heavy use of the SVD, which makes them slow to run on $10^5$ points, and eventually inaccurate given the limited time budget.

Figure 9. Frame 328 of KITTI-CITY-71, with inliers in blue and outliers in red. Top: 3D point clouds and estimated translations (ground truth, RANSAC ×1, DPCP-PSGM, RANSAC ×100); the ground-truth thresholding parameter is used for RANSAC. Bottom: projections of the 3D point clouds onto the image (ground truth, DPCP-PSGM, DPCP-IRLS, DPCP-d, RANSAC ×10, RANSAC ×100).

Acknowledgment

The co-authors from JHU were supported by NSF grant 1704458. The co-authors from ShanghaiTech were supported by ShanghaiTech grant 2017F0203-000-16.

References

Chakraborty, R., Hauberg, S., and Vemuri, B. C. Intrinsic Grassmann averages for online linear and robust subspace learning. IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Chiang, K.-Y., Dhillon, I. S., and Hsieh, C.-J. Using side information to reliably learn low-rank matrices from missing and corrupted observations. Journal of Machine Learning Research, 19(76):1–35, 2018.
Dai, B., Wang, Y., Aston, J., Hua, G., and Wipf, D. Connections with robust PCA and the role of emergent sparsity in variational autoencoder models. Journal of Machine Learning Research, 19(41):1–42, 2018.

Diakonikolas, I., Kamath, G., Kane, D., and Li, J. Being robust (in high dimensions) can be practical. International Conference on Machine Learning, 2017.

Fischler, M. A. and Bolles, R. C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.

Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

Lerman, G. and Maunu, T. An overview of robust subspace recovery. Proceedings of the IEEE, 106(8):1380–1410, 2018.

Lerman, G., McCoy, M. B., Tropp, J. A., and Zhang, T. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363–410, 2015.

Marinov, T. V., Mianjy, P., and Arora, R. Streaming principal component analysis in noisy settings. International Conference on Machine Learning, 2018.

Maunu, T., Zhang, T., and Lerman, G. A well-tempered landscape for non-convex robust subspace recovery. Journal of Machine Learning Research, 20(37):1–59, 2019.

Qu, Q., Sun, J., and Wright, J. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems, pp. 3401–3409, 2014.

Rahmani, M. and Atia, G. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Transactions on Signal Processing, 65(23), 2017.

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pp. 746–760. Springer, 2012.

Soltanolkotabi, M. and Candès, E. J. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4):2195–2238, 2012.

Tsakiris, M. and Vidal, R. Dual principal component pursuit. ICCV Workshop on Robust Subspace Learning and Computer Vision, pp. 10–18, 2015.

Tsakiris, M. and Vidal, R. Hyperplane clustering via dual principal component pursuit. International Conference on Machine Learning, 2017.

Tsakiris, M. and Vidal, R. Dual principal component pursuit. Journal of Machine Learning Research, 19(18):1–50, 2018a.

Tsakiris, M. and Vidal, R. Theoretical analysis of sparse subspace clustering with missing entries. International Conference on Machine Learning, 2018b.

Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. In Neural Information Processing Systems, 2010.

You, C., Robinson, D., and Vidal, R. Provable self-representation based outlier detection in a union of subspaces. IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Zhang, T. Robust subspace recovery by Tyler's M-estimator. Information and Inference: A Journal of the IMA, 5(1):1–21, 2016.

Zhang, T. and Lerman, G. A novel M-estimator for robust PCA. The Journal of Machine Learning Research, 15(1):749–808, 2014.

Zhang, T. and Yang, Y. Robust PCA by manifold optimization. Journal of Machine Learning Research, 19(80):1–39, 2018.

Zhang, Y., Shi, D., Gao, J., and Cheng, D. Low-rank-sparse subspace representation for robust regression. IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Zhu, Z., Wang, Y., Robinson, D., Naiman, D., Vidal, R., and Tsakiris, M. Dual principal component pursuit: Improved analysis and efficient algorithms. Neural Information Processing Systems, 2018a.
Zhu, Z., Wang, Y., Robinson, D., Naiman, D., Vidal, R., and Tsakiris, M. Dual principal component pursuit: Improved analysis and efficient algorithms. arXiv preprint arXiv:1812.09924, 2018b.