# Efficient Algorithms for Sum-of-Minimum Optimization

Lisang Ding 1, Ziang Chen 2, Xinshang Wang 3, Wotao Yin 3

1 Department of Mathematics, University of California, Los Angeles, Los Angeles, CA, USA. 2 Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA. 3 Decision Intelligence Lab, Alibaba US, Bellevue, WA, USA. Correspondence to: Ziang Chen.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In this work, we propose a novel optimization model termed sum-of-minimum optimization. This model seeks to minimize the sum or average of N objective functions over k parameters, where each objective takes the minimum value of a predefined sub-function with respect to the k parameters. This universal framework encompasses numerous clustering applications in machine learning and related fields. We develop efficient algorithms for solving sum-of-minimum optimization problems, inspired by a randomized initialization algorithm for classic k-means (Arthur & Vassilvitskii, 2007) and Lloyd's algorithm (Lloyd, 1982). We establish a new tight bound for the generalized initialization algorithm and prove a gradient-descent-like convergence rate for the generalized Lloyd's algorithm. The efficiency of our algorithms is numerically examined on multiple tasks, including generalized principal component analysis, mixed linear regression, and small-scale neural network training. Our approach compares favorably to previous ones based on simpler-but-less-precise optimization reformulations.

1. Introduction

In this paper, we propose the following sum-of-minimum optimization model:
$$\min_{x_1, x_2, \dots, x_k} \ F(x_1, x_2, \dots, x_k) := \frac{1}{N} \sum_{i=1}^{N} \min\{f_i(x_1), f_i(x_2), \dots, f_i(x_k)\}, \tag{1}$$
where $x_1, x_2, \dots, x_k$ are unknown parameters to determine. The cost function F is the average of N objectives, where the i-th objective is $f_i$ evaluated at the best of the k parameter choices. This paper aims to develop efficient algorithms for solving (1) and analyze their performance.

Write $[k] = \{1, 2, \dots, k\}$ and $[N] = \{1, 2, \dots, N\}$. Let $(C_1, C_2, \dots, C_k)$ be a partition of $[N]$, i.e., the $C_j$'s are disjoint subsets of $[N]$ and their union equals $[N]$. Let $\mathcal{P}^k_N$ denote the set of all such partitions. Then, (1) is equivalent to
$$\min_{(C_1, C_2, \dots, C_k) \in \mathcal{P}^k_N} \ \min_{x_1, x_2, \dots, x_k} \ \frac{1}{N} \sum_{j=1}^{k} \sum_{i \in C_j} f_i(x_j). \tag{2}$$
It is easy to see that $(C_1^\star, C_2^\star, \dots, C_k^\star)$ and $(x_1^\star, x_2^\star, \dots, x_k^\star)$ are optimal for (2) if and only if $(x_1^\star, x_2^\star, \dots, x_k^\star)$ is optimal for (1) and, for every $i \in C_j^\star$, $f_i(x_j^\star) = \min\{f_i(x_1^\star), f_i(x_2^\star), \dots, f_i(x_k^\star)\}$.

Reformulation (2) reveals the clustering purpose: it finds the optimal partition $(C_1^\star, C_2^\star, \dots, C_k^\star)$ such that using the parameter $x_j^\star$ to minimize the average of the $f_i$'s in the cluster $C_j$ leads to the minimal total cost.

Problem (1) generalizes k-means clustering. Consider N data points $y_1, y_2, \dots, y_N$ and a distance function $d(\cdot,\cdot)$. The goal of k-means clustering is to find cluster centroids $x_1, x_2, \dots, x_k$ that minimize
$$F(x_1, x_2, \dots, x_k) = \frac{1}{N} \sum_{i=1}^{N} \min_{j \in [k]} \{d(x_j, y_i)\},$$
which is the average distance from each data point to its nearest cluster center. The literature presents various choices for the distance function $d(\cdot,\cdot)$.
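The following is a minimal sketch of how the sum-of-minimum objective (1) can be evaluated, with k-means as the special case $f_i(x) = \frac{1}{2}\|x - y_i\|^2$. The function and variable names are illustrative only and are not taken from the authors' released code.

```python
import numpy as np

def sum_of_minimum_objective(params, sub_losses):
    """F(x_1,...,x_k) = (1/N) * sum_i min_j f_i(x_j), as in (1).

    params:     list of k parameter vectors x_j
    sub_losses: list of N callables, sub_losses[i](x) = f_i(x)
    """
    total = 0.0
    for f_i in sub_losses:
        total += min(f_i(x_j) for x_j in params)
    return total / len(sub_losses)

# k-means as a special case: f_i(x) = 0.5 * ||x - y_i||^2
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))                                  # data points y_i
sub_losses = [lambda x, y=y: 0.5 * np.sum((x - y) ** 2) for y in Y]
centers = [Y[0], Y[1], Y[2]]                                   # k = 3 candidate centers
print(sum_of_minimum_objective(centers, sub_losses))
```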
When $d(x, y) = \frac{1}{2}\|x - y\|^2$, this optimization problem reduces to the classic k-means clustering problem, for which numerous algorithms have been proposed (Krishna & Murty, 1999; Arthur & Vassilvitskii, 2007; Na et al., 2010; Sinaga & Yang, 2020; Ahmed et al., 2020). The Bregman divergence is also widely adopted as a distance measure (Banerjee et al., 2005; Manthey & Röglin, 2013; Liu & Belkin, 2016), defined as $d(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle$, with $h$ being a differentiable convex function.

A special case of (1) is mixed linear regression, which generalizes linear regression and models the dataset $\{(a_i, b_i)\}_{i=1}^N$ by multiple linear models. A linear model is a function $g(a; x) = a^\top x$, which uses $x$ as the coefficient vector. Make k copies of the linear model and set the j-th linear coefficient to $x_j$. The loss for each data pair $(a_i, b_i)$ is the squared error of the best-fitting linear model, namely $\min_{j \in [k]} \{\frac{1}{2}(g(a_i; x_j) - b_i)^2\}$. We aim to search for optimal parameters $\{x_j\}_{j=1}^k$ that minimize the average loss
$$\min_{x_1, x_2, \dots, x_k} \ \frac{1}{N} \sum_{i=1}^{N} \min_{j \in [k]} \frac{1}{2}\big(g(a_i; x_j) - b_i\big)^2. \tag{3}$$
The paper (Zhong et al., 2016) simplifies this non-smooth problem to the sum-of-product problem
$$\min_{x_1, x_2, \dots, x_k} \ \frac{1}{N} \sum_{i=1}^{N} \prod_{j \in [k]} \big(g(a_i; x_j) - b_i\big)^2, \tag{4}$$
which is smooth. Although (4) is easier to approach due to its smooth objective function, problem (3) is more accurate. Various algorithms have been proposed to recover k linear models from mixed-class data (Yi et al., 2014; Shen & Sanghavi, 2019; Kong et al., 2020; Zilber & Nadler, 2023). In (3), the function $g(\cdot; x)$ parameterized by $x$ can be any nonlinear function such as a neural network, and we call this extension mixed nonlinear regression.

Another application of (1) is generalized principal component analysis (GPCA) (Vidal et al., 2005; Tsakiris & Vidal, 2017), which aims to recover k low-dimensional subspaces, $V_1, V_2, \dots, V_k$, from the given data points $y_1, y_2, \dots, y_N$, which are assumed to lie on or close to the union $V_1 \cup V_2 \cup \dots \cup V_k$. This process, also referred to as subspace clustering, seeks to accurately segment the data points into their respective subspaces (Ma et al., 2008; Vidal, 2011; Elhamifar & Vidal, 2013). Each subspace $V_j$ is represented as $V_j = \{y \in \mathbb{R}^d : y^\top A_j = 0\}$, where $A_j \in \mathbb{R}^{d \times r}$ and $A_j^\top A_j = I_r$, with $r$ being the codimension of $V_j$. From an optimization perspective, the GPCA task can be formulated as
$$\min_{A_j^\top A_j = I_r} \ \frac{1}{N} \sum_{i=1}^{N} \min_{j \in [k]} \frac{1}{2}\big\|y_i^\top A_j\big\|^2. \tag{5}$$
Similar to (4), (Peng & Vidal, 2023) works with a less precise reformulation that uses the product of $\|y_i^\top A_j\|^2$ for smoothness and introduces a block coordinate descent algorithm.

When $k = 1$, problem (1) reduces to the finite-sum optimization problem
$$\min_x \ F(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \tag{6}$$
widely used to train machine learning models, where $f_i(x)$ is the loss of the model with parameter $x$ on the i-th data point. When the underlying model lacks sufficient expressiveness, problem (6) alone may not yield satisfactory results. To enhance a model's performance, one can train the model with multiple parameters, $x_1, x_2, \dots, x_k$, $k \ge 2$, and use only the most effective parameter for each data point. This strategy has been successfully applied in various classic tasks, including the aforementioned k-means clustering, mixed linear regression, and generalized principal component analysis. These applications share a common objective: to segment the dataset into k groups and identify the best parameter for each group.
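As a small illustration of the difference between the sum-of-minimum loss (3) and the sum-of-product relaxation (4), the sketch below evaluates both on synthetic mixed-linear-regression data. It is not the authors' implementation; the data generation here is a simplified stand-in for the setup described later in the paper.

```python
import numpy as np

def som_loss(X, A, b):
    """Sum-of-minimum loss (3): mean over i of min_j 0.5*(a_i^T x_j - b_i)^2."""
    residuals = A @ np.stack(X, axis=1) - b[:, None]   # residuals[i, j] = a_i^T x_j - b_i
    return np.mean(0.5 * np.min(residuals ** 2, axis=1))

def sop_loss(X, A, b):
    """Sum-of-product relaxation (4): mean over i of prod_j (a_i^T x_j - b_i)^2."""
    residuals = A @ np.stack(X, axis=1) - b[:, None]
    return np.mean(np.prod(residuals ** 2, axis=1))

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))                          # features a_i
x_true = [rng.normal(size=5) for _ in range(2)]        # two ground-truth linear models
labels = rng.integers(0, 2, size=200)                  # which model generated each point
b = np.array([A[i] @ x_true[labels[i]] for i in range(200)])

# Both losses vanish at the ground truth (one residual per point is exactly zero),
# but away from it they penalize candidate parameters very differently.
print(som_loss(x_true, A, b), sop_loss(x_true, A, b))
```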
Although no single parameter might perform well across the entire dataset, every data point is adequately served by at least one of the k parameters. By aggregating the strengths of multiple smaller models, this approach not only enhances model expressiveness but also offers a cost-efficient alternative to deploying a single larger model.

Although one might expect the algorithms and analyses for the sum-of-minimum problem (1) to be weaker, since (1) subsumes the previously discussed models, we find that our algorithms and analyses for (1) improve upon those known for the existing models. Our algorithms extend the k-means++ algorithm (Arthur & Vassilvitskii, 2007) and Lloyd's algorithm (Lloyd, 1982), which were proposed for classic k-means problems, and we obtain new bounds for these algorithms applied to (1). Our contributions are summarized as follows:

- We propose the sum-of-minimum optimization problem, adapt k-means++ to the problem for initialization, and generalize Lloyd's algorithm to approximately solve the problem.
- We establish theoretical guarantees for the proposed algorithms. Specifically, under the assumption that each $f_i$ is L-smooth and µ-strongly convex, we prove that the output of the initialization is $O\big(\frac{L^2}{\mu^2} \ln k\big)$-optimal and that this bound is tight with respect to both k and the condition number $L/\mu$. When reduced to k-means optimization, our result recovers that of (Arthur & Vassilvitskii, 2007). Furthermore, we prove an $O(1/T)$ convergence rate for the generalized Lloyd's algorithms.
- We numerically verify the efficiency of the proposed framework and algorithms on several tasks, including generalized principal component analysis, ℓ2-regularized mixed linear regression, and small-scale neural network training. The results reveal that our optimization model and algorithm lead to a higher success rate in finding the ground-truth clustering, compared to existing approaches that resort to less accurate reformulations for the sake of smoother optimization landscapes. Moreover, our initialization shows significant improvements in both convergence speed and the chance of obtaining better minima.

Our work significantly generalizes classic k-means to handle more complex nonlinear models and provides new perspectives for improving model performance.

The rest of this paper is organized as follows. We introduce the preliminaries and related works in Section 2. We present the algorithms in Section 3. The algorithms are analyzed theoretically in Section 4 and numerically in Section 5. The paper is concluded in Section 6. Throughout this paper, the ℓ2-norm and ℓ2-inner product are denoted by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$, respectively. We employ $|\cdot|$ for the cardinality of a set.

2. Related Work and Preliminary

2.1. Related work

Lloyd's algorithm (Lloyd, 1982), a well-established iterative method for the classic k-means problem, alternates between two key steps (MacKay, 2003): 1) assigning $y_i$ to $x_j^{(t)}$ if $x_j^{(t)}$ is the closest to $y_i$ among $\{x_1^{(t)}, x_2^{(t)}, \dots, x_k^{(t)}\}$; 2) updating $x_j^{(t+1)}$ as the centroid of all $y_i$'s assigned to $x_j^{(t)}$. Although Lloyd's algorithm can be proved to converge to stationary points, the results can be highly suboptimal due to the inherent non-convex nature of the problem. Therefore, the performance of Lloyd's algorithm highly depends on the initialization.
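For concreteness, here is a minimal sketch of the two alternating steps of classic Lloyd's algorithm just described; it is a plain NumPy illustration, not the generalized algorithm developed later in this paper, and the empty-cluster handling is one possible choice.

```python
import numpy as np

def lloyd_kmeans(Y, centers, num_iters=20):
    """Classic Lloyd's iteration: 1) assign each y_i to its nearest center;
    2) move each center to the mean of its assigned points."""
    centers = centers.copy()
    for _ in range(num_iters):
        # Step 1: assignment to the nearest center.
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Step 2: centroid update (keep the old center if a cluster is empty).
        for j in range(centers.shape[0]):
            members = Y[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign

rng = np.random.default_rng(2)
Y = np.concatenate([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
init = Y[rng.choice(len(Y), 2, replace=False)]      # naive initialization
centers, assign = lloyd_kmeans(Y, init)
```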
To address this, a randomized initialization algorithm, k-means++ (Arthur & Vassilvitskii, 2007), generates an initial solution in a sequential fashion. Each centroid $x_j^{(0)}$ is sampled recurrently according to the distribution
$$\mathbb{P}\big(x_j^{(0)} = y_i\big) \ \propto \ \min_{1 \le j' \le j-1} \big\|x_{j'}^{(0)} - y_i\big\|^2, \quad i \in [N]. \tag{7}$$
The idea is to sample a data point farther from the current centroids with higher probability, ensuring that the samples are more evenly distributed across the dataset. It is proved in (Arthur & Vassilvitskii, 2007) that
$$\mathbb{E}\, F\big(x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}\big) \ \le \ 8(\ln k + 2)\, F^\star, \tag{8}$$
where $F^\star$ is the optimal objective value of F. This seminal work has inspired numerous enhancements to the k-means++ algorithm, as evidenced by contributions from (Bahmani et al., 2012; Zimichev et al., 2014; Bachem et al., 2016a;b; Wu et al., 2021; Ren et al., 2022). Our result generalizes the bound in (8), broadening its applicability to sum-of-minimum optimization.

2.2. Definitions and assumptions

In this subsection, we outline the foundational settings for our algorithm and theory. For each sub-function $f_i$, we establish the following assumptions.

Assumption 2.1. Each $f_i$ is L-smooth, satisfying $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$ and $i \in [N]$.

Assumption 2.2. Each $f_i$ is µ-strongly convex: for all $x, y \in \mathbb{R}^d$ and $i \in [N]$,
$$f_i(y) \ \ge \ f_i(x) + \nabla f_i(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2.$$

Let $x_i^\star$ denote the minimizer of $f_i(x)$, so that $f_i^\star = f_i(x_i^\star)$, and let $S^\star = \{x_i^\star : 1 \le i \le N\}$ represent the solution set. If $S^\star$ comprises $l < k$ different elements, problem (1) possesses infinitely many global minima. Specifically, we can set the variables $x_1, x_2, \dots, x_l$ to be the $l$ distinct elements of $S^\star$, while leaving $x_{l+1}, x_{l+2}, \dots, x_k$ as free variables. Given these k variables, $F(x_1, x_2, \dots, x_k) = \frac{1}{N} \sum_{i=1}^N f_i^\star$. If $S^\star$ contains more than k distinct elements, we have the following proposition.

Proposition 2.3. Under Assumption 2.2, if $|S^\star| \ge k$, the optimization problem (1) admits finitely many minimizers.

Expanding on the correlation between the number of global minimizers and the size of $S^\star$, we introduce well-posedness conditions for $S^\star$.

Definition 2.4 (k-separate and (k, r)-separate). We call $S^\star$ k-separate if it contains at least k different elements, i.e., $|S^\star| \ge k$. Furthermore, we call $S^\star$ (k, r)-separate if there exist $1 \le i_1 < i_2 < \dots < i_k \le N$ such that $\|x_{i_j}^\star - x_{i_{j'}}^\star\| > 2r$ for all $j \ne j'$.

Finally, we address the optimality measurement in (1). The norm of the (sub)gradient is an inappropriate measure of global optimality due to the problem's non-convex nature. Instead, we utilize the following optimality gap.

Definition 2.5 (Optimality gap). Given a point x, the optimality gap of $f_i$ at x is $f_i(x) - f_i^\star$. Given a finite point set M, the optimality gap of $f_i$ at M is $\min_{x \in M} f_i(x) - f_i^\star$. When $M = \{x_1, x_2, \dots, x_k\}$, the averaged optimality gap of $f_1, f_2, \dots, f_N$ at M is the shifted objective function
$$F(x_1, x_2, \dots, x_k) - \frac{1}{N} \sum_{i=1}^{N} f_i^\star. \tag{9}$$
The averaged optimality gap in (9) will be used as the optimality measurement throughout this paper. Specifically, in the classic k-means problem one has $f_i^\star = 0$, so the function $F(x_1, x_2, \dots, x_k)$ directly indicates global optimality.

3. Algorithms

In this section, we introduce the algorithm for solving the sum-of-minimum optimization problem (1). Our approach is twofold, comprising an initialization phase based on k-means++ and a generalized version of Lloyd's algorithm.

3.1. Initialization

As the sum-of-minimum optimization (1) can be considered a generalization of classic k-means clustering, we adopt k-means++.
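Before describing the generalization, a minimal sketch of the classic D²-style seeding rule (7) may be helpful; the function name is illustrative and not from the authors' code.

```python
import numpy as np

def kmeanspp_seeding(Y, k, rng):
    """Seeding as in (7): the first center is uniform over the data; each later
    center is a data point drawn with probability proportional to its squared
    distance to the nearest center chosen so far."""
    centers = [Y[rng.integers(len(Y))]]
    for _ in range(1, k):
        d2 = np.min(
            np.linalg.norm(Y[:, None, :] - np.stack(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()
        centers.append(Y[rng.choice(len(Y), p=probs)])
    return np.stack(centers)

rng = np.random.default_rng(3)
Y = rng.normal(size=(300, 2))
init_centers = kmeanspp_seeding(Y, k=4, rng=rng)
```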
In k-means++, cluster centers are selected sequentially from the dataset, with each data point chosen with probability proportional to its squared distance from the nearest existing cluster center, as detailed in (7). We generalize this idea and propose the following initialization algorithm, which outputs initial parameters $x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}$ for problem (1).

First, we select an index $i_1$ uniformly at random from $[N]$ and then use a specific method to determine the minimizer $x_{i_1}^\star$, setting
$$x_1^{(0)} = x_{i_1}^\star = \operatorname*{argmin}_x f_{i_1}(x). \tag{10}$$
For $j = 2, 3, \dots, k$, we sample $i_j$ based on the existing variables $M_j = \{x_1^{(0)}, x_2^{(0)}, \dots, x_{j-1}^{(0)}\}$, with each index $i$ sampled with probability proportional to the optimality gap of $f_i$ at $M_j$. Specifically, we compute the minimal optimality gaps
$$v_i^{(j)} = \min_{1 \le j' \le j-1} f_i\big(x_{j'}^{(0)}\big) - f_i^\star, \quad i \in [N], \tag{11}$$
as probability scores. Each score $v_i^{(j)}$ can be regarded as an indicator of how unresolved an instance $f_i$ is under the current variables $\{x_{j'}^{(0)}\}_{j'=1}^{j-1}$. We then normalize these scores,
$$w_i^{(j)} = \frac{v_i^{(j)}}{\sum_{i'=1}^{N} v_{i'}^{(j)}}, \quad i \in [N], \tag{12}$$
and sample $i_j \in [N]$ from the probability distribution $w^{(j)} = \big(w_1^{(j)}, \dots, w_N^{(j)}\big)$. The j-th initialization is determined by optimizing $f_{i_j}$:
$$x_j^{(0)} = x_{i_j}^\star = \operatorname*{argmin}_x f_{i_j}(x). \tag{13}$$
We terminate the selection process once the k variables $x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}$ are determined. The pseudo-code of this algorithm is shown in Algorithm 1.

Algorithm 1 Initialization
1: Sample $i_1$ uniformly at random from $[N]$ and compute $x_1^{(0)}$ via (10).
2: for $j = 2, 3, \dots, k$ do
3:   Compute $v^{(j)} = \big(v_1^{(j)}, v_2^{(j)}, \dots, v_N^{(j)}\big)$ via (11).
4:   Compute $w^{(j)} = \big(w_1^{(j)}, \dots, w_N^{(j)}\big)$ via (12).
5:   Sample $i_j \in [N]$ according to the weights $w^{(j)}$ and compute $x_j^{(0)}$ via (13).
6: end for

We note that the scores $v_i^{(j)}$ defined in (11) rely on the optimal objectives $f_i^\star$, which may be computationally intensive to calculate in certain scenarios. Therefore, we propose a variant of Algorithm 1 that adjusts the scores $v_i^{(j)}$. Specifically, when $j-1$ parameters $x_1^{(0)}, x_2^{(0)}, \dots, x_{j-1}^{(0)}$ have been selected, the score is set as the minimum squared norm of the gradient:
$$v_i^{(j)} = \min_{1 \le j' \le j-1} \big\|\nabla f_i\big(x_{j'}^{(0)}\big)\big\|^2. \tag{14}$$
This variant replaces the scores in Step 3 of Algorithm 1 with (14) and is further elaborated in Appendix B.

In the context of classic k-means clustering, where $f_i(x) = \frac{1}{2}\|x - y_i\|^2$ for the i-th data point $y_i$, the score $v_i^{(j)}$ in both (11) and (14) reduces to $\min_{1 \le j' \le j-1} \|x_{j'}^{(0)} - y_i\|^2$, up to a constant scalar. This initialization algorithm, whether using the scores from (11) or (14), aligns with the approach of the classic k-means++ algorithm.
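The sketch below illustrates the generalized seeding of Algorithm 1 under the simplifying assumption that the per-sub-function minimizers $x_i^\star$ and optimal values $f_i^\star$ are available as inputs; in practice they would come from an inner solver, and the helper names here are hypothetical.

```python
import numpy as np

def generalized_seeding(sub_losses, minimizers, optimal_values, k, rng):
    """Sketch of Algorithm 1. `minimizers[i]` stands in for x_i^* = argmin f_i and
    `optimal_values[i]` for f_i^*; both are assumed to be precomputed."""
    N = len(sub_losses)
    chosen = [rng.integers(N)]                    # i_1 uniform over [N]
    params = [minimizers[chosen[0]]]              # x_1^(0) = x_{i_1}^*  (10)
    for _ in range(1, k):
        # Scores (11): v_i = min_{j' < j} f_i(x_{j'}^(0)) - f_i^*.
        v = np.array([
            min(f_i(x) for x in params) - f_star
            for f_i, f_star in zip(sub_losses, optimal_values)
        ])
        w = v / v.sum()                           # normalized weights (12)
        i_j = rng.choice(N, p=w)                  # sample i_j ~ w^(j)
        chosen.append(i_j)
        params.append(minimizers[i_j])            # x_j^(0) = x_{i_j}^*  (13)
    return params, chosen
```

Swapping the score line for `min(np.sum(grad_f_i(x) ** 2) for x in params)`, with a gradient oracle in place of `f_i` and `f_star`, would give the variant (14) that avoids computing $f_i^\star$.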
3.2. Generalized Lloyd's algorithm

Lloyd's algorithm minimizes the k-means clustering loss by alternately updating the clusters and their centroids (Lloyd, 1982; MacKay, 2003). The centroid update can be regarded as a form of gradient descent applied to group functions, defined as the average distance between the data points within a cluster and its centroid (Bottou & Bengio, 1994). For our problem (1), we introduce a novel gradient descent algorithm that utilizes dynamic group functions. Our algorithm is structured into two main phases: reclassification and group gradient descent.

Reclassification. The goal is for $C_j^{(t)}$ to encompass all $i \in [N]$ for which $f_i$ is active at $x_j^{(t)}$, allowing us to use the sub-functions $f_i$ within $C_j^{(t)}$ to update $x_j^{(t)}$. This leads to the reclassification step
$$C_j^{(t)} = \Big\{ i \in [N] : f_i\big(x_j^{(t)}\big) \le f_i\big(x_{j'}^{(t)}\big),\ \forall\, j' \in [k] \Big\} \Big\backslash \bigcup_{l < j} C_l^{(t)}. \tag{15}$$

Group gradient descent. Given the classes $C_1^{(t)}, \dots, C_k^{(t)}$, we define the group objective functions
$$F_j^{(t)}(x) = \frac{1}{\big|C_j^{(t)}\big|} \sum_{i \in C_j^{(t)}} f_i(x), \quad j \in [k], \tag{16}$$
and update each parameter with one gradient descent step,
$$x_j^{(t+1)} = x_j^{(t)} - \gamma\, \nabla F_j^{(t)}\big(x_j^{(t)}\big), \tag{17}$$
where $\gamma > 0$ is the chosen step size. Alternatively, one might opt for different iterative updates or directly compute
$$x_j^{(t+1)} = \operatorname*{argmin}_x F_j^{(t)}(x),$$
especially if the minimizer of $\sum_{i \in C_j^{(t)}} f_i(x)$ admits a closed form or can be computed efficiently. The pseudo-code consisting of the above two steps is presented in Algorithm 2.

Momentum Lloyd's Algorithm. We enhance Algorithm 2 by incorporating a momentum term. The momentum for $x_j^{(t)}$ is denoted by $m_j^{(t)}$, with $0 < \beta < 1$ and $\gamma > 0$ serving as the step sizes for the momentum-based updates. We use the gradient of the group function $F_j^{(t)}$ to update the momentum $m_j^{(t)}$. The momentum algorithm takes the following form:
$$x_j^{(t+1)} = x_j^{(t)} - \gamma\, m_j^{(t)}, \tag{18}$$
$$m_j^{(t+1)} = \beta\, m_j^{(t)} + \nabla F_j^{(t+1)}\big(x_j^{(t+1)}\big). \tag{19}$$
A critical aspect of the momentum algorithm is how the classes $C_j^{(t)}$ are updated between (18) and (19). Rather than reclassifying based on $f_i$ evaluated at $x_j^{(t+1)}$, the reclassification leverages an acceleration variable:
$$u_j^{(t+1)} = \frac{1}{1-\beta}\big(x_j^{(t+1)} - \beta\, x_j^{(t)}\big). \tag{20}$$
The index $i$ is classified into the $C_j^{(t+1)}$ at which $f_i\big(u_j^{(t+1)}\big)$ attains the minimal value. Furthermore, to mitigate abrupt shifts in each class $C_j$, we implement a controlled reclassification scheme that limits the extent of change in each class:
$$\frac{1}{\alpha}\big|C_j^{(t)}\big| \ \le \ \big|C_j^{(t+1)}\big| \ \le \ \alpha\big|C_j^{(t)}\big|, \tag{21}$$
where $\alpha > 1$ serves as a constraint factor. Details of the momentum algorithm are provided in Appendix B. We display the pseudo-code in Algorithm 3.

Algorithm 3 Momentum Lloyd's Algorithm
1: Generate the initialization $x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}$. Set $m_1^{(0)}, m_2^{(0)}, \dots, m_k^{(0)}$ to 0. Set $r, \alpha, \beta, \gamma$.
2: for $t = 0, 1, 2, \dots, T$ do
3:   Update $x_j^{(t)}$ using (18).
4:   if $t \equiv 0 \pmod r$ then
5:     Compute $u_j^{(t+1)}$ via (20).
6:     Update $C_j^{(t+1)}$ with $u_j^{(t+1)}$ in control, such that (21) holds.
7:   else
8:     $C_j^{(t+1)} = C_j^{(t)}$, $1 \le j \le k$.
9:   end if
10:  Update the momentum $m_j^{(t)}$ via (19).
11: end for

4. Theoretical Analysis

In this section, we prove the efficiency of the initialization algorithm and establish the convergence rate of Lloyd's algorithm. For the initialization Algorithm 1, we show that the ratio between the optimality gap of $\{x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}\}$ and the smallest possible optimality gap is $O\big(\frac{L^2}{\mu^2} \ln k\big)$. Additionally, by presenting an example where this ratio is $\Omega\big(\frac{L^2}{\mu^2} \ln k\big)$, we illustrate the bound's tightness. For Lloyd's Algorithms 2 and 3, we establish a gradient decay rate of $O(1/T)$, underscoring the efficiency and convergence properties of these algorithms.

4.1. Error bound of the initialization algorithm

We define the set of initial points selected by the randomized initialization Algorithm 1, $M_{\mathrm{init}} = \{x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}\} = \{x_{i_1}^\star, x_{i_2}^\star, \dots, x_{i_k}^\star\}$, as the starting configuration of our optimization process. For simplicity, we use $F(M_{\mathrm{init}}) = F(x_{i_1}^\star, x_{i_2}^\star, \dots, x_{i_k}^\star)$ to represent the function value at these initial points. Let $F^\star$ be the global minimal value of F, and let $\bar{f}^\star = \frac{1}{N} \sum_{i=1}^N f_i^\star$ denote the average of the optimal values of the sub-functions. The effectiveness of Algorithm 1 is evaluated by the ratio between $\mathbb{E} F(M_{\mathrm{init}}) - \bar{f}^\star$ and $F^\star - \bar{f}^\star$, which is the expected ratio between the averaged optimality gap at $M_{\mathrm{init}}$ and the minimal possible averaged optimality gap. The following theorem provides a specific bound.

Theorem 4.1. Suppose that Assumptions 2.1 and 2.2 hold. Assume that the solution set $S^\star$ is k-separate.
Let Minit be a random initialization set generated by Algorithm 1. We have EF(Minit) f 4(2 + ln k) L2 Theorem 4.1 indicates that the relative optimality gap at the initialization set is constrained by a factor of O( L2 µ2 ln k) times the minimal optimality gap. The proof of Theorem 4.1 is detailed in Appendix C. In the classic k-means problem, where L = µ, this result reduces to Theorem 1.1 in (Arthur & Vassilvitskii, 2007). Moreover, the upper bound O( L2 µ2 ln k) is proven to be tight via a lower bound established in the following theorem. Theorem 4.2. Given a fixed cluster number k > 0, there exists an integer N > 0. We can construct N sub-functions {fi}N i=1 satisfying Assumptions 2.1 2.2 and guaranteeing the solution set S to be k-separate. When applying Algorithm 1 over the instances {fi}N i=1, we have EF(Minit) f 1 µ2 ln k (F f ) . (22) The proof of Theorem 4.2 is presented in detail in Appendix C. In both Theorem 4.1 and Theorem 4.2, the performance of Algorithm 1 is analyzed with the assumption that v(j) and f i in (11) can be computed exactly. However, the accurate computation of f i may be impractical due to computational costs. Therefore, we explore the error bounds when the score v(j) approximates (11) with some degree of error. We investigate two types of scoring errors. Additive error. There exists ϵ > 0, we have access to an estimated f i satisfying f i ϵ f i f i + ϵ. (23) Accordingly, we define: v(j) i = min 1 j j 1 max fi(x(0) j ) f i , 0 = max min 1 j j 1 fi(x(0) j ) f i , 0 . (24) Scaling error. There exists a deterministic oracle Ov : [N] Rd R, such that for any x Rd and i [N], c1(fi(x) f i ) Ov(i, x) c2(fi(x) f i ). (25) Set v(j) i = min 1 j j 1 Ov(i, x(0) j ). (26) We first analyze the performance of Algorithm 1 using the score v(j) i with additive error as in (24). We typically require the assumption that the solution set S is (k, q µ )-separate, which guarantees that i=1 min j [l] max (fi(zj) f i ), 0 > 0, for any l < k and z1, z2, . . . , zl Rd. Hence in the initialization Algorithm 1 with score (24), there is at least one v(j) i > 0 in each round. We have the following generalized version of Theorem 4.1 with additive error. Theorem 4.3. Under Assumptions 2.1 and 2.2, suppose that we have { f i }N i=1 satisfying (23) for some noise factor ϵ > 0, and that the solution set S is (k, q µ )-separate. Then for the initialization Algorithm 1 with the scores in (11) replaced by the noisy scores in (24), we have EF(Minit) f 4(2 + ln k) L2 + ϵ 1 + (2 + ln k) 1 + 4L The proof of Theorem 4.3 is deferred to Appendix C. Next, we state a similar result for the scaling-error oracle as in (26), whose proof is deferred to Appendix C. Theorem 4.4. Suppose that Assumptions 2.1 2.2 hold and that the solution set S is k-separate. Then for the initialization Algorithm 1 with the scores in (11) replaced by the scores in (26), we have the following bound: EF(Minit) f L µ + c2 2 c2 1 (2 + ln k)(F f ). Recall that we introduce an alternative score in (14). This score can actually be viewed as a noisy version of (11) with a scaling error. Under Assumptions 2.1 and 2.2, it holds that 2µ(fi(x) f i ) fi(x) 2 2L(fi(x) f i ), for any i [N] and x Rd, which satisfies (25) with c1 = 2µ and c2 = 2L. Therefore, we have a direct corollary of Theorem 4.4. Efficient Algorithms for Sum-Of-Minimum Optimization Corollary 4.5. Suppose that Assumptions 2.1 and 2.2 hold and that the solution set S is k-separate. 
For the initialization Algorithm 1 with the scores in (11) replaced by the scores in (14), we have EF(Minit) f 4 L2 (2 + ln k)(F f ). 4.2. Convergence rate of Lloyd s algorithm In this subsection, we state convergence results of Lloyd s Algorithm 2 and momentum Lloyd s Algorithm 3, with all proofs being deferred to Appendix D. For Algorithm 2, the optimization process of x(t) j follows a gradient descent scheme on a varying objective function F (t) j , which is the average of all active fi s determined by C(t) j in (15). We have the following gradient-descent-like convergence rate on the gradient norm F (t) j (x(t) j ) . Theorem 4.6. Suppose that Assumption 2.1 is satisfied and we take the step size γ = 1 L in Algorithm 2. Then |C(t) j | N F (t) j (x(t) j ) 2 F(x(0) 1 , x(0) 2 , . . . , x(0) k ) F . For momentum Lloyd s Algorithm 3, we have a similar convergence rate stated as follows. Theorem 4.7. Suppose that Assumption 2.1 holds and that α > 1. For Algorithm 3, there exists a constant γ(α, β, L), such that |C(t) j | N F (t) j (x(t) j ) 2 γ F(x(0) 1 , x(0) 2 , . . . , x(0) k ) F as long as γ γ(α, β, L). 5. Numerical Experiments In this section, we conduct numerical experiments to demonstrate the efficiency of the proposed model and algorithms. Our code with documentation can be found at https://github.com/Lisang Ding/ Sum-of-Minimum_Optimization. 5.1. Comparison between the sum-of-minimum model and the product formulation We consider two optimization models for generalized principal component analysis: the sum-of-minimum formulation (5) and another widely acknowledged formulation given by (Peng & Vidal, 2023; Vidal et al., 2005): minimize A j Aj=Ir j=1 y i Aj 2. (28) The initialization for both formulations is generated by Algorithm 1. We use a slightly modified version of Algorithm 2 to minimize (5) since the minimization of the group functions for GPCA admits closed-form solutions. In particular, we alternatively compute the minimizer of each group objective function as the update of Aj and then reclassify the sub-functions. We use the block coordinate descent (BCD) method (Peng & Vidal, 2023) to minimize (28). The BCD algorithm alternatively minimizes Aj with all other Al (l = j) being fixed. The pseudo-codes of both algorithms are included in Appendix E.1. We set the cluster number k {2, 3, 4}, dimension d {4, 5, 6}, subspace co-dimension r = d 2, and the number of data points N = 1000. The generalization of the dataset {yi}N i=1 is described in Appendix E.1. We set the maximum iteration number as 50 for Algorithm 2 with (5) and terminate the algorithm once the objective function stops decreasing, i.e., the partition/clustering remains unchanged. Meanwhile, we set the iteration number to 50 for the BCD algorithm (Peng & Vidal, 2023) with (28). The sythetic data generation is elaborated in Appendix E.1. The classification accuracy of both methods is reported in Table 1, where the classification accuracy is defined as the maximal matching accuracy with respect to the ground truth over all permutations. We observe that our model and algorithm lead to significantly higher accuracy. This is because, compared to (28), the formulation in (5) models the requirements more precisely, though it is more difficult to optimize due to the non-smoothness. Next, we compare the computational cost for our model and algorithms with that of the product model and the BCD algorithm. We observe that the BCD algorithm exhibited limited improvements in accuracy after the initial 10 iterations. 
Thus, for a fair comparison, we set both the maximum number of iterations for our model and algorithms and the iteration number for the BCD algorithm to 10. The accuracy rate and the CPU time are shown in Table 2, from which one can see that the computational costs of our algorithm and the BCD algorithm are competitive, while our algorithm achieves much better classification accuracy.

Table 1. Cluster accuracy percentages of the sum-of-minimum (vs. sum-of-product) GPCA models after 50 iterations.

|       | d = 4         | d = 5         | d = 6         |
|-------|---------------|---------------|---------------|
| k = 2 | 98.24 (81.88) | 98.07 (75.90) | 98.19 (73.33) |
| k = 3 | 95.04 (67.69) | 94.98 (62.89) | 95.94 (60.85) |
| k = 4 | 91.30 (62.36) | 92.92 (59.65) | 93.73 (57.89) |

Table 2. Averaged cluster accuracy percentages (CPU time in seconds) for GPCA after 10 iterations. In each setting, the rows display the results for the sum-of-minimum and the sum-of-product models respectively.

|                       | d = 4        | d = 5        | d = 6        |
|-----------------------|--------------|--------------|--------------|
| k = 2, sum-of-minimum | 97.84 (0.08) | 97.93 (0.08) | 98.01 (0.08) |
| k = 2, sum-of-product | 81.78 (0.14) | 75.76 (0.14) | 73.24 (0.15) |
| k = 3, sum-of-minimum | 93.34 (0.19) | 94.14 (0.19) | 95.25 (0.16) |
| k = 3, sum-of-product | 67.18 (0.20) | 62.76 (0.22) | 60.80 (0.20) |
| k = 4, sum-of-minimum | 88.62 (0.32) | 91.78 (0.29) | 92.62 (0.27) |
| k = 4, sum-of-product | 61.52 (0.26) | 59.37 (0.27) | 57.82 (0.27) |

5.2. Comparison between different initializations

We present the performance of Lloyd's Algorithm 2 combined with different initialization methods. The initialization methods adopted in this subsection are: Normal initialization. We initialize variables
Additionally, we record the number of iterations the algorithm takes to output a result. The result is displayed in Table 3. Mixed nonlinear regression. Our second experiment is on mixed nonlinear regression using 2-layer neural networks. We construct k neural networks with the same structure and let the j-th neural network be: ψ(a; Wj, pj, qj, oj) = p j Re LU(Wja + qj) + oj. Here, a is the input data. We let d I be the input dimension and d H be the hidden dimension. The dimensions of the variables are a Rd I, Wj Rd H d I, pj, qj Rd H, oj R. We denote θj = (Wj, pj, qj, oj) as the trainable parameters in the neural network. For each trial, we prepare the ground truth θ+ j and the dataset {(ai, bi)}N i=1 as described in Appendix E.2. We use the squared ℓ2 loss for each neural network and construct the i-th sub-function as: 2(ψ(ai; θ) bi)2 + λ where we still use λ 2 θ 2, λ > 0 as a regularization term. We perform parallel experiments on training the neural networks via Algorithm 2 using three different initialization methods. During the training process of neural networks, stochastic gradient descent is commonly used to manage limited memory, reduce training loss, and improve generalization. Moreover, the ADAM algorithm proposed in (Kingma & Ba, 2014) is widely applied. This optimizer is empirically observed to be less sensitive to hyperparameters, more robust, and to converge faster. To align with this practice, we replace the group gradient descent in Algorithm 2 and the group momentum method in Algorithm 3 with ADAM optimizer-based backward propagation for the corresponding group objective function. We use two metrics to measure the performance of the algorithms. In one set of experiments, we train k neural networks until the value of the loss function F under parameters θ1, θ2, . . . , θk is less than that under θ+ 1 , θ+ 2 , . . . , θ+ k . We record the average iterations required to achieve the optimization loss. In the other set of experiments, we train k neural networks for a fixed number of iterations. Then, we compute the training and testing loss of the trained neural Efficient Algorithms for Sum-Of-Minimum Optimization Table 3. The failing rate (average iteration number) of three initialization methods when solving mixed linear regression problems with different cluster numbers and dimensionality. A smaller failure rate and a lower average iteration number indicate better performance. The least failure rate among the three methods is bolded, and the least average iteration number under the same cluster number and dimension settings is underlined. Init. Method d = 4 d = 5 d = 6 d = 7 d = 8 normal 0.056 (17.577) 0.031 (18.378) 0.038 (19.923) 0.058 (21.631) 0.071 (22.344) unif. seeding 0.057 (16.139) 0.034 (16.885) 0.050 (18.022) 0.055 (18.708) 0.075 (19.959) caref. seeding 0.050 (14.551) 0.036 (15.276) 0.034 (16.020) 0.044 (16.936) 0.051 (17.409) normal 0.161 (26.355) 0.156 (28.844) 0.172 (32.247) 0.238 (35.042) 0.321 (38.324) unif. seeding 0.145 (23.728) 0.136 (25.914) 0.143 (27.671) 0.198 (29.935) 0.256 (32.662) caref. seeding 0.162 (21.552) 0.130 (23.476) 0.143 (25.933) 0.161 (27.268) 0.217 (29.086) normal 0.363 (35.831) 0.382 (41.043) 0.504 (43.999) 0.594 (47.918) 0.739 (48.730) unif. seeding 0.347 (31.536) 0.350 (35.230) 0.408 (39.688) 0.524 (42.453) 0.596 (43.117) caref. 
seeding 0.339 (29.610) 0.312 (33.460) 0.389 (36.068) 0.463 (39.010) 0.563 (40.320) network, where the training loss on the dataset {(ai, bi)}N i=1 is defined as 1 N PN i=1 minj 1 2(ψ(ai; θj) bi)2 and the testing loss is defined in a similar way. In our experiments, the training dataset size is N = 1000 and the testing dataset size is 200. The testing dataset is generated from the same distribution as the training data. Benefiting from ADAM s robust nature regarding hyperparameters, we use the default ADAM learning rate γ = 1e 3. We set r = 10 in Lloyd s Algorithm 2 and fix the cluster number k = 5. We test on three different (d I, d H) tuples: (5, 3), (7, 5), and (10, 5). The results can be found in Table Table 4. Average epochs for different seeding methods to achieve the ground truth model training loss. (d I, d H) (5,3) (7,5) (10,5) normal 329.4 132.1 130.8 unif. Seeding 233.1 71.2 67.6 caref. Seeding 181.4 49.3 47.2 Table 5. The training (testing) errors (unit: 10e-3) of Lloyd s algorithm with fixed training epoch numbers. (d I, d H) / Iter. (5,3) / 300 (7,5) / 150 (10,5) / 150 normal 4.26 (4.63) 4.57 (5.54) 4.62 (5.82) unif. Seeding 3.86 (4.25) 3.96 (4.77) 3.56 (4.52) caref. Seeding 3.44 (3.93) 3.51 (4.37) 3.39 (4.34) We can conclude from Table 3, 4, and 5 that the careful seeding Algorithm 1 generates the best initialization in most cases. This initialization algorithm results in the fewest iterations required by Lloyd s algorithm to converge, the smallest final loss, and the highest probability of finding the ground-truth clustering. 6. Conclusion This paper proposes a general framework for sum-ofminimum optimization, as well as efficient initialization and optimization algorithms. Theoretically, tight bounds are established for smooth and strongly convex sub-functions fi. Though this work is motivated by classic algorithms for the k-means problem, we extend the ideas and theory significantly for a broad family of tasks. Furthermore, the numerical efficiency is validated for generalized principal component analysis and mixed linear and nonlinear regression problems. Future directions include developing algorithms with provable guarantees for non-convex fi and exploring empirical potentials on large-scale tasks. Acknowledgements Lisang Ding receives support from Air Force Office of Scientific Research Grants MURI-FA9550-18-1-0502. A major part of the work of Ziang Chen was completed during his internship at Alibaba US DAMO Academy. We thank Liangzu Peng for fruitful discussions on GPCA. Impact Statement The proposed unified framework for sum-of-minimum optimization covers a vast array of classic problems and modern machine learning tasks. From a long-range perspective, the ultimate goal of this paper is to provide guidance for improving the performance of machine learning models by using multiple sets of parameters. Therefore, the algorithms and ideas can potentially be applied in almost all fields of machine learning, though this paper only paves the first step. Efficient Algorithms for Sum-Of-Minimum Optimization Ahmed, M., Seraj, R., and Islam, S. M. S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics, 9(8):1295, 2020. Arthur, D. and Vassilvitskii, S. K-means++ the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027 1035, 2007. Bachem, O., Lucic, M., Hassani, H., and Krause, A. Fast and provably good seedings for k-means. 
Advances in neural information processing systems, 29, 2016a. Bachem, O., Lucic, M., Hassani, S. H., and Krause, A. Approximate k-means++ in sublinear time. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016b. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. Scalable k-means++. ar Xiv preprint ar Xiv:1203.6402, 2012. Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. Clustering with bregman divergences. Journal of machine learning research, 6(10), 2005. Bottou, L. and Bengio, Y. Convergence properties of the k-means algorithms. Advances in neural information processing systems, 7, 1994. Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11): 2765 2781, 2013. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kong, W., Somani, R., Song, Z., Kakade, S., and Oh, S. Meta-learning for mixed linear regression. In International Conference on Machine Learning, pp. 5394 5404. PMLR, 2020. Krishna, K. and Murty, M. N. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3):433 439, 1999. Liu, C. and Belkin, M. Clustering with bregman divergences: an asymptotic analysis. Advances in Neural Information Processing Systems, 29, 2016. Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129 137, 1982. Ma, Y., Yang, A. Y., Derksen, H., and Fossum, R. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM review, 50 (3):413 458, 2008. Mac Kay, D. An example inference task: Clustering. Information theory, inference and learning algorithms, 20: 284 292, 2003. Manthey, B. and R oglin, H. Worst-case and smoothed analysis of k-means clustering with bregman divergences. Journal of computational geometry, 4(1):94 132, 2013. Na, S., Xumin, L., and Yong, G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In 2010 Third International Symposium on intelligent information technology and security informatics, pp. 63 67. Ieee, 2010. Peng, L. and Vidal, R. Block coordinate descent on smooth manifolds: Convergence theory and twenty-one examples. In Conference on Parsimony and Learning (Recent Spotlight Track), 2023. Ren, Q., Zhang, D., Zhao, X., Yan, L., Rui, J., et al. A novel hybrid method of lithology identification based on k-means++ algorithm and fuzzy decision tree. Journal of Petroleum Science and Engineering, 208:109681, 2022. Shen, Y. and Sanghavi, S. Iterative least trimmed squares for mixed linear regression. Advances in Neural Information Processing Systems, 32, 2019. Sinaga, K. P. and Yang, M.-S. Unsupervised k-means clustering algorithm. IEEE access, 8:80716 80727, 2020. Tsakiris, M. C. and Vidal, R. Filtrated algebraic subspace clustering. SIAM Journal on Imaging Sciences, 10(1): 372 415, 2017. Vidal, R. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52 68, 2011. Vidal, R., Ma, Y., and Sastry, S. Generalized principal component analysis (gpca). IEEE transactions on pattern analysis and machine intelligence, 27(12):1945 1959, 2005. Wu, J., Shi, L., Yang, L., Xiaxia Niu, Li, Y., Xiaodong Cui, Tsai, S.-B., and Zhang, Y. User value identification based on improved rfm model and k-means++ algorithm for complex data analysis. Wireless Communications and Mobile Computing, 2021:1 8, 2021. 
Yi, X., Caramanis, C., and Sanghavi, S. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pp. 613 621. PMLR, 2014. Zhong, K., Jain, P., and Dhillon, I. S. Mixed linear regression with multiple components. Advances in neural information processing systems, 29, 2016. Efficient Algorithms for Sum-Of-Minimum Optimization Zilber, P. and Nadler, B. Imbalanced mixed linear regression. ar Xiv preprint ar Xiv:2301.12559, 2023. Zimichev, E. A., Kazanskii, N. L., and Serafimovich, P. G. Spectral-spatial classification with k-means++ particional clustering. Computer Optics, 38(2):281 286, 2014. Efficient Algorithms for Sum-Of-Minimum Optimization A. Proof of Proposition 2.3 In this section, we provide a proof of the proposition in Section 2. Proposition A.1 (Restatement of Proposition 2.3). Under Assumption 2.2, if |S | k, the optimization problem (1) admits finitely many minimizers. Proof. If |S | = k, then the only minimizer up to a permutation of indices is x1, x2, . . . , xk, such that {x1, x2, . . . , xk} = S . Next we consider the case where |S | > k. Let R be the set of all minimizers of (1). Due to the µ-strong convexity of fi, the set R is nonempty. Let T be the set of all partitions C1, C2, . . . , Ck of [N], such that Cj = for all j [k]. The set T is finite. Next, we show there is an injection from R to T . For X = (x1, x2, . . . , xk) R, we recurrently define CX j = {i [N] | fi(xj) = min l (fi(xl))}\ 1 j j 1CX j . We claim that all CX j s are nonempty. Otherwise, if there is an index j such that CX j = , we have a z S \{x1, x2, . . . , xk}. Replacing the j-th parameter xj with z, we have F(x1, x2, . . . , xk) > F(x1, x2, . . . , xj 1, z, xj+1, . . . , xk). This contradicts the assumption that X is a minimizer of (1). Therefore, X (CX 1 , CX 2 , . . . , CX k ) is a well-defined map from R to T . Consider another Y = (y1, y2, . . . , yk) R. If CX j = CY j for all j [k], due to the µ-strong convexity of fi s, we have yj = argmin z fi(z) = argmin z fi(z) = xj, j [k]. Thus, the map defined above is injective. Overall, R is a finite set. B. Algorithm details In this section, we provide the details of the algorithms presented in Section 3. B.1. Initialization with alternative scores When the score function v(j) i is taken as the squared gradient norm as in (14), the pseudo-code of the initialization can be found in Algorithm 4. B.2. Details on momentum Lloyd s Algorithm In this section, we elaborate on the details of momentum Lloyd s Algorithm 3. We use x(t) 1 , x(t) 2 , . . . , x(t) k as the k variables to be optimized. Correspondingly, we introduce m(t) 1 , m(t) 2 , . . . , m(t) k as their momentum. We use the same notation F (t) j in (16) as the group objective function. In each iteration, we update x using momentum gradient descent and update m using the gradient of the group function. x(t+1) j = x(t) j γm(t) j , m(t+1) j = βm(t) j + F (t+1) j (x(t+1) j ). The update of C(t) j in the momentum algorithm is different from the Lloyd s Algorithm 2. We introduce an acceleration quantity u(t+1) j = 1 1 β (x(t+1) j βx(t) j ). Each class is then renewed around the center u(t+1) j . We update index i [N] to the class C(t+1) j where fi(u(t+1) j ) attains the minimum value among all j [k]. To ensure the stability of the momentum accumulation, we further introduce a Efficient Algorithms for Sum-Of-Minimum Optimization Algorithm 4 Initialization 1: Sample i1 uniformly at random from [N] and compute x(0) 1 = x i1 = argmin x fi1(x). 2: for j = 2, 3, . . . 
, k do 3: Compute scores v(j) = v(j) 1 , v(j) 2 , . . . , v(j) N via v(j) i = min 1 j j 1 fi(x(0) j ) 2 . 4: Compute the sampling weights w(j) = w(j) 1 , . . . , w(j) N by normalizing {v(j) i }N i=1, w(j) i = v(j) i PN i =1 v(j) i . 5: Sample ij [N] according to the weights w(j) and compute x(0) j = x ij = argmin x fij(x). controlled reclassification method. We set a reclassification factor α > 1. We update C(t) j to C(t+1) j in the following way to ensure 1 α|C(t) j | |C(t+1) j | α|C(t) j |. The key idea is to carefully reclassify each index one by one until the size of one class breaks the above restriction. We construct Cj,0 = C(t) j , j [k] as the initialization of the reclassification. We randomly, non-repeatedly pick indices i from [N] one by one. For l looping from 1 to N, we let Cj,l 1, j [k] be the classification before the l-th random index is picked. Let il be the l-th index sampled. We reassign il to the j-th class, such that fil(u(t+1) j ) = min j [k] fil(u(t+1) j ). There will be at most two classes changed due to the one-index reassignment. We update the class notations from Cj ,l 1 to Cj ,l for all j [N]. If there is any change between Cj ,l 1 and Cj ,l, we check whether 1 α|C(t) j | |Cj | α|C(t) j | holds. If the above restriction holds for all j [N], we accept the reclassification and move on to the next index sample. Otherwise, we stop the process and return C(t+1) j = Cj,l 1, j [k]. If the reclassification trial successfully loops to the last index. We assign C(t+1) j = Cj,N, j [k]. C. Initialization error bounds In this section, we prove the error bounds of the initialization Algorithms 1 and 4. Before our proof, we prepare the following concepts and definitions. Definition C.1. For any nonempty C [N], we define i C x i x i 2. Efficient Algorithms for Sum-Of-Minimum Optimization Definition C.2. Let I [N] be an index set, M Rd be a finite set, we define A(I, M) = X i I min z M(fi(z) fi(x i )), D(I, M) = X i I min z M fi(z) 2. Under the µ-strong convexity and L-smooth Assumptions 2.1 and 2.2, we immediately have 1 2LD(I, M) A(I, M) 1 Besides, for disjoint index sets I1, I2, we have A(I1 I2, M) = A(I1, M) + A(I2, M), D(I1 I2, M) = D(I1, M) + D(I2, M). For the problem (1), the optimal solution exists due to the strong convexity assumption on fi s. We pick one set of optimal solutions z 1, z 2, . . . , z k. We let MOPT = {z 1, z 2, . . . , z k}. Based on this optimal solutions, we introduce (A1, A2, . . . , Ak) as a partition of [N]. Aj s are disjoint with each other and [ j [k] Aj = [N]. Besides, for all i Aj, fi(x) attains minimum at z j over MOPT, fi(z j) fi(x i ) = min j [k] fi(z j ) fi(x i ) . The choice of MOPT and (A1, A2, . . . , Ak) is not unique. We carefully choose them so that Aj are non-empty for each j [k]. Lemma C.3. Suppose that Assumption 2.1 holds. Let I be a nonempty index subset of [N] and let i be sampled uniformly at random from I. We have Ei A(I, {x i }) L Proof. We have the following direct inequality. Ei A(I, {x i }) = 1 i I A(I, {x i }) i I (fi (x i ) fi (x i )) 2 x i x i 2 Lemma C.4. Let M be a fixed finite set in Rd. For two indices i = i , we have A({i}, M) 2L µ A({i }, M) + L x i x i 2 Efficient Algorithms for Sum-Of-Minimum Optimization Proof. We have the following inequality. A({i}, M) = min z M (fi(z) fi(x i )) min z M L( z x i 2 + x i x i 2) µ min z M (fi (z) fi (x i )) + L x i x i 2 µ A({i }, M) + L x i x i 2. Lemma C.5. Given an index set I and a finite point set M, suppose that A(I, M) > 0. 
If we randomly sample an index i I with probability A({i},M) A(I,M) , then we have the following inequality, EA(I, M {x i }) L2 Proof. We consider the expectation of A(I, M {x i }) over i I. We have the following inequality bound. EA(I, M {x i }) = X A(I, M) A(I, M {x i }) i I min(A({i }, M), fi (x i ) fi (x i )) µ A({i }, M) + L x i x i 2 i I min(A({i }, M), fi (x i ) fi (x i )) i I min(A({i }, M), fi (x i ) fi (x i )) + L A(I, M)|I| i I x i x i 2 X i I min(A({i }, M), fi (x i ) fi (x i )) 2 x i x i 2 + L A(I, M)|I| i I x i x i 2 X i I A({i }, M) i I x i x i 2. Here, (a) holds when applying Lemma C.4. Lemma C.6. For any Al in the optimal partition (A1, A2, . . . , Ak), we have µA(Al, MOPT). Proof. We let yl = 1 |Al| P i Al x i be the geometric center of optimal fi solutions of index set Al. Al = 1 |Al| i Al x i x i 2 i Al x i yl + yl x i 2 Efficient Algorithms for Sum-Of-Minimum Optimization x i yl 2 + yl x i 2 i Al x i yl 2 i Al x i z 2 i Al (fi(z) fi(x i )) µ min z A(Al, {z}) µA(Al.MOPT). Proposition C.7. Let I be an index set, and M be a finite point set. Let z be a minimizer of the objective function P i I (fi(z) fi(x i )). Suppose that A(I, M) > 0. If we sample an index i I with probability A({i},M) A(I,M) , then we have the following inequality: EA(I, M {x i }) 4 L2 A(Al.MOPT). (29) Proof. The deduction of (29) is a direct combination of Lemma C.5 and Lemma C.6. Next we prove that the L2 µ2 bound in (29) is tight. Proposition C.8. Fix the dimension d 1, there exists an integer N. We can construct N µ-strongly convex and L-smooth sub-functions f1, f2, . . . , f N, and a finite set M Rd. We let {fi}N i=1 be the N sub-functions of the sum-of-minimum optimization problem (1). When we sample an index i [N] with probability A({i},M) A([N],M), we have EA([N], M {x i }) L2 µ2 A(Al, MOPT). Proof. For the cases where the dimension d 2, we construct the instance in a more concise way. We consider the following n + 1 points, x i = (1, 0, 0, . . . , 0) Rd, i = 1, 2, . . . , n, x n+1 = ( 1, 0, 0, . . . , 0) Rd. All the elements except the first one of x i are zero. We construct the following functions fi with minimizers x i . fi(y1, y2, . . . , yd) = L 2 (y1 1)2 + µ j=2 y2 j , i = 1, 2 . . . , n, fi(y1, y2, . . . , yd) = µ 2 (y1 + 1)2 + L j=2 y2 j , i = n + 1. We have f i := fi(x i ) = 0 for all i [n + 1]. We construct the finite set M in an orthogonal manner. We let M = {(0, ξ)}, ξ Rd 1 be a single point set. Besides, ξ = m 1. The point ξ = (0, ξ) in M is orthogonal to all x i s. Consider the expectation over the newly sampled index i, we have i =1 min(fi (ξ) fi (x i ), fi (x i ) fi (x i )) = n(L + µm2) n(L + µm2) + (µ + Lm2)2µ + µ + Lm2 n(L + µm2) + (µ + Lm2)2n L We set m = exp(n). As n , we have i =1 min(fi (ξ) fi (x i ), fi (x i ) fi (x i )) = 2µ + 2L2 Efficient Algorithms for Sum-Of-Minimum Optimization In the meanwhile, we have z := argmin z i =1 (fi (z) fi (x i )) = n L µ i =1 (fi (z ) fi (x i )) = 2µn L We have the following error rate: lim n E Pn+1 i =1 min(fi (ξ) fi (x i ), fi (x i ) fi (x i )) Pn+1 i =1(fi (z ) fi (x i )) = 1 + L2 As for the 1D case, we consider the following n + 1 points. We let x i = 1, i = 1, 2, . . . , n, and x n+1 = 0. We construct: 2 (x 1)2, x 1, 2 (x 1)2, x 1, i = 1, 2, . . . , n. 2 (x 1)2 + µ x 1 , x 1, i = n + 1. Each f i has the minimizer x i . Besides, f i := fi(x i ) = 0. We let M = {1 + L µ } be a single point set. Let ξ = 1 + L µ . We have fi(x n+1) f i = L 2 , i = 1, 2, . . . , n, fn+1(x i ) f n+1 = µ 2 , i = 1, 2, . . . , n, 2µ, i = 1, 2, . . . 
, n, f n+1 = L3 + 2µ2L + µ3 We have the following expectation: i =1 min(fi (ξ) fi (x i ), fi (x i ) fi (x i )) = n L2 2 + L3+2µ2L+µ3 2µ + L3+2µ2L+µ3 n 3 2µ + L2 Besides, we have the minimizer z = n L n L+µ of the objective function Pn+1 i=1 (fi(z) fi(x i )). We have i=1 (fi(z ) fi(x i )) = n Lµ 2(n L + µ) We have the following asymptotic error bound: lim n E Pn+1 i=1 min(fi(ξ) fi(x i ), fi(x i ) fi(x i )) Pn+1 i=1 (fi(z ) fi(x i )) = 3 + L2 Efficient Algorithms for Sum-Of-Minimum Optimization We remark that the orthogonal technique used in the construction of (30) can be applied in other lower bound constructions in the proofs of the initialization Algorithms 1 and 4 as well. Lemma C.9. We consider the sum-of-minimum optimization (1). Suppose that S is k-separate. Suppose that we have fixed indices i1, i2, . . . , ij. We define the finite set Mj = {x i1, x i2, . . . , x ij}. We define the index sets Lj = {l : Al {i1, i2, . . . , ij} = }, Lc j = {l : Al {i1, i2, . . . , ij} = }, Ij = l Lj Al, Ic j = l Lc j Al. Let u = |Lc j|. We sample t u new indices. We let M+ j,s = {x i1, x i2, . . . , x ij, x ij+1, . . . , x ij+s} for 0 s t. In each round of sampling, the probability of ij+s, s > 0, being sampled as i is A({i},M+ j,s 1) A([N],M+ j,s 1). Then we have the following bound, E A([N], M+ j,t) A(Ij, Mj) + 4 L2 A(Ic j , MOPT) (1 + Ht) + u t u A(Ic j , Mj). (31) Here, Ht = 1 + 1 t is the harmonic sum. Proof. We prove by induction on u = |Lc j| and t. We introduce the notation Φj(i) = A({i}, Mj) = min z Mj(fi(z) fi(x i )). We show that if (31) holds for the case (u 1, t 1) and (u, t 1), then it also holds for the case (u, t). We first prove two base cases. Case 1: t = 0, u > 0. E A([N], M+ j,t) = A([N], Mj) = A(Ij, Mj) + A(Ic j , Mj). Case 2: t = 1, u = 1. With probability A(Ij,Mj) A([N],Mj), the newly sampled index ij+1 will lie in Ij, and with probability A(Ic j ,Mj) A([N],Mj), it will lie in Ic j . We have bounds on the conditional expectation E A([N], M+ j,t) ij+1 Ij A([N], Mj), E A([N], M+ j,t) ij+1 Ic j = E A(Ij, M+ j,t) ij+1 Ic j + E A(Ic j , M+ j,t) ij+1 Ic j A(Ij, Mj) + X Φj(i ) P i Ic j Φj(i)A(Ic j , Mj {x i }) (a) A(Ij, Mj) + L2 (b) A(Ij, Mj) + 4 L2 A(Ic j , MOPT) Here, (a) holds when applying Lemma C.5. (b) holds since Ic j is identical to a certain Al as u = 1 and we apply Lemma C.6. Overall, we have the bound: EA([N], M+ j,t) = A(Ij, Mj) A([N], Mj)E A([N], M+ j,t) ij+1 Ij + A(Ic j , Mj) A([N], Mj)E A([N], M+ j,t) ij+1 Ic j A(Ij, Mj) + 4 L2 A(Ic j , MOPT) + A(Ij, Mj) = 2A(Ij, Mj) + 4 L2 A(Ic j , MOPT) Efficient Algorithms for Sum-Of-Minimum Optimization Next, we prove that the case (u, t) holds when the inequality holds for cases (u 1, t) and (u 1, t 1). With probability A(Ij,Mj) A([N],Mj), the first sampled index ij+1 will lie in Ij, and with probability A(Ic j ,Mj) A([N],Mj), it will lie in Ic j . Let We divide into two cases and compute the corresponding conditional expectations. For the case where ij+1 lies in Ij, we have the following bound on the conditional expectation. E A([N], M+ j,t) ij+1 Ij E A(Ij, Mj {x ij+1}) + αA(Ic j , MOPT) (1 + Ht 1) + u t + 1 u A(Ic j , Mj {x ij+1}) ij+1 Ij A(Ij, Mj) + αA(Ic j , MOPT) (1 + Ht 1) + u t + 1 u A(Ic j , Mj). 
For the case where ij+1 lies in Ic j , we have the following inequality: E A([N], M+ j,t) ij+1 Ic j P i Al Φj(i) P i Ic j Φj(i ) A(Ij Al, Mj {x i }) + αA(Ic j \Al, MOPT) (1 + Ht 1) u 1A(Ic j \Al, Mj {x i }) i Al Φj(i) P i Ic j Φj(i ) A(Ij, Mj) + A(Al, Mj {x i }) + α(A(Ic j , MOPT) A(Al, MOPT)) (1 + Ht 1) u 1 A(Ic j , Mj) A(Al, Mj) (a) A(Ij, Mj) + αA(Ic j , MOPT) (1 + Ht 1) + u t A(Ic j , Mj) X A(Ic j , Mj) (b) A(Ij, Mj) + αA(Ic j , MOPT) (1 + Ht 1) + u t A(Ic j , Mj) 1 u A(Ic j , Mj) = A(Ij, Mj) + αA(Ic j , MOPT) (1 + Ht 1) + u t u A(Ic j , Mj). Here, (a) holds when applying Lemma C.5 and Lemma C.6. (b) holds as l Lc j A(Al, Mj)2 1 l Lc j A(Al, Mj) u A(Ic j , Mj)2. Overall, we have the bound: EA([N], M+ j,t) = A(Ij, Mj) A([N], Mj)E A([N], M+ j,t) ij+1 Ij + A(Ic j , Mj) A([N], Mj)E A([N], M+ j,t) ij+1 Ic j A(Ij, Mj) + αA(Ic j , MOPT) (1 + Ht 1) + u t u A(Ic j , Mj) + 1 u A(Ic j , Mj)A(Ij, Mj) (a) A(Ij, Mj) + αA(Ic j , MOPT) (1 + Ht) + u t u A(Ic j , Mj). Here, (a) holds since u t and A(Ic j , Mj)A(Ij, Mj) A([N], Mj) A(Ij, Mj). The proof concludes. Efficient Algorithms for Sum-Of-Minimum Optimization Theorem C.10 (Restatement of Theorem 4.1). Suppose that the solution set S is k-separate. Let Minit = {x i1, x i2, . . . , x ik} be the initial points sampled by the random initialization Algorithm 1. We have the following bound: E A([N], Minit) 4(2 + ln k) L2 A([N], MOPT). (32) Proof. We start with a fixed index i1, let M1 = {x i1}. Suppose xi1 Al. Then we use Lemma C.9 with u = k 1, t = k 1. Let E A([N], M+ 1,k 1) (A(Al, M1) + αA([N]\Al, MOPT)) (1 + Hk 1) The term EA([N], M+ 1,k 1) can be regarded as the conditional expectation of A([N], Minit) given i1. EA([N], M+ 1,k 1) = E (A([N], Minit)|i1) According to Algorithm 1, the first index i1 is uniformly random in [N]. We take the expectation over i1 and get E A([N], Minit) 1 i Al (A(Al, {x i }) + αA([N]\Al, MOPT)) (1 + Hk 1) i Al A(Al, {x i }) + α A([N], MOPT) 1 l [k] |Al|A(Al, MOPT) l [k] |Al|L A([N], MOPT) 1 l [k] |Al|A(Al, MOPT) l [k] |Al|2L µ A(Al, MOPT) + α A([N], MOPT) 1 l [k] |Al|A(Al, MOPT) αA([N], MOPT)(1 + Hk 1) 4(2 + ln k) L2 A([N], MOPT). Here, (a) holds when applying Lemma C.3. (b) holds as a result of Lemma C.6. When we take fi(x) = 1 the optimization problem (1) reduces to the k-means problem, and Algorithm 1 reduces to the k-means++ algorithm. Therefore, according to (Arthur & Vassilvitskii, 2007), the bound given in Theorem C.10 is tight in ln k up to a constant. Next, we give a more detailed lower bound considering the conditioning number L Theorem C.11 (Restatement of Theorem 4.2). Given a fixed cluster number k > 0, there exists N > 0. We can construct N µ-strongly convex and L-smooth sub-functions {fi}N i=1, whose minimizer set S is k-separate. Besides, the sum-of-min objective function F satisfies that F > f , so that A([N], MOP T ) > 0. When we apply Algorithm 1 to sample the initial centers Minit, we have the following error bound: EA([N], Minit) 1 µ2 ln k A([N], MOPT). (33) Efficient Algorithms for Sum-Of-Minimum Optimization Proof. We construct the following problem. We fix the cluster number to be k. We let the dimension to be 2k. We pick the vertices of a k-simplex as the centers of k clusters. The k-simplex is embedded in a k 1 dimensional subspace. We let the first k elements of the vertices coordinates to be non-zero, while the other elements are zero. We denote the first k elements of the l-th vertex by ξ(l) Rk. We let the k-simplex be centered at the origin, so that the magnitudes ξ(l) s are the same. 
We let m be the edge length of the simplex. The functions in each cluster follows the orthogonal construction technique in (30). Specifically, in cluster l, we construct n + 1 functions mapping from R2k to R as fi,l(y) = µ 2 y1:d ξ(l) 2 + µ j k+1,j =k+l y2 j + L 2 (yk+l + 1)2, i = 1, 2, . . . , n, fi,l(y) = L 2 y1:d ξ(l) 2 + L j k+1,j =k+l y2 j + µ 2 (yk+l 1)2, i = n + 1. (34) We have a total of N = k(n + 1) sub-functions. We let m = exp(n), n 1, so that {fi,l}n+1 i=1 will be assigned in the same cluster when computing the minimizer of the objective function F. We let el Rk be the l-th unit vector, then the minimizers of the above sub-functions are x i,l = [ξ(l); el](i = 1, 2, . . . , n) and x n+1,l = [ξ(l); el]. We let S be the set of all the minimizers {x1,l}k l=1 {xn+1,l}k l=1. For each cluster l, we can compute i=1 (fi,l(y) f i,l) = 2n Lµ Thus, we have A([N], MOPT) = k 2n Lµ Let M be a nonempty subset of S . We study the optimality gap of F when sampling the new centers based on M. We divide the k clusters into 4 classes as follows: Ca = {l | x 1,l M, x n+1,l M}, Cb = {l | x n+1,l M, x 1,l M}, Cf = {l | x 1,l M, x n+1,l M}, Cu = {l | x 1,l M, x n+1,l M}. We define a = |Ca|, b = |Cb|, u = |Cu|. Consider M as the existing centers, we continue sampling t u new centers using Algorithm 1. Let w 1, w 2, . . . , w t be the newly sampled centers. We define the quantity ϕa,b,u,t = EA([N], M {w 1, w 2, . . . , w t }), which is the expected optimality gap after sampling. We will prove by induction that ϕa,b,u,t αt+1 1 2 n µm2 + µ + L + Lm2 + L + µ (u t) + (2n Lb + 2µa)(1 + Hu) + 2L2 (35) Here Hu is the harmonic series. Gu is recursively defined as: G0 = 0, Gu Gu 1 = β(1 + Hu 1). The parameter 0 < α, β < 1 are chosen as m, β = 1 1 n. We denote the right0hand side of (35) as αt+1φa,b,u,t. We consider the case where t = 0, we have ϕa,b,u,0 = 1 2 n µm2 + µ + L + Lm2 + L + µ u + 2n Lb + 2µa. Efficient Algorithms for Sum-Of-Minimum Optimization In the mean while, φa,b,u,0 = 1 2 n µm2 + µ + L + Lm2 + L + µ u + (2n Lb + 2µa)(1 + Hu) + 2L2 If u = 0, we have ϕa,b,0,0 = φa,b,0,0 αφa,b,0,0. If u 1, then 1 2 n µm2 + µ + L + Lm2 + L + µ u becomes the leading term, 2 n µm2 + µ + L + Lm2 + L + µ u 1 (2n Lb + 2µa)(1 + Hu) + 2L2 α (2n Lb + 2µa)(1 + Hu) + 2L2 Rearrange the left-hand side and the right-hand side of the inequality, we have: 1 2 n µm2 + µ + L + Lm2 + L + µ u 2 n µm2 + µ + L + Lm2 + L + µ u + (2n Lb + 2µa)(1 + Hu) + 2L2 = αφa,b,u,0. Therefore, we have ϕa,b,u,0 αφa,b,u,0. Next, we induct on t. When t 1, we have u 1. We use the one-step transfer technique. 
We let 2 n µm2 + µ + L + Lm2 + L + µ u + 2µa + 2nb L, 2 n µm2 + µ + L + Lm2 + L + µ , =n(µm2 + µ + L)u 2K ϕa+1,b,u 1,t 1 + (Lm2 + L + µ)u 2K ϕa,b+1,u 1,t 1 + 2n Lb K ϕa,b 1,u,t 1 + 2µa K ϕa 1,b,u,t 1 n(µm2 + µ + L)u 2K αt [A(u t) + (2n Lb + 2µa + 2µ)(1 + Hu 1) + BGu 1] + (Lm2 + L + µ)u 2K αt [A(u t) + (2n Lb + 2µa + 2n L)(1 + Hu 1) + BGu 1] K αt [A(u t + 1) + (2n Lb + 2µa 2n L)(1 + Hu) + BGu] K αt [A(u t + 1) + (2n Lb + 2µa 2µ)(1 + Hu) + BGu] =n(µm2 + µ + L)u 2K αtφa,b,u,t + n(µm2 + µ + L)u 2K αt [2µ(1 + Hu 1) + (2n Lb + 2µa)(Hu 1 Hu) + B(Gu 1 Gu)] + (Lm2 + L + µ)u 2K αtφa,b,u,t Efficient Algorithms for Sum-Of-Minimum Optimization + (Lm2 + L + µ)u 2K αt [2n L(1 + Hu 1) + (2n Lb + 2µa)(Hu 1 Hu) + B(Gu 1 Gu)] K αtφa,b,u,t + 2n Lb K αt(A 2n L(1 + Hu)) + 2µa K αtφa,b,u,t + 2µa K αt(A 2µ(1 + Hu)) =αtφa,b,u,t + 1 2n(µm2 + µ + L)u 2µ β 2µ + 2L2 K αt (Lm2 + L + µ)un L(1 + Hu 1) 1 2(Lm2 + L + µ)uβB(1 + Hu 1) 4µ2a(1 + Hu) 4n2L2b(1 + Hu) =αtφa,b,u,t + 1 2 n(µm2 + µ + L) + (Lm2 + L + µ) (2n Lb + 2µa) + A(2n Lb + 2µa) 2n(µm2 + µ + L)u 2L2 µ + 1 n(2µ + 2L2 µ ) (1 + Hu 1) + (Lm2 + L + µ)un L(1 + Hu 1) 2(Lm2 + L + µ)uβB(1 + Hu 1) 4µ2a(1 + Hu) 4n2L2b(1 + Hu) =αtφa,b,u,t + 1 K αt nm2u(1 + Hu 1)(µ2 + L2) K αt n(µ + L)u L L2 (1 + Hu 1) 1 2(Lm2 + L + µ)uβB(1 + Hu 1) 4µ2a(1 + Hu) 4n2L2b(1 + Hu) (a) αtφa,b,u,t αt+1φa,b,u,t. For (a), we have nm2u(1 + Hu 1)(µ2 + L2) n(µ + L)u L L2 2(Lm2 + L + µ)uβB(1 + Hu 1) 4µ2a(1 + Hu) 4n2L2b(1 + Hu). when m = exp(n) and n 1. Thus the inequality (35) holds. Let u = t = k 1. We have ϕa,b,k 1,k 1 αk (2n Lb + 2µa)(1 + Hk 1) + 2L2 µ + 2µ Gk 1 Let n 100k2. Since m = exp(n) 100k2, then 4, β = 1 1 10k 9 ϕa,b,t 1,t 1 3 µ + 2µ Gk 1. We have the following inequalities: Hk 1 = 1 + 1 2 + + 1 k 1 Z k t dt = ln k, k 1, j=0 (1 + Hj) β t=1 ln t dt = β(k ln k + 1). Efficient Algorithms for Sum-Of-Minimum Optimization Therefore, we have EA([N], Minit) 1 2k ln k 2L2 µ + 2µ = k ln k L2 In the meanwhile, we have an upper bound estimate for A([N], MOPT). We pick Mξ = {[ξ(l); el]}k l=1 as the centers. We have A([N], MOPT) A([N], Mξ) = 2kµ. EA([N], Minit) 1 µ2 A([N], MOPT). We prove two different error bounds when the estimate of fi(z) fi(x i ) is not accurate. We consider the additive and multiplicative errors on the oracle fi(z) fi(x i ). In Algorithm 1, when computing the score v(j) i , we suppose we do not have the exact f i , instead, we have an estimate f i , such that | f i f i | ϵ for a certain error factor ϵ > 0. We define A(I, M) = X i I max min z M fi(z) f i , 0 = X i I min z M max fi(z) f i , 0 . Lemma C.12. Let I be an index set, and M be a finite point set. Suppose that A(I, M) > 0. We sample an index i I with probability A({i},M) A(I,M) , then we have the following inequality: E A(I, M {x i }) |I| 1 + 4L i I (fi(z) fi(x i )). (36) Proof. We have A({i}, M) = max min z M(fi(z) f i ), 0 ϵ + min z M(fi(z) f i ) 2 min z M z x i 2 ϵ + L x i x i 2 + L min z M z x i 2 ϵ + L x i x i 2 + 2L µ min z M(fi (z) fi (x i )) ϵ + L x i x i 2 + 2L µ min z M(fi (z) f i ) ϵ + L x i x i 2 + 2L µ max min z M(fi (z) f i ), 0 ϵ + L x i x i 2 + 2L µ A({i }, M). E A(I, M {x i }) A(I, M) A(I, M {x i }) Efficient Algorithms for Sum-Of-Minimum Optimization i I A({i }, M {x i }) i I min( A({i }, M), max(fi (x i ) f i , 0)) i I n 1 + 2L µ ϵ + L x i x i 2 + 2L µ A({i }, M) o i I min( A({i }, M), max(fi (x i ) f i , 0)) i I x i x i 2 + 2L i I max(fi (x i ) f i , 0) i I x i x i 2 =|I| 1 + 4L ϵ + 2 L + L2 i I x i z 2 i I (fi(z) fi(x i )). Lemma C.13. Suppose that we have fixed indices i1, i2, . . . , ij. 
We define the finite set Mj = {x i1, x i2, . . . , x ij}. We define the index sets Lj = {l : Al {i1, i2, . . . , ij} = }, Lc j = {l : Al {i1, i2, . . . , ij} = }, Ij = l Lj Al, Ic j = l Lc j Al. Let u = |Lc j|. Suppose that u > 0. We sample t u new indices. We let M+ j,s = {x i1, x i2, . . . , x ij, x ij+1, . . . , x ij+s} for 0 s t. In each round of sampling, the probability of ij+s, s > 0, being sampled as i is A({i},M+ j,s 1) A([N],M+ j,s 1). Then we have the following bound: E A([N], M+ j,t) (1 + Ht) A(Ij, Mj) + |Ic j | 1 + 4L A(Ic j , MOPT) + u t u A(Ic j , Mj). Proof. The key idea of the proof is similar to Lemma C.9. We let µ , β = 4 L2 We prove by induction. When t = 0, the inequality obviously holds. When t > 0, u = 1, we have the inequality: E A([N], M+ j,t) A(Ij, Mj) A([N], Mj) A([N], Mj) + A(Ic j , Mj) A([N], Mj) A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) + A(Ic j , Mj). For the general (t, u) case, E A([N], M+ j,t) can be bounded by two parts. With probability A(Ij,Mj) A([N],Mj), the first sampled index lies in Ij, and the conditional expectation is bounded by: (1 + Ht 1) h A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) i + u t + 1 u A(Ic j , Mj). With probability A(Ic j ,Mj) A([N],Mj), the first sampled index lies in Ic j . The conditional expectation is bounded by: A(Ic j , Mj) n (1 + Ht 1) A(Ij Al, Mj {x i }) + |Ic j \Al|αϵ + βA(Ic j \Al, MOPT) Efficient Algorithms for Sum-Of-Minimum Optimization u 1 A(Ic j \Al, Mj {x i }) A(Ic j , Mj) n (1 + Ht 1) A(Ij, Mj) + A(Al, Mj {x i }) + |Ic j \Al|αϵ +βA(Ic j , MOPT) βA(Al, MOPT) + u t u 1( A(Ic j , Mj) A(Al, Mj)) (1 + Ht 1) A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) + u t u A(Ic j , Mj). Overall, we have the following inequality: E A([N], M+ j,t) A(Ij, Mj) (1 + Ht 1) h A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) i + u t + 1 u A(Ic j , Mj) + A(Ic j , Mj) A([N], Mj) (1 + Ht 1) A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) + u t u A(Ic j , Mj) (1 + Ht) A(Ij, Mj) + |Ic j |αϵ + βA(Ic j , MOPT) + u t u A(Ic j , Mj). Theorem C.14 (Restatement of Theorem 4.3). Suppose that the solution set S is (k, q µ )-separate. Let Minit = {x i1, x i2, . . . , x ik} be the initial points sampled by the random initialization Algorithm 1 with noisy oracles f i . We have the following bound: 1 N EA([N], Minit) ϵ + (2 + ln k) 1 + 4L ϵ + 4(2 + ln k) L2 N A([N], MOPT). Proof. The proof is similar to that of Theorem C.10. We let µ , β = 4 L2 We fix the first index i1. Suppose that i1 lies in Al, we have E A([N], M+ 1,k 1) A(Al, {x i1}) + |[N]\Al|αϵ + βA([N]\Al, MOPT) (1 + Hk 1). E A([N], Minit) A(Al, {x i }) + Nαϵ 1 l [k] |Al|2αϵ A([N], MOPT) 1 l [k] |Al|A(Al, MOPT) l [k] |Al|2αϵ A([N], MOPT) 1 l [k] |Al|A(Al, MOPT) |Al|2ϵ + 2L µ |Al|A(Al, MOPT) + Nαϵ 1 l [k] |Al|2αϵ Efficient Algorithms for Sum-Of-Minimum Optimization A([N], MOPT) 1 l [k] |Al|A(Al, MOPT) (Nαϵ + βA([N], MOPT)) (1 + Hk 1) (2 + ln k) 1 + 4L Nϵ + 4(2 + ln k) L2 A([N], MOPT). Here, (a) holds when applying Lemma C.3. (b) holds when applying 2 Al = L min z i Al x i z 2 2L i Al (fi(z) f i ) = 2L µ A(Al, M(D) OPT). Therefore, we have EA([N], Minit) Nϵ + (2 + ln k) 1 + 4L Nϵ + 4(2 + ln k) L2 A([N], MOPT), 1 N EA([N], Minit) ϵ + (2 + ln k) 1 + 4L ϵ + 4(2 + ln k) L2 N A([N], MOPT). The proof of Theorem 4.4 is similar to the proof of Theorem 4.3, we skip the details here. D. Convergence of Lloyd s algorithm In this section, we provide a convergence analysis for Algorithms 2 and 3. Theorem D.1 (Restatement of Theorem 4.6). In Algorithm 2, we take the step size γ = 1 L. 
If fi are L-smooth, we have the following convergence result: |C(t) j | N F (t) j (x(t) j ) 2 2L T + 1 F(x(0) 1 , x(0) 2 , . . . , x(0) k ) F . Here, F is the minimum of F. Proof. According to the L-smoothness assumption on fi, F (t) j is also L-smooth, which implies that F (t) j (x(t+1) j ) F (t) j (x(t) j ) + F (t) j (x(t) j ), x(t+1) j x(t) j + L 2 x(t+1) j x(t) j 2 = F (t) j (x(t) j ) 1 2L F (t) j (x(t) j ) 2, 1 2L F (t) j (x(t) j ) 2 F (t) j (x(t) j ) F (t) j (x(t+1) j ), F (t) j (x(t) j ) 2 2L F (t) j (x(t) j ) F (t) j (x(t+1) j ) . Averaging over F (t) j (x(t) j ) 2 with weights |C(t) j |/N, we have |C(t) j | N F (t) j (x(t) j ) 2 2L F(x(t) 1 , x(t) 2 , . . . , x(t) k ) F(x(t+1) 1 , x(t+1) 2 , . . . , x(t+1) k ) . Averaging over t from 0 to T, we have |C(t) j | N F (t) j (x(t) j ) 2 2L T + 1 F(x(0) 1 , x(0) 2 , . . . , x(0) k ) F . Efficient Algorithms for Sum-Of-Minimum Optimization Next, we present a convergence theorem for the momentum algorithm. For simplification, we use the notation U(t) = (u(t) 1 , u(t) 2 , . . . , u(t) k ). We have the following convergence theorem: Theorem D.2 (Restatement of Theorem 4.7). Consider Algorithm 3. Suppose that Assumption 2.1 holds, α > 1, and 2L , (1 β) 3 2 (1 αβ) 1 2 Then it holds that |C(t) j | N F (t) j (x(t) j ) 2 2(1 β) γ F(x(0) 1 , x(0) 2 , . . . , x(0) k ) F Proof. The variable u(t) j satisfies the following property, u(t+1) j u(t) j = 1 1 β (x(t+1) j x(t) j ) β(x(t) j x(t 1) j ) m(t) j βm(t 1) j = γ 1 β F (t) j (x(t) j ). We have the following inequality: F (t) j (u(t+1) j ) F (t) j (u(t) j ) + F (t) j (u(t) j ), u(t+1) j u(t) j + L 2 u(t+1) j u(t) j 2 = F (t) j (u(t) j ) γ 1 β F (t) j (u(t) j ), F (t) j (x(t) j ) + L (1 β)2 F (t) j (x(t) j ) 2 = F (t) j (u(t) j ) γ 1 β F (t) j (u(t) j ) F (t) j (x(t) j ), F (t) j (x(t) j ) (1 β)2 γ 1 β F (t) j (x(t) j ) 2 F (t) j (u(t) j ) + L (1 β)2 γ 1 β F (t) j (x(t) j ) 2 + γ 1 β ϵ 2 F (t) j (u(t) j ) F (t) j (x(t) j ) 2 + γ 1 β 1 2ϵ F (t) j (x(t) j ) 2 F (t) j (u(t) j ) + L (1 β)2 γ 1 β + 1 F (t) j (x(t) j ) 2 + ϵ (1 β)3 m(t 1) j 2. Rearranging the inequality, we have γ 1 β L F (t) j (x(t) j ) 2 F (t) j (u(t) j ) F (t) j (u(t+1) j ) + ϵ (1 β)3 m(t 1) j 2. We sum over j = 1, 2, . . . , k with weights |Cj| |C(t) j | N F (t) j (x(t) j ) 2 |C(t) j | N F (t) j (u(t+1) j ) + ϵ |C(t) j | N m(t 1) j 2. Efficient Algorithms for Sum-Of-Minimum Optimization |C(t) j | N F (t) j (u(t+1) j ) = 1 fi(u(t+1) j ) F(U(t+1)), |C(t) j | N F (t) j (x(t) j ) 2 F(U(t)) F(U(t+1)) + ϵ |C(t 1) j | N m(t 1) j 2. Summing both sides from t = 1 to T, then dividing both sides by T, we have |C(t) j | N F (t) j (x(t) j ) 2 F(U(1)) F(U(T +1)) |C(t 1) j | N m(t 1) j 2. Now, we consider the average term 1 T PT t=1 |C(t) j | N m(t) j 2. For m(t) j , we have m(t) j = βm(t 1) j + F (t) j (x(t) j ) = βtm(0) j + l=0 βl F (t l) j (x(t l) j ) l=1 βt l F (l) j (x(l) j ). We have the following bound on the squared norm of m(t) j : l=1 βt l F (l) j (x(l) j ) s=1 βt s !2 βt l Pt s=1 βt s F (l) j (x(l) j ) s=1 βt s !2 t X βt l Pt s=1 βt s F (l) j (x(l) j ) 2 l=1 βt l F (l) j (x(l) j ) 2 . Here, (a) applies as an instance of Jensen s inequality. Averaging the above inequality over t = 1, 2, . . . 
, T, we obtain
$$\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_j^{(t)}|}{N}\big\|m_j^{(t)}\big\|^2 \le \frac{1}{1-\beta}\cdot\frac{1}{N}\cdot\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\sum_{l=1}^{t}|C_j^{(t)}|\,\beta^{t-l}\big\|\nabla F_j^{(l)}(x_j^{(l)})\big\|^2$$
$$=\frac{1}{T}\cdot\frac{1}{N}\cdot\frac{1}{1-\beta}\sum_{l=1}^{T}\sum_{j=1}^{k}\sum_{t=l}^{T}|C_j^{(t)}|\,\beta^{t-l}\big\|\nabla F_j^{(l)}(x_j^{(l)})\big\|^2$$
$$\le\frac{1}{T}\cdot\frac{1}{N}\cdot\frac{1}{1-\beta}\sum_{l=1}^{T}\sum_{j=1}^{k}\sum_{t=l}^{T}|C_j^{(l)}|\,\alpha^{t-l}\beta^{t-l}\big\|\nabla F_j^{(l)}(x_j^{(l)})\big\|^2$$
$$\le\frac{1}{T}\cdot\frac{1}{1-\beta}\cdot\frac{1}{1-\alpha\beta}\sum_{l=1}^{T}\sum_{j=1}^{k}\frac{|C_j^{(l)}|}{N}\big\|\nabla F_j^{(l)}(x_j^{(l)})\big\|^2.$$

Substituting the above inequality back into (38), we obtain

(1 β)4(1 αβ) |C(t) j | N F (t) j (x(t) j ) 2 F(U(1)) F(U(T +1))

We take ϵ = (1 β) 3 2 (1 αβ) 1 2 and rearrange the above inequality. Thus, we have

(1 β)2 α 1 2 Lβγ2 2(1 β) 5 2 (1 αβ) 1 2 |C(t) j | N F (t) j (x(t) j ) 2 F(U(1)) F(U(T +1))

Since we initialize m_j^{(0)} = 0, we have x_j^{(1)} = x_j^{(0)} − γ m_j^{(0)} = x_j^{(0)} and u_j^{(1)} = x_j^{(0)}. Besides, since F(U^{(T+1)}) ≥ F^*, we have

(1 β)2 α 1 2 Lβγ2 2(1 β) 5 2 (1 αβ) 1 2 |C(t) j | N F (t) j (x(t) j ) 2 F(x(0) 1 , x(0) 2 , . . . , x(0) k ) F

Since γ satisfies the step-size condition in the statement of the theorem, we have
$$\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_j^{(t)}|}{N}\big\|\nabla F_j^{(t)}(x_j^{(t)})\big\|^2 \;\le\; \frac{2(1-\beta)}{\gamma}\cdot\frac{F(x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}) - F^*}{T}.$$

E. Supplementary experiment details

In this section, we provide details on the experiments described in Section 5.

E.1. Supplementary details for Section 5.1

We elaborate on the generation of the synthetic data for the GPCA experiment in Section 5.1. First, we generate k pairs of orthonormal vectors {ϵ_{1,j}, ϵ_{2,j}}, j = 1, 2, . . . , k, each pair drawn uniformly at random; ϵ_{1,j} and ϵ_{2,j} form a basis of the j-th subspace. For each data point i ∈ [N], we independently generate two Gaussian samples ξ_{1,i}, ξ_{2,i} and sample an index j_i ∈ [k] uniformly at random. We then set the data point to y_i = ξ_{1,i} ϵ_{1,j_i} + ξ_{2,i} ϵ_{2,j_i}.

We provide in Algorithm 5 detailed pseudo-code of Lloyd's algorithm for solving the GPCA problem in the sum-of-minimum formulation (5). Each iteration consists of two steps: updating the clusters via (15) and exactly computing the minimizer of each group objective function,
$$\min_{A_j^\top A_j = I_r}\ \frac{1}{|C_j^{(t)}|}\sum_{i\in C_j^{(t)}} \frac{1}{2}\big\|y_i^\top A_j\big\|^2 \;=\; \min_{A_j^\top A_j = I_r}\ \mathrm{tr}\Big(A_j^\top\Big(\frac{1}{2|C_j^{(t)}|}\sum_{i\in C_j^{(t)}} y_i y_i^\top\Big)A_j\Big),$$
which is attained by the r orthonormal eigenvectors associated with the r smallest eigenvalues of the matrix inside the trace.

Algorithm 5 Lloyd's Algorithm for generalized principal component analysis
1: Initialize A_1^{(0)}, A_2^{(0)}, . . . , A_k^{(0)}. Set F^{(-1)} = +∞.
2: for t = 0, 1, 2, . . . , max iterations do
3: Compute F^{(t)} = F(A_1^{(t)}, A_2^{(t)}, . . . , A_k^{(t)}).
4: if F^{(t)} = F^{(t-1)} then
5: Break.
6: end if
7: Compute the partition {C_j^{(t)}}_{j=1}^{k} via (15).
8: for j = 1, 2, . . . , k do
9: if C_j^{(t)} ≠ ∅ then
10: Compute the matrix (1/|C_j^{(t)}|) Σ_{i∈C_j^{(t)}} y_i y_i^⊤ and its r orthonormal eigenvectors v_1, v_2, . . . , v_r corresponding to the r smallest eigenvalues.
11: Set A_j^{(t+1)} = [v_1, v_2, . . . , v_r].
12: else
13: A_j^{(t+1)} = A_j^{(t)}.
14: end if
15: end for
16: end for

We implement the BCD algorithm (Peng & Vidal, 2023) for the following optimization problem:
$$\min_{A_j^\top A_j = I_r,\ j\in[k]}\ \frac{1}{N}\sum_{i=1}^{N}\ \prod_{j\in[k]} \big\|y_i^\top A_j\big\|^2. \qquad (39)$$
For any j ∈ [k], when A_l is fixed for all l ∈ [k]\{j}, the problem in (39) is equivalent to
$$\min_{A_j^\top A_j = I_r}\ \frac{1}{N}\sum_{i=1}^{N} w_{ij}\,\big\|y_i^\top A_j\big\|^2 \;=\; \min_{A_j^\top A_j = I_r}\ \mathrm{tr}\Big(A_j^\top\Big(\frac{1}{N}\sum_{i=1}^{N} w_{ij}\, y_i y_i^\top\Big)A_j\Big),$$
where the weights are given by
$$w_{ij} = \prod_{l\neq j} \big\|y_i^\top A_l\big\|^2.$$
The detailed pseudo-code can be found in Algorithm 6.

Algorithm 6 Block coordinate descent for generalized principal component analysis (Peng & Vidal, 2023)
1: Initialize A_1^{(0)}, A_2^{(0)}, . . . , A_k^{(0)}.
2: for t = 0, 1, 2, . . . , max iterations do
3: for j = 1, 2, . . . , k do
4: Compute the weights w_{ij}^{(t)} = Π_{l≠j} ||y_i^⊤ A_l^{(t)}||² for i = 1, 2, . . . , N.
5: Compute the matrix (1/N) Σ_{i=1}^{N} w_{ij}^{(t)} y_i y_i^⊤ and its r orthonormal eigenvectors v_1, v_2, . . . , v_r corresponding to the r smallest eigenvalues.
6: Set A_j^{(t+1)} = [v_1, v_2, . . . , v_r].
7: end for
8: end for
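To make the above pipeline concrete, here is a minimal NumPy sketch of the synthetic union-of-subspaces data generation and of one reassignment-and-refit iteration of Algorithm 5. This is a sketch under our own conventions rather than the authors' released code: the helper names generate_gpca_data and lloyd_gpca_step are ours, the random bases are drawn via a QR factorization of a Gaussian matrix, and for data generated this way the codimension used in (5) would be r = d − 2.

```python
import numpy as np

def generate_gpca_data(N, k, d, rng):
    """Synthetic GPCA data: N points drawn from k random 2-dimensional subspaces of R^d."""
    bases = []
    for _ in range(k):
        # orthonormal pair (eps_{1,j}, eps_{2,j}) spanning the j-th subspace
        Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
        bases.append(Q)
    labels = rng.integers(k, size=N)          # j_i ~ Uniform([k])
    xi = rng.standard_normal((N, 2))          # Gaussian coefficients (xi_{1,i}, xi_{2,i})
    Y = np.stack([bases[j] @ xi[i] for i, j in enumerate(labels)])
    return Y, labels

def lloyd_gpca_step(Y, A_list):
    """One iteration of Algorithm 5: assign each point to the subspace with the smallest
    residual 0.5*||y_i^T A_j||^2, then refit each A_j from the eigenvectors of the
    cluster covariance corresponding to its r smallest eigenvalues."""
    k, r = len(A_list), A_list[0].shape[1]
    residuals = np.stack([0.5 * np.sum((Y @ A) ** 2, axis=1) for A in A_list], axis=1)
    assign = residuals.argmin(axis=1)         # cluster assignment as in (15)
    new_A = []
    for j in range(k):
        idx = np.where(assign == j)[0]
        if idx.size == 0:
            new_A.append(A_list[j])           # keep A_j unchanged for an empty cluster
            continue
        M = Y[idx].T @ Y[idx] / idx.size      # (1/|C_j|) sum_{i in C_j} y_i y_i^T
        _, vecs = np.linalg.eigh(M)           # eigenvalues returned in ascending order
        new_A.append(vecs[:, :r])             # bottom-r eigenvectors
    return new_A, assign
```

Iterating lloyd_gpca_step from random orthonormal initializations until the objective stops decreasing mirrors the loop of Algorithm 5; the BCD variant in Algorithm 6 differs only in that it replaces the hard assignment by the multiplicative weights w_{ij}.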
E.2. Supplementary details for Section 5.2

Mixed linear regression. Here, we provide the detailed pseudo-code for Lloyd's algorithm used to solve the ℓ2-regularized mixed linear regression problem in Section 5. Each iteration of the algorithm consists of two steps: reclassification and cluster-parameter update. We alternately reclassify the indices i into C_j^{(t)} using (15) and, for every nonempty cluster C_j^{(t)}, update the cluster parameter via
$$x_j^{(t+1)} = \Big(\sum_{i \in C_j^{(t)}} a_i a_i^\top + \lambda\,|C_j^{(t)}|\,I\Big)^{-1} \sum_{i \in C_j^{(t)}} b_i\, a_i, \qquad (40)$$
so that x_j^{(t+1)} is exactly the minimizer of the group objective function. The algorithm terminates once F^{(t)} stops decreasing after the update of the x_j^{(t)}'s, or when the maximum number of iterations is reached. The pseudo-code is shown in Algorithm 7.

Algorithm 7 Lloyd's Algorithm for mixed linear regression
1: Initialize x_1^{(0)}, x_2^{(0)}, . . . , x_k^{(0)}. Set F^{(-1)} = +∞.
2: for t = 0, 1, 2, . . . , max iterations do
3: Compute F^{(t)} = F(x_1^{(t)}, x_2^{(t)}, . . . , x_k^{(t)}).
4: if F^{(t)} = F^{(t-1)} then
5: Break.
6: end if
7: Compute the partition {C_j^{(t)}}_{j=1}^{k} via (15).
8: for j = 1, 2, . . . , k do
9: if C_j^{(t)} ≠ ∅ then
10: Compute x_j^{(t+1)} using (40).
11: else
12: x_j^{(t+1)} = x_j^{(t)}.
13: end if
14: end for
15: end for

The dataset {(a_i, b_i)}_{i=1}^{N} for the ℓ2-regularized mixed linear regression is synthetically generated in the following way: fix the dimension d and the number of function clusters k, and sample x_1^+, x_2^+, . . . , x_k^+ i.i.d. from N(0, I_d) as the linear coefficients of the k ground-truth regression models. For i = 1, 2, . . . , N, we independently generate the data a_i ∼ N(0, I_d), the class index c_i ∼ Uniform([k]), and the noise ϵ_i ∼ N(0, σ²), and compute b_i = a_i^⊤ x_{c_i}^+ + ϵ_i. In the experiment, the noise level is set to σ = 0.01 and the regularization factor is set to λ = 0.01.

Mixed non-linear regression. The ground-truth parameters θ_j^+ are sampled from a standard Gaussian distribution. The dataset {(a_i, b_i)}_{i=1}^{N} is generated in the same way as in the mixed linear regression experiment. We set the variance of the Gaussian noise on the dataset to σ² = 0.01² and use a regularization factor λ = 0.01.
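As a companion to Algorithm 7, the following is a small NumPy sketch of the synthetic data generation and of one reclassification-and-update iteration, with the per-cluster ridge solve implementing (40). The helper names (generate_mlr_data, lloyd_mlr_step) and the array layout are illustrative assumptions rather than the authors' code; the sketch assumes the regularizer is included in each sub-function f_i, which is consistent with (40), and the defaults σ = 0.01 and λ = 0.01 match the values reported above.

```python
import numpy as np

def generate_mlr_data(N, k, d, sigma=0.01, rng=None):
    """Synthetic mixed-linear-regression data: b_i = a_i^T x_{c_i}^+ + eps_i."""
    rng = np.random.default_rng() if rng is None else rng
    X_true = rng.standard_normal((k, d))            # ground-truth coefficients x_j^+
    A = rng.standard_normal((N, d))                 # a_i ~ N(0, I_d)
    c = rng.integers(k, size=N)                     # c_i ~ Uniform([k])
    b = np.einsum("nd,nd->n", A, X_true[c]) + sigma * rng.standard_normal(N)
    return A, b, c

def lloyd_mlr_step(A, b, X, lam=0.01):
    """One iteration of Algorithm 7: reassign each point to its best-fitting model,
    then apply the regularized least-squares update (40) on every nonempty cluster."""
    N, d = A.shape
    k = X.shape[0]
    # f_i(x_j) = 0.5*(a_i^T x_j - b_i)^2 + 0.5*lam*||x_j||^2, evaluated for every j
    losses = 0.5 * (A @ X.T - b[:, None]) ** 2 + 0.5 * lam * np.sum(X ** 2, axis=1)
    assign = losses.argmin(axis=1)                  # cluster assignment as in (15)
    X_new = X.copy()
    for j in range(k):
        idx = np.where(assign == j)[0]
        if idx.size == 0:
            continue                                # keep x_j unchanged for an empty cluster
        Aj, bj = A[idx], b[idx]
        # solve (sum_i a_i a_i^T + lam*|C_j|*I) x_j = sum_i b_i a_i, i.e. update (40)
        X_new[j] = np.linalg.solve(Aj.T @ Aj + lam * idx.size * np.eye(d), Aj.T @ bj)
    return X_new, assign
```

A full run alternates lloyd_mlr_step with an evaluation of F^{(t)} and stops once the objective no longer decreases, exactly as in the loop of Algorithm 7.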