# Subset Selection by Pareto Optimization with Recombination

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Chao Qian,¹ Chao Bian,¹ Chao Feng²
¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
²School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
qianc@lamda.nju.edu.cn, chaobian12@gmail.com, chaofeng@mail.ustc.edu.cn
This work was supported by the NSFC (61603367).

## Abstract

Subset selection, i.e., selecting a limited number of items to optimize some given objective function, is a fundamental problem with various applications such as unsupervised feature selection and sparse regression. By employing a multi-objective evolutionary algorithm (EA) with mutation only to optimize the given objective function and minimize the number of selected items simultaneously, the recently proposed POSS algorithm achieves state-of-the-art performance for subset selection. In this paper, we propose the PORSS algorithm by incorporating recombination, a characterizing feature of EAs, into POSS. We prove that PORSS can achieve the same optimal polynomial-time approximation guarantee as POSS when the objective function is monotone, and can find an optimal solution efficiently in some cases where POSS cannot. Extensive experiments on unsupervised feature selection and sparse regression show the superiority of PORSS over POSS. Our analysis also theoretically discloses that recombination from diverse solutions can be more likely than mutation alone to generate various variations, thereby leading to better exploration; this may be of independent interest for understanding the influence of recombination.

## Introduction

This paper considers a general problem, i.e., subset selection, which is to select a subset of size at most $k$ from a total set of $n$ items so as to maximize (or minimize) some given objective function $f$. This problem arises in various real-world applications, such as maximum coverage (Feige 1998), sparse regression (Miller 2002), influence maximization (Kempe, Kleinberg, and Tardos 2003), sensor placement (Krause, Singh, and Guestrin 2008), document summarization (Lin and Bilmes 2011) and unsupervised feature selection (Farahat, Ghodsi, and Kamel 2011), to name a few.

Subset selection is generally NP-hard, and much effort has been devoted to developing polynomial-time approximation algorithms. The greedy algorithm, which iteratively selects the item with the largest marginal gain, has been shown to be a good approximation solver. When the involved objective function $f$ satisfies the monotone property, the greedy algorithm can achieve a $(1 - e^{-\gamma})$-approximation guarantee, where $\gamma$ is the submodularity ratio measuring how close $f$ is to submodularity (Das and Kempe 2011). Particularly, for submodular objective functions, $\gamma = 1$ and the approximation guarantee becomes $1 - 1/e$, which is optimal, i.e., cannot be improved by any polynomial-time algorithm (Nemhauser and Wolsey 1978). Harshaw et al. (2019) have recently proved that the general approximation guarantee of $(1 - e^{-\gamma})$ is also optimal.
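As a concrete illustration of this greedy procedure, here is a minimal Python sketch; `greedy_subset_selection` and the toy coverage objective are our illustrative assumptions, not code from the paper.

```python
def greedy_subset_selection(items, f, k):
    """Iteratively add the item with the largest marginal gain until k items are selected."""
    selected = set()
    for _ in range(k):
        candidates = items - selected
        if not candidates:
            break
        best = max(candidates, key=lambda v: f(selected | {v}) - f(selected))
        selected.add(best)
    return selected

# Toy usage with a (submodular) coverage objective: f(S) = number of elements covered by S.
coverage = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a"}}
f = lambda S: len(set().union(*(coverage[v] for v in S))) if S else 0
print(greedy_subset_selection(set(coverage), f, k=2))  # picks item 3 first, then item 1
```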
Based on Pareto optimization, Qian et al. (2015) proposed the POSS algorithm for subset selection. The idea is to reformulate subset selection as a bi-objective optimization problem, maximizing the given objective and minimizing the subset size simultaneously, then to solve this problem by a multi-objective EA (MOEA), and finally to select the best solution with size at most $k$ from the generated solution set. It has been shown that POSS can achieve the optimal polynomial-time approximation guarantee, $1 - e^{-\gamma}$, and can be significantly better than the greedy algorithm in applications, e.g., unsupervised feature selection and sparse regression. Moreover, POSS is robust against uncertainties (Qian et al. 2017; Roostapour et al. 2019), and can be easily distributed for large-scale tasks (Qian et al. 2016; 2018; Qian 2019).

The optimization engine of POSS is the employed MOEA, which iteratively reproduces new solutions for solving the reformulated bi-objective problem. For EAs, mutation and recombination (also called crossover) are two popular reproduction operators (Bäck 1996); the former changes one solution randomly, whereas the latter mixes up two or more solutions. POSS applies mutation only and has performed well, while recombination, as a core feature of EAs, may help to further improve its performance.

In this paper, we propose the PORSS algorithm for subset selection by introducing recombination into POSS. Two common recombination operators are considered: one-point recombination and uniform recombination. In theory, we prove that for subset selection with monotone objective functions, PORSS can achieve the optimal polynomial-time approximation guarantee, $1 - e^{-\gamma}$; for one concrete example of subset selection, PORSS can be significantly faster than POSS in finding an optimal solution. We also conduct experiments on the applications of unsupervised feature selection and sparse regression with various real-world data sets, showing that within the same running time, PORSS can almost always achieve better performance than POSS.

Note that recombination is only understood at a preliminary level, though great effort has been devoted to analyzing its influence, e.g., (Neumann and Theile 2010; Doerr et al. 2013; Qian, Yu, and Zhou 2013; Oliveto and Witt 2014; Sudholt 2017; Dang et al. 2018). Our analysis theoretically discloses that recombining diverse solutions is more likely than mutation to generate various variations, and thus to escape from local optima; this may help to understand this kind of operator.

## Subset Selection

Given a ground set $V = \{v_1, v_2, \ldots, v_n\}$, we study functions $f: 2^V \rightarrow \mathbb{R}$ over subsets of $V$. A set function $f$ is monotone if $\forall S \subseteq T$, $f(S) \le f(T)$. Assume w.l.o.g. that monotone functions are normalized, i.e., $f(\emptyset) = 0$. A set function $f$ is submodular (Nemhauser, Wolsey, and Fisher 1978) if $\forall S \subseteq T \subseteq V$, $f(T) - f(S) \le \sum_{v \in T \setminus S} \big( f(S \cup \{v\}) - f(S) \big)$. For a general set function $f$, the notion of submodularity ratio in Definition 1 is used to measure to what extent $f$ has the submodular property. When $f$ is monotone, it holds that (1) $\forall S, l: 0 \le \gamma_{S,l}(f) \le 1$, and (2) $f$ is submodular iff $\forall S, l: \gamma_{S,l}(f) = 1$.

**Definition 1** (Submodularity Ratio (Das and Kempe 2011)). The submodularity ratio of a set function $f: 2^V \rightarrow \mathbb{R}$ with respect to a set $S \subseteq V$ and a parameter $l \ge 1$ is
$$\gamma_{S,l}(f) = \min_{L \subseteq S,\, T: |T| \le l,\, T \cap L = \emptyset} \frac{\sum_{v \in T} \big( f(L \cup \{v\}) - f(L) \big)}{f(L \cup T) - f(L)}.$$
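For intuition, $\gamma_{S,l}(f)$ can be computed by brute force on tiny ground sets; the following Python sketch does exactly that (the function name and the toy coverage instance are ours, for illustration only, and the exhaustive enumeration is only practical for very small $V$).

```python
from itertools import combinations

def submodularity_ratio(f, S, l, V):
    """Brute-force gamma_{S,l}(f): minimize the ratio over all L <= S and all
    disjoint T with 1 <= |T| <= l. Assumes f is monotone."""
    ratios = []
    for r in range(len(S) + 1):
        for L in map(set, combinations(S, r)):
            for t in range(1, l + 1):
                for T in map(set, combinations(V - L, t)):
                    denom = f(L | T) - f(L)
                    if denom > 0:
                        ratios.append(sum(f(L | {v}) - f(L) for v in T) / denom)
    return min(ratios) if ratios else 1.0

# A coverage function is submodular, so the ratio should be 1.
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
f = lambda S: len(set().union(*(cover[v] for v in S))) if S else 0
print(submodularity_ratio(f, S={1, 2}, l=2, V={1, 2, 3}))  # 1.0
```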
The subset selection problem, as presented in Definition 2, is to select a subset $S$ of $V$ such that a given objective $f$ is maximized under the constraint $|S| \le k$. For a monotone function $f$, the greedy algorithm, which iteratively adds the item with the largest marginal gain until $k$ items are selected, can achieve an approximation guarantee of $(1 - e^{-\gamma_{S,k}(f)})$ (Das and Kempe 2011), where $S$ is the subset output by the greedy algorithm. The optimality of this approximation guarantee was previously known only in the case where $\gamma_{S,k}(f) = 1$, i.e., $f$ is submodular (Nemhauser and Wolsey 1978), and has recently been proved for the general case (Harshaw et al. 2019).

**Definition 2** (Subset Selection). Given all items $V = \{v_1, v_2, \ldots, v_n\}$, an objective function $f$ and a budget $k$, find a subset of at most $k$ items maximizing $f$, i.e.,
$$\arg\max_{S \subseteq V} f(S) \quad \text{s.t.} \quad |S| \le k. \qquad (1)$$

Here are two applications of subset selection with monotone, but not necessarily submodular, objective functions, which will be studied in this paper.

Unsupervised feature selection, as presented in Definition 3, is to select at most $k$ columns from a matrix $A$ to best approximate $A$. Some notations: $(\cdot)^+$ denotes the Moore-Penrose inverse of a matrix; $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; $|\cdot|$ denotes the number of columns of a matrix. The goodness of approximation is measured by the sum of squared errors between the original matrix $A$ and the approximation $SS^+A$, where $SS^+$ is the projection matrix onto the space spanned by the columns of $S$. Note that a submatrix of $A$ can be seen as a subset of all columns of $A$.

**Definition 3** (Unsupervised Feature Selection). Given a matrix $A \in \mathbb{R}^{m \times n}$ and a budget $k$, find a submatrix $S$ of $A$ with at most $k$ columns minimizing $\|A - SS^+A\|^2_F$, i.e.,
$$\arg\min_{S:\ \text{a submatrix of } A} \|A - SS^+A\|^2_F \quad \text{s.t.} \quad |S| \le k.$$
For the ease of theoretical treatment, this minimization problem is often equivalently reformulated as a maximization problem (Bhaskara et al. 2016; Ordozgoiti, Canaval, and Mozo 2018):
$$\arg\max_{S:\ \text{a submatrix of } A} \|SS^+A\|^2_F \quad \text{s.t.} \quad |S| \le k.$$

Sparse regression (Miller 2002), as presented in Definition 4, is to find a sparse approximate solution to the linear regression problem. Note that $S$ and its index set $\{i \mid v_i \in S\}$ are not distinguished for notational convenience, and all variables are assumed w.l.o.g. to be normalized to have expectation 0 and variance 1.

**Definition 4** (Sparse Regression). Given observation variables $V = \{v_1, \ldots, v_n\}$, a predictor variable $z$ and a budget $k$, find at most $k$ variables from $V$ maximizing the squared multiple correlation (Johnson and Wichern 2007), i.e.,
$$\arg\max_{S \subseteq V} R^2_{z,S} = 1 - \mathrm{MSE}_{z,S} \quad \text{s.t.} \quad |S| \le k,$$
where $\mathrm{MSE}_{z,S}$ denotes the mean squared error, i.e., $\mathrm{MSE}_{z,S} = \min_{\alpha \in \mathbb{R}^{|S|}} \mathbb{E}\big[\big(z - \sum_{i \in S} \alpha_i v_i\big)^2\big]$.

## The POSS Algorithm

Based on Pareto optimization, a new algorithm POSS for subset selection has been proposed (Friedrich and Neumann 2015; Qian, Yu, and Zhou 2015). Note that a subset $S$ of $V$ can be represented by a binary vector $x \in \{0,1\}^n$, where $x_i = 1$ iff the item $v_i \in S$; we will not distinguish them for notational convenience. POSS reformulates the original problem Eq. (1) as a bi-objective minimization problem:
$$\arg\min_{x \in \{0,1\}^n} \big(f_1(x), f_2(x)\big), \qquad (2)$$
$$f_1(x) = \begin{cases} +\infty, & |x| \ge 2k \\ -f(x), & \text{otherwise} \end{cases}, \qquad f_2(x) = |x|.$$
Thus, POSS maximizes the original objective function $f$ and minimizes the subset size $|x|$ simultaneously. By setting $f_1$ to $+\infty$ for $|x| \ge 2k$, overly infeasible solutions, i.e., solutions with large constraint violation, are excluded.
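A minimal sketch of this bi-objective evaluation under the binary-vector representation (the helper name and toy objective are ours; Eq. (2) formulates $-f(x)$ so that both components are minimized):

```python
import math

def bi_objective(x, f, k):
    """Evaluate (f1(x), f2(x)) of Eq. (2) for a 0/1 list x; both values are to be minimized.
    Overly infeasible solutions (|x| >= 2k) get f1 = +infinity and are thus never kept."""
    size = sum(x)
    return (math.inf if size >= 2 * k else -f(x), size)

# Example with a toy coverage objective over n = 3 items.
cover = [{"a", "b"}, {"b", "c"}, {"d"}]
f = lambda x: len(set().union(set(), *(cover[i] for i, b in enumerate(x) if b)))
print(bi_objective([1, 0, 1], f, k=2))  # (-3, 2)
print(bi_objective([1, 1, 1], f, k=1))  # (inf, 3): excluded as overly infeasible
```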
To compare solutions in bi-objective optimization, POSS uses the domination relationship. For two solutions $x$ and $x'$, $x$ weakly dominates $x'$, denoted as $x \succeq x'$, if $f_1(x) \le f_1(x')$ and $f_2(x) \le f_2(x')$; $x$ dominates $x'$, denoted as $x \succ x'$, if $x \succeq x'$ and either $f_1(x) < f_1(x')$ or $f_2(x) < f_2(x')$; they are incomparable if neither $x \succeq x'$ nor $x' \succeq x$.

POSS employs a simple MOEA with mutation only, which is slightly modified from the GSEMO algorithm (Giel 2003; Laumanns, Thiele, and Zitzler 2004), to solve the bi-objective problem Eq. (2).

Algorithm 1: POSS
Input: $V = \{v_1, \ldots, v_n\}$; objective $f: 2^V \rightarrow \mathbb{R}$; budget $k$
Parameter: the number $T$ of iterations
Output: a subset of $V$ with at most $k$ items
Process:
1: Let $x = 0^n$, $P = \{x\}$ and $t = 0$;
2: while $t < T$ do
3: Select $x$ from $P$ uniformly at random;
4: Apply bit-wise mutation on $x$ to generate $x'$;
5: if $\nexists z \in P$ such that $z \succ x'$ then
6: $P = (P \setminus \{z \in P \mid x' \succeq z\}) \cup \{x'\}$
7: end if
8: $t = t + 1$
9: end while
10: return $\arg\max_{x \in P, |x| \le k} f(x)$

As described in Algorithm 1, POSS starts from $0^n$, representing the empty set, and iteratively tries to improve the solutions in the population $P$ (lines 2-9). In each iteration, a solution $x$ is selected from $P$ uniformly at random, and used to generate a new solution $x'$ by the bit-wise mutation operator:

- Bit-wise mutation: flip each bit of a solution $x \in \{0,1\}^n$ independently with probability $1/n$.

The newly generated solution $x'$ is then used to update $P$ in lines 5-7, so that $P$ contains only the non-dominated solutions generated so far. That is, if $x'$ is not dominated by any solution in $P$ (line 5), it will be added into $P$, and meanwhile those archived solutions weakly dominated by $x'$ will be deleted (line 6). After running $T$ iterations, the best solution w.r.t. the original problem Eq. (1) is selected from $P$ in line 10 as the final output solution.

For subset selection with monotone objective functions, POSS has been proved to achieve the same general approximation guarantee as the greedy algorithm in polynomial expected running time, i.e., to achieve the optimal polynomial-time approximation guarantee (Qian, Yu, and Zhou 2015). Furthermore, it has been empirically shown that POSS can achieve significantly better performance than the greedy algorithm in some applications, e.g., unsupervised feature selection (Feng, Qian, and Tang 2019) and sparse regression (Qian, Yu, and Zhou 2015).
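To make Algorithm 1 concrete, here is a minimal Python sketch that mirrors its structure; `poss` and its helpers are our illustrative names (not the authors' code), and $f$ is assumed to take a 0/1 list.

```python
import math, random

def poss(n, f, k, T):
    """A sketch of POSS (Algorithm 1): bit-wise mutation plus a non-dominated archive.
    f maps a 0/1 list to a real value; the objectives (f1, f2) of Eq. (2) are both minimized."""
    def objs(x):
        size = sum(x)
        return (math.inf if size >= 2 * k else -f(x), size)

    def weakly_dominates(a, b):
        fa, fb = objs(a), objs(b)
        return fa[0] <= fb[0] and fa[1] <= fb[1]

    def dominates(a, b):
        return weakly_dominates(a, b) and objs(a) != objs(b)

    population = [[0] * n]                                   # start from 0^n (the empty set)
    for _ in range(T):
        x = random.choice(population)                        # line 3: uniform parent selection
        child = [b ^ (random.random() < 1 / n) for b in x]   # line 4: bit-wise mutation
        if not any(dominates(z, child) for z in population): # line 5
            population = [z for z in population              # line 6: drop weakly dominated
                          if not weakly_dominates(child, z)]
            population.append(child)
    feasible = [x for x in population if sum(x) <= k]
    return max(feasible, key=f)                              # line 10: best subset of size <= k
```

In practice one would cache objective values rather than recomputing them in every comparison; the sketch only reproduces the control flow of Algorithm 1.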
## The PORSS Algorithm

To reproduce new solutions in each iteration, POSS applies the mutation operator, which simulates the mutation phenomenon in DNA transformation. Recombination is another popular reproduction operator, which simulates the chromosome exchange phenomenon in zoogamy reproduction, and typically appears in various practical EAs, e.g., the popular algorithm NSGA-II (Deb et al. 2002). In this section, we propose a new Pareto Optimization algorithm with Recombination for Subset Selection, briefly called PORSS. As described in Algorithm 2, PORSS employs recombination and mutation together, rather than mutation only, to generate new solutions in each iteration.

Algorithm 2: PORSS
Input: $V = \{v_1, \ldots, v_n\}$; objective $f: 2^V \rightarrow \mathbb{R}$; budget $k$
Parameter: the number $T$ of iterations
Output: a subset of $V$ with at most $k$ items
Process:
1: Let $x = 0^n$, $P = \{x\}$ and $t = 0$;
2: while $t < T$ do
3: Select $x, y$ from $P$ randomly with replacement;
4: Apply recombination on $x, y$ to generate $x', y'$;
5: Apply bit-wise mutation on $x', y'$ to generate $x'', y''$;
6: for each $q \in \{x'', y''\}$ do
7: if $\nexists z \in P$ such that $z \succ q$ then
8: $P = (P \setminus \{z \in P \mid q \succeq z\}) \cup \{q\}$
9: end if
10: end for
11: $t = t + 1$
12: end while
13: return $\arg\max_{x \in P, |x| \le k} f(x)$

In line 3, two solutions are selected randomly from the population $P$ with replacement, and then recombined in line 4 to generate new solutions. We consider two commonly used recombination operators:

- One-point recombination: select $i \in \{1, 2, \ldots, n\}$ uniformly at random, and exchange the first $i$ bits of the two solutions;
- Uniform recombination: exchange each bit of the two solutions independently with probability 1/2.

For example, for the solutions $0^n$ and $1^n$, the two new solutions $0^{n/2}1^{n/2}$ and $1^{n/2}0^{n/2}$ can be generated by one-point recombination with probability $1/n$, and by uniform recombination with probability $(1/2^n) \cdot 2$, where the factor 2 is included due to symmetry. In line 5, the two solutions generated by recombination are further mutated to generate another two solutions, which are used to update the population $P$ in lines 6-10.
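A minimal Python sketch of the two recombination operators and the subsequent mutation step, as used in lines 4-5 of Algorithm 2 (list-of-bits representation; the function names are illustrative assumptions):

```python
import random

def one_point_recombination(x, y):
    """Pick i uniformly from {1, ..., n} and exchange the first i bits of x and y."""
    i = random.randint(1, len(x))
    return y[:i] + x[i:], x[:i] + y[i:]

def uniform_recombination(x, y):
    """Exchange each bit of x and y independently with probability 1/2."""
    swap = [random.random() < 0.5 for _ in x]
    return ([b if s else a for a, b, s in zip(x, y, swap)],
            [a if s else b for a, b, s in zip(x, y, swap)])

def bitwise_mutation(x):
    """Flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ (random.random() < 1 / n) for b in x]

# One reproduction step of PORSS (lines 3-5 of Algorithm 2) for two parents x and y:
x, y = [0, 0, 1, 1], [1, 1, 0, 0]
xp, yp = uniform_recombination(x, y)
print(bitwise_mutation(xp), bitwise_mutation(yp))
```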
### Influence of Recombination

To understand the influence of recombination intuitively, we compare the distribution of the number of bits flipped with and without recombination. Suppose the two solutions $x, y$ selected in line 3 of Algorithm 2 have Hamming distance $d$, denoted as $H(x,y) = d$. Let $x'', y''$ denote the two solutions generated by recombination and mutation in line 5. We analyze the probability for $x''$ or $y''$ to have Hamming distance $j$ from $x$, denoted as $Q(d,j)$. That is,
$$Q(d,j) = \Pr\big(H(x,x'')=j \vee H(x,y'')=j \mid H(x,y)=d\big).$$
For one-point and uniform recombination, we use the notations $Q_o(d,j)$ and $Q_u(d,j)$, respectively. Let $z^m$ denote the solution generated from a solution $z$ by bit-wise mutation. By turning off recombination, i.e., deleting line 4 of Algorithm 2, we analyze the corresponding probability
$$Q_m(d,j) = \Pr\big(H(x,x^m)=j \vee H(x,y^m)=j \mid H(x,y)=d\big).$$
In the following, we compare $Q_o(d,j)$ and $Q_u(d,j)$ with $Q_m(d,j)$ to examine the influence of recombination.

Given a solution $z$ with Hamming distance $i$ from $x$, let $q_{i,j}$ denote the probability for the Hamming distance to become $j$ after bit-wise mutation, i.e., $q_{i,j} = \Pr(H(x,z^m)=j \mid H(x,z)=i)$. For $Q_m(d,j)$, as $H(x,x)=0$ and $H(x,y)=d$, we have
$$Q_m(d,j) = q_{0,j} + q_{d,j} - q_{0,j}\, q_{d,j}.$$
By uniform recombination, $x$ and $y$ exchange $i$ of their differing bits with probability $\binom{d}{i}(1/2)^i(1/2)^{d-i} = \binom{d}{i}(1/2)^d$, generating two solutions $x', y'$ with $H(x,x')=i$ and $H(x,y')=d-i$. Note that $x$ and $y$ differ in $d$ bits in total. Considering the mutation behavior on $x', y'$, we have
$$Q_u(d,j) = \sum_{i=0}^{d} \frac{\binom{d}{i}}{2^d}\,\big(q_{i,j} + q_{d-i,j} - q_{i,j}\, q_{d-i,j}\big).$$
Now consider one-point recombination. For any $1 \le i \le d$, there exists $l \ge i$ such that exchanging the first $l$ bits of $x, y$ generates two solutions $x', y'$ with $H(x,x')=i$ and $H(x,y')=d-i$. Thus, we have
$$Q_o(d,j) \ge \frac{1}{n}\sum_{i=1}^{d}\big(q_{i,j} + q_{d-i,j} - q_{i,j}\, q_{d-i,j}\big).$$
Because it is sufficient for mutation to keep all bits unchanged, $q_{j,j} \ge (1-1/n)^n \ge 1/(2e)$. Thus, we have

(a) $\forall j \le d: Q_o(d,j) \ge \frac{1}{n}\, q_{j,j} = \Omega(1/n)$;

(b) $\forall j \le d: Q_u(d,j) \ge \frac{\binom{d}{j}}{2^d}\, q_{j,j} = \Omega\big(\binom{d}{j}/2^d\big)$.

By analyzing $q_{0,j}$ and $q_{d,j}$, we can derive that there exists $0 < j_0 \le d$ such that

(c.1) for $j < j_0$: $\Omega\big((1/j)^j\big) \le Q_m(d,j) \le O\big((e/j)^j\big)$;

(c.2) for $j \ge j_0$: $\Omega\big((1/(d-j))^{d-j}\big) \le Q_m(d,j) \le O\big((e/(d-j))^{d-j}\big)$.

The detailed analysis of $Q_m(d,j)$ is provided in the supplementary material due to space limitations.

According to (c.1) and (c.2), the number of bits flipped by mutation only is strongly concentrated around the two extreme values, 0 and $d$. When $j$ increases from 0 towards $j_0$ or decreases from $d$ towards $j_0$, $Q_m(d,j)$ decays super-exponentially. Particularly, for $j = d/2$,
$$q_{0,d/2} \le \binom{n}{d/2}(1/n)^{d/2} \le \frac{1}{(d/2)!}, \qquad q_{d,d/2} \le \binom{d}{d/2}(1/n)^{d/2} \le \frac{1}{(d/2)!},$$
and thus,
$$Q_m(d,d/2) \le \frac{2}{(d/2)!} \le \frac{2}{e\,(d/(2e))^{d/2}}, \qquad (3)$$
where the second inequality holds by Stirling's formula. According to (b), the number of bits flipped by uniform recombination followed by mutation is concentrated around $d/2$, but $Q_u(d,j)$ is always lower bounded by $\Omega(1/2^d)$, which is much greater than $Q_m(d,d/2)$ in Eq. (3) when $d$ is large. According to (a), $\forall j \le d: Q_o(d,j) = \Omega(1/n)$, implying that the number of bits flipped by one-point recombination followed by mutation is relatively uniformly distributed. Therefore, starting from diverse solutions, i.e., when $d$ is large, recombination can ease flipping any number of bits, and may lead to better exploration and thus a better ability to escape from local optima. The advantage of recombination will be verified by theoretical analysis and empirical study.
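As a quick empirical check on these distributions (complementing the bounds above), $Q_m$, $Q_u$ and $Q_o$ can be estimated by simulation; the following Python sketch does so for one arbitrary choice of $n$ and $d$ (all names and parameters are our illustrative assumptions).

```python
import random

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def mutate(z):
    n = len(z)
    return [b ^ (random.random() < 1 / n) for b in z]

def one_point(x, y):
    i = random.randint(1, len(x))
    return y[:i] + x[i:], x[:i] + y[i:]

def uniform(x, y):
    s = [random.random() < 0.5 for _ in x]
    return ([b if t else a for a, b, t in zip(x, y, s)],
            [a if t else b for a, b, t in zip(x, y, s)])

def estimate_Q(n=30, d=20, trials=20000, recombine=None):
    """Estimate Q(d, j) = P(H(x, x'') = j or H(x, y'') = j) with x = 0^n and H(x, y) = d."""
    x, y = [0] * n, [1] * d + [0] * (n - d)
    hits = [0] * (n + 1)
    for _ in range(trials):
        a, b = recombine(x, y) if recombine else (x, y)
        for j in {hamming(x, mutate(a)), hamming(x, mutate(b))}:
            hits[j] += 1
    return [h / trials for h in hits]

Qm, Qu, Qo = estimate_Q(), estimate_Q(recombine=uniform), estimate_Q(recombine=one_point)
print(Qm[10], Qu[10], Qo[10])  # Q(d, d/2): essentially 0 for mutation only, clearly positive otherwise
```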
## Theoretical Analysis

As introduced before, the greedy algorithm and POSS can achieve the optimal polynomial-time approximation guarantee for subset selection with monotone objective functions. A natural question is whether PORSS can keep this optimal approximation. We give a positive answer by proving Theorem 1, i.e., PORSS achieves the approximation guarantee of $(1 - e^{-\gamma_{\min}})$ in polynomial expected running time. Let OPT denote the optimal function value. The proof is inspired by the analysis of POSS (Qian, Yu, and Zhou 2015).

**Lemma 1** (Qian et al. 2016). Let $f: \{0,1\}^n \rightarrow \mathbb{R}^+$ be a monotone function. For any $x \in \{0,1\}^n$, there exists one item $v \notin x$ such that
$$f(x \cup \{v\}) - f(x) \ge \frac{\gamma_{x,k}}{k}\,(\mathrm{OPT} - f(x)).$$

**Theorem 1.** For subset selection with any monotone $f$, the expected number of iterations until PORSS with one-point or uniform recombination finds a solution $x$ with $|x| \le k$ and $f(x) \ge (1 - e^{-\gamma_{\min}}) \cdot \mathrm{OPT}$ is polynomial, where $\gamma_{\min} = \min_{x: |x| = k-1} \gamma_{x,k}$.

**Proof.** Let $J_{\max}$ be the maximum value of $j \in \{0, 1, \ldots, k\}$ such that the population $P$ contains a solution $x$ with $|x| \le j$ and $f(x) \ge (1 - (1 - \gamma_{\min}/k)^j) \cdot \mathrm{OPT}$. That is,
$$J_{\max} = \max\{\, j \in \{0,1,\ldots,k\} \mid \exists x \in P: |x| \le j \wedge f(x) \ge (1 - (1 - \gamma_{\min}/k)^j) \cdot \mathrm{OPT} \,\}.$$
We only need to analyze the expected number of iterations until $J_{\max} = k$, which implies that there exists a solution $x \in P$ satisfying $|x| \le k$ and $f(x) \ge (1 - (1 - \gamma_{\min}/k)^k) \cdot \mathrm{OPT} \ge (1 - e^{-\gamma_{\min}}) \cdot \mathrm{OPT}$.

As PORSS starts from $0^n$, $J_{\max}$ is initially 0. Assume that currently $J_{\max} = i < k$, and let $x$ denote a solution corresponding to $J_{\max} = i$, i.e., $|x| \le i$ and $f(x) \ge (1 - (1 - \gamma_{\min}/k)^i) \cdot \mathrm{OPT}$. First, $J_{\max}$ will not decrease. This is because deleting $x$ from $P$ in line 8 of Algorithm 2 implies that $x$ is weakly dominated by the newly included solution $q$, which satisfies $|q| \le |x| \le i$ and $f(q) \ge f(x) \ge (1 - (1 - \gamma_{\min}/k)^i) \cdot \mathrm{OPT}$.

Next, we analyze the probability of increasing $J_{\max}$ in one iteration. Consider the case that the two solutions selected in line 3 of Algorithm 2 are both $x$, which occurs with probability $(1/|P|) \cdot (1/|P|)$ due to uniform selection with replacement. For two identical solutions, either one-point or uniform recombination in line 4 makes no change. Thus, in line 5, $x$ is used to generate a new solution by bit-wise mutation, and this process is performed twice independently.

For bit-wise mutation on $x$, according to Lemma 1, a new solution $x'$ satisfying $f(x') - f(x) \ge (\gamma_{x,k}/k)(\mathrm{OPT} - f(x))$ can be generated by flipping only one specific 0-bit of $x$ (i.e., adding one specific item into $x$), which occurs with probability $(1/n)(1-1/n)^{n-1} \ge 1/(en)$. As $f(x) \ge (1 - (1 - \gamma_{\min}/k)^i) \cdot \mathrm{OPT}$, we have
$$f(x') \ge (1 - \gamma_{x,k}/k)\, f(x) + (\gamma_{x,k}/k) \cdot \mathrm{OPT} \ge \big(1 - (1 - \gamma_{x,k}/k)(1 - \gamma_{\min}/k)^i\big) \cdot \mathrm{OPT} \ge \big(1 - (1 - \gamma_{\min}/k)^{i+1}\big) \cdot \mathrm{OPT}.$$
Note that the last inequality holds by $\gamma_{x,k} \ge \gamma_{\min}$, because $|x| < k$ and $\gamma_{x,k}$ decreases with $x$. As $x$ is mutated twice independently in line 5, such a new solution $x'$ can be generated with probability at least $1 - (1 - 1/(en))^2 = 2/(en) - 1/(en)^2$. It is clear that $|x'| = |x| + 1 \le i + 1$. Then, $x'$ will be added into $P$; otherwise, $x'$ must be dominated by one archived solution in line 7 of Algorithm 2, which would imply that $J_{\max}$ is already larger than $i$, contradicting the assumption $J_{\max} = i$. After adding $x'$ into $P$, $J_{\max} \ge i + 1$. Thus, $J_{\max}$ can increase by at least 1 in one iteration with probability at least $(1/|P|^2) \cdot (2/(en) - 1/(en)^2)$.

By the procedure of updating the population $P$ in lines 6-10, the solutions in $P$ must be incomparable. Thus, each value of one objective can correspond to at most one solution in $P$. Because the solutions with $|x| \ge 2k$ have value $+\infty$ on the first objective, they must be excluded from $P$; thus, $|x| \in \{0, 1, \ldots, 2k-1\}$, implying $|P| \le 2k$. We can now conclude that the probability of increasing $J_{\max}$ in one iteration is at least $(1/(4k^2)) \cdot (2/(en) - 1/(en)^2) = \Omega(1/(k^2 n))$.

The above analysis shows that $J_{\max}$ will not decrease, but can increase with probability $\Omega(1/(k^2 n))$ in one iteration. Thus, the expected number of iterations until $J_{\max}$ increases by at least 1 is $O(k^2 n)$. For $J_{\max} = k$, it requires increasing $J_{\max}$ at most $k$ times, implying that the expected number of iterations until finding a solution with the desired approximation guarantee is $O(k^3 n)$, which is polynomial.

Next, by an illustrative example of subset selection, we prove that PORSS can perform much better than the greedy algorithm and POSS. As presented in Definition 5, the best subset of size $i+1$ can be generated from the best subset of size $i$ by adding one specific item; the only exception is the best subset of size $k$, i.e., the optimal solution, which differs greatly from the best subsets of the other sizes. This example represents subset selection problems where decisions have to be made in sequence to some extent.

**Definition 5.** The objective function $f$ satisfies that (1) $\forall\, 0 \le i \le n-1: f(x^i) < f(x^{i+1})$; (2) if $|x| = i \ne k$, then $\forall x \ne x^i: f(x) < f(x^i)$; (3) if $|x| = k$, then $\forall x \notin \{x^*, x^k\}: f(x)$