# Subset Selection by Pareto Optimization with Recombination

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Chao Qian,¹ Chao Bian,¹ Chao Feng²
¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
²School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
qianc@lamda.nju.edu.cn, chaobian12@gmail.com, chaofeng@mail.ustc.edu.cn
This work was supported by the NSFC (61603367).

## Abstract

Subset selection, i.e., selecting a limited number of items to optimize some given objective function, is a fundamental problem with various applications such as unsupervised feature selection and sparse regression. By employing a multi-objective evolutionary algorithm (EA) with mutation only to optimize the given objective function and minimize the number of selected items simultaneously, the recently proposed POSS algorithm achieves state-of-the-art performance for subset selection. In this paper, we propose the PORSS algorithm by incorporating recombination, a characterizing feature of EAs, into POSS. We prove that PORSS can achieve the same optimal polynomial-time approximation guarantee as POSS when the objective function is monotone, and can find an optimal solution efficiently in some cases where POSS cannot. Extensive experiments on unsupervised feature selection and sparse regression show the superiority of PORSS over POSS. Our analysis also theoretically discloses that recombination from diverse solutions can be more likely than mutation alone to generate various variations, thereby leading to better exploration; this may be of independent interest for understanding the influence of recombination.

## Introduction

This paper considers a general problem, i.e., subset selection, which is to select a subset of size at most $k$ from a total set of $n$ items so as to maximize (or minimize) some given objective function $f$. This problem arises in various real-world applications, such as maximum coverage (Feige 1998), sparse regression (Miller 2002), influence maximization (Kempe, Kleinberg, and Tardos 2003), sensor placement (Krause, Singh, and Guestrin 2008), document summarization (Lin and Bilmes 2011) and unsupervised feature selection (Farahat, Ghodsi, and Kamel 2011), to name a few.

Subset selection is generally NP-hard, and much effort has been devoted to developing polynomial-time approximation algorithms. The greedy algorithm, which iteratively selects the item with the largest marginal gain, has been shown to be a good approximation solver. When the involved objective function $f$ satisfies the monotone property, the greedy algorithm can achieve a $(1 - e^{-\gamma})$-approximation guarantee, where $\gamma$ is the submodularity ratio measuring how close $f$ is to submodularity (Das and Kempe 2011). Particularly, for submodular objective functions, $\gamma = 1$ and the approximation guarantee becomes $1 - 1/e$, which is optimal, i.e., cannot be improved by any polynomial-time algorithm (Nemhauser and Wolsey 1978). Harshaw et al. (2019) have recently proved that the general approximation guarantee of $(1 - e^{-\gamma})$ is also optimal.
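As a concrete illustration of this greedy procedure, here is a minimal Python sketch; `greedy_subset_selection` and the toy coverage objective are our illustrative assumptions, not code from the paper.

```python
def greedy_subset_selection(items, f, k):
    """Iteratively add the item with the largest marginal gain until k items are selected."""
    selected = set()
    for _ in range(k):
        candidates = items - selected
        if not candidates:
            break
        best = max(candidates, key=lambda v: f(selected | {v}) - f(selected))
        selected.add(best)
    return selected

# Toy usage with a (submodular) coverage objective: f(S) = number of elements covered by S.
coverage = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a"}}
f = lambda S: len(set().union(*(coverage[v] for v in S))) if S else 0
print(greedy_subset_selection(set(coverage), f, k=2))  # picks item 3 first, then item 1
```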
Based on Pareto optimization, Qian et al. (2015) proposed the POSS algorithm for subset selection. The idea is to reformulate subset selection as a bi-objective optimization problem, maximizing the given objective and minimizing the subset size simultaneously, then to solve this problem by a multi-objective EA (MOEA), and finally to select the best solution with size at most $k$ from the generated solution set. It has been shown that POSS can achieve the optimal polynomial-time approximation guarantee, $1 - e^{-\gamma}$, and can be significantly better than the greedy algorithm in applications, e.g., unsupervised feature selection and sparse regression. Moreover, POSS is robust against uncertainties (Qian et al. 2017; Roostapour et al. 2019), and can be easily distributed for large-scale tasks (Qian et al. 2016; 2018; Qian 2019).

The optimization engine of POSS is the employed MOEA, which iteratively reproduces new solutions for solving the reformulated bi-objective problem. For EAs, mutation and recombination (also called crossover) are two popular reproduction operators (Bäck 1996); the former changes one solution randomly, whereas the latter mixes up two or more solutions. POSS applies mutation only and has performed well, while recombination, as a core feature of EAs, may help to further improve its performance.

In this paper, we propose the PORSS algorithm for subset selection by introducing recombination into POSS. Two common recombination operators are considered: one-point recombination and uniform recombination. In theory, we prove that for subset selection with monotone objective functions, PORSS can achieve the optimal polynomial-time approximation guarantee, $1 - e^{-\gamma}$; for one concrete example of subset selection, PORSS can be significantly faster than POSS in finding an optimal solution. We also conduct experiments on the applications of unsupervised feature selection and sparse regression with various real-world data sets, showing that within the same running time, PORSS can almost always achieve better performance than POSS.

Note that recombination is only understood at a preliminary level, though great effort has been devoted to analyzing its influence, e.g., (Neumann and Theile 2010; Doerr et al. 2013; Qian, Yu, and Zhou 2013; Oliveto and Witt 2014; Sudholt 2017; Dang et al. 2018). Our analysis theoretically discloses that recombining diverse solutions is more likely than mutation to generate various variations, and thus to escape from local optima; this may help to understand this kind of operator.

## Subset Selection

Given a ground set $V = \{v_1, v_2, \ldots, v_n\}$, we study functions $f: 2^V \rightarrow \mathbb{R}$ over subsets of $V$. A set function $f$ is monotone if $\forall S \subseteq T$, $f(S) \le f(T)$. Assume w.l.o.g. that monotone functions are normalized, i.e., $f(\emptyset) = 0$. A set function $f$ is submodular (Nemhauser, Wolsey, and Fisher 1978) if $\forall S \subseteq T \subseteq V$, $f(T) - f(S) \le \sum_{v \in T \setminus S} \big( f(S \cup \{v\}) - f(S) \big)$. For a general set function $f$, the notion of submodularity ratio in Definition 1 is used to measure to what extent $f$ has the submodular property. When $f$ is monotone, it holds that (1) $\forall S, l: 0 \le \gamma_{S,l}(f) \le 1$, and (2) $f$ is submodular iff $\forall S, l: \gamma_{S,l}(f) = 1$.

**Definition 1** (Submodularity Ratio (Das and Kempe 2011)). The submodularity ratio of a set function $f: 2^V \rightarrow \mathbb{R}$ with respect to a set $S \subseteq V$ and a parameter $l \ge 1$ is
$$\gamma_{S,l}(f) = \min_{L \subseteq S,\, T: |T| \le l,\, T \cap L = \emptyset} \frac{\sum_{v \in T} \big( f(L \cup \{v\}) - f(L) \big)}{f(L \cup T) - f(L)}.$$
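For intuition, $\gamma_{S,l}(f)$ can be computed by brute force on tiny ground sets; the following Python sketch does exactly that (the function name and the toy coverage instance are ours, for illustration only, and the exhaustive enumeration is only practical for very small $V$).

```python
from itertools import combinations

def submodularity_ratio(f, S, l, V):
    """Brute-force gamma_{S,l}(f): minimize the ratio over all L <= S and all
    disjoint T with 1 <= |T| <= l. Assumes f is monotone."""
    ratios = []
    for r in range(len(S) + 1):
        for L in map(set, combinations(S, r)):
            for t in range(1, l + 1):
                for T in map(set, combinations(V - L, t)):
                    denom = f(L | T) - f(L)
                    if denom > 0:
                        ratios.append(sum(f(L | {v}) - f(L) for v in T) / denom)
    return min(ratios) if ratios else 1.0

# A coverage function is submodular, so the ratio should be 1.
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
f = lambda S: len(set().union(*(cover[v] for v in S))) if S else 0
print(submodularity_ratio(f, S={1, 2}, l=2, V={1, 2, 3}))  # 1.0
```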
The subset selection problem, as presented in Definition 2, is to select a subset $S$ of $V$ such that a given objective $f$ is maximized under the constraint $|S| \le k$. For a monotone function $f$, the greedy algorithm, which iteratively adds the item with the largest marginal gain until $k$ items are selected, can achieve an approximation guarantee of $(1 - e^{-\gamma_{S,k}(f)})$ (Das and Kempe 2011), where $S$ is the subset output by the greedy algorithm. The optimality of this approximation guarantee was previously known only in the case where $\gamma_{S,k}(f) = 1$, i.e., $f$ is submodular (Nemhauser and Wolsey 1978), and has recently been proved for the general case (Harshaw et al. 2019).

**Definition 2** (Subset Selection). Given all items $V = \{v_1, v_2, \ldots, v_n\}$, an objective function $f$ and a budget $k$, find a subset of at most $k$ items maximizing $f$, i.e.,
$$\arg\max_{S \subseteq V} f(S) \quad \text{s.t.} \quad |S| \le k. \qquad (1)$$

Here are two applications of subset selection with monotone, but not necessarily submodular, objective functions, which will be studied in this paper.

Unsupervised feature selection, as presented in Definition 3, is to select at most $k$ columns from a matrix $A$ to best approximate $A$. Some notations: $(\cdot)^+$ denotes the Moore-Penrose inverse of a matrix; $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; $|\cdot|$ denotes the number of columns of a matrix. The goodness of approximation is measured by the sum of squared errors between the original matrix $A$ and the approximation $SS^+A$, where $SS^+$ is the projection matrix onto the space spanned by the columns of $S$. Note that a submatrix of $A$ can be seen as a subset of all columns of $A$.

**Definition 3** (Unsupervised Feature Selection). Given a matrix $A \in \mathbb{R}^{m \times n}$ and a budget $k$, find a submatrix $S$ of $A$ with at most $k$ columns minimizing $\|A - SS^+A\|^2_F$, i.e.,
$$\arg\min_{S:\ \text{a submatrix of } A} \|A - SS^+A\|^2_F \quad \text{s.t.} \quad |S| \le k.$$
For the ease of theoretical treatment, this minimization problem is often equivalently reformulated as a maximization problem (Bhaskara et al. 2016; Ordozgoiti, Canaval, and Mozo 2018):
$$\arg\max_{S:\ \text{a submatrix of } A} \|SS^+A\|^2_F \quad \text{s.t.} \quad |S| \le k.$$

Sparse regression (Miller 2002), as presented in Definition 4, is to find a sparse approximate solution to the linear regression problem. Note that $S$ and its index set $\{i \mid v_i \in S\}$ are not distinguished for notational convenience, and all variables are assumed w.l.o.g. to be normalized to have expectation 0 and variance 1.

**Definition 4** (Sparse Regression). Given observation variables $V = \{v_1, \ldots, v_n\}$, a predictor variable $z$ and a budget $k$, find at most $k$ variables from $V$ maximizing the squared multiple correlation (Johnson and Wichern 2007), i.e.,
$$\arg\max_{S \subseteq V} R^2_{z,S} = 1 - \mathrm{MSE}_{z,S} \quad \text{s.t.} \quad |S| \le k,$$
where $\mathrm{MSE}_{z,S}$ denotes the mean squared error, i.e., $\mathrm{MSE}_{z,S} = \min_{\alpha \in \mathbb{R}^{|S|}} \mathbb{E}\big[\big(z - \sum_{i \in S} \alpha_i v_i\big)^2\big]$.

## The POSS Algorithm

Based on Pareto optimization, a new algorithm POSS for subset selection has been proposed (Friedrich and Neumann 2015; Qian, Yu, and Zhou 2015). Note that a subset $S$ of $V$ can be represented by a binary vector $x \in \{0,1\}^n$, where $x_i = 1$ iff the item $v_i \in S$; we will not distinguish them for notational convenience. POSS reformulates the original problem Eq. (1) as a bi-objective minimization problem:
$$\arg\min_{x \in \{0,1\}^n} \big(f_1(x), f_2(x)\big), \qquad (2)$$
$$f_1(x) = \begin{cases} +\infty, & |x| \ge 2k \\ -f(x), & \text{otherwise} \end{cases}, \qquad f_2(x) = |x|.$$
Thus, POSS maximizes the original objective function $f$ and minimizes the subset size $|x|$ simultaneously. By setting $f_1$ to $+\infty$ for $|x| \ge 2k$, overly infeasible solutions, i.e., solutions with large constraint violation, are excluded.
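A minimal sketch of this bi-objective evaluation under the binary-vector representation (the helper name and toy objective are ours; Eq. (2) formulates $-f(x)$ so that both components are minimized):

```python
import math

def bi_objective(x, f, k):
    """Evaluate (f1(x), f2(x)) of Eq. (2) for a 0/1 list x; both values are to be minimized.
    Overly infeasible solutions (|x| >= 2k) get f1 = +infinity and are thus never kept."""
    size = sum(x)
    return (math.inf if size >= 2 * k else -f(x), size)

# Example with a toy coverage objective over n = 3 items.
cover = [{"a", "b"}, {"b", "c"}, {"d"}]
f = lambda x: len(set().union(set(), *(cover[i] for i, b in enumerate(x) if b)))
print(bi_objective([1, 0, 1], f, k=2))  # (-3, 2)
print(bi_objective([1, 1, 1], f, k=1))  # (inf, 3): excluded as overly infeasible
```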
To compare solutions in bi-objective optimization, POSS uses the domination relationship. For two solutions $x$ and $x'$, $x$ weakly dominates $x'$, denoted as $x \succeq x'$, if $f_1(x) \le f_1(x')$ and $f_2(x) \le f_2(x')$; $x$ dominates $x'$, denoted as $x \succ x'$, if $x \succeq x'$ and either $f_1(x) < f_1(x')$ or $f_2(x) < f_2(x')$; they are incomparable if neither $x \succeq x'$ nor $x' \succeq x$.

POSS employs a simple MOEA with mutation only, which is slightly modified from the GSEMO algorithm (Giel 2003; Laumanns, Thiele, and Zitzler 2004), to solve the bi-objective problem Eq. (2).

Algorithm 1: POSS
Input: $V = \{v_1, \ldots, v_n\}$; objective $f: 2^V \rightarrow \mathbb{R}$; budget $k$
Parameter: the number $T$ of iterations
Output: a subset of $V$ with at most $k$ items
Process:
1: Let $x = 0^n$, $P = \{x\}$ and $t = 0$;
2: while $t < T$ do
3: Select $x$ from $P$ uniformly at random;
4: Apply bit-wise mutation on $x$ to generate $x'$;
5: if $\nexists z \in P$ such that $z \succ x'$ then
6: $P = (P \setminus \{z \in P \mid x' \succeq z\}) \cup \{x'\}$
7: end if
8: $t = t + 1$
9: end while
10: return $\arg\max_{x \in P, |x| \le k} f(x)$

As described in Algorithm 1, POSS starts from $0^n$, representing the empty set, and iteratively tries to improve the solutions in the population $P$ (lines 2-9). In each iteration, a solution $x$ is selected from $P$ uniformly at random, and used to generate a new solution $x'$ by the bit-wise mutation operator:

- Bit-wise mutation: flip each bit of a solution $x \in \{0,1\}^n$ independently with probability $1/n$.

The newly generated solution $x'$ is then used to update $P$ in lines 5-7, so that $P$ contains only the non-dominated solutions generated so far. That is, if $x'$ is not dominated by any solution in $P$ (line 5), it will be added into $P$, and meanwhile those archived solutions weakly dominated by $x'$ will be deleted (line 6). After running $T$ iterations, the best solution w.r.t. the original problem Eq. (1) is selected from $P$ in line 10 as the final output solution.

For subset selection with monotone objective functions, POSS has been proved to achieve the same general approximation guarantee as the greedy algorithm in polynomial expected running time, i.e., to achieve the optimal polynomial-time approximation guarantee (Qian, Yu, and Zhou 2015). Furthermore, it has been empirically shown that POSS can achieve significantly better performance than the greedy algorithm in some applications, e.g., unsupervised feature selection (Feng, Qian, and Tang 2019) and sparse regression (Qian, Yu, and Zhou 2015).
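To make Algorithm 1 concrete, here is a minimal Python sketch that mirrors its structure; `poss` and its helpers are our illustrative names (not the authors' code), and $f$ is assumed to take a 0/1 list.

```python
import math, random

def poss(n, f, k, T):
    """A sketch of POSS (Algorithm 1): bit-wise mutation plus a non-dominated archive.
    f maps a 0/1 list to a real value; the objectives (f1, f2) of Eq. (2) are both minimized."""
    def objs(x):
        size = sum(x)
        return (math.inf if size >= 2 * k else -f(x), size)

    def weakly_dominates(a, b):
        fa, fb = objs(a), objs(b)
        return fa[0] <= fb[0] and fa[1] <= fb[1]

    def dominates(a, b):
        return weakly_dominates(a, b) and objs(a) != objs(b)

    population = [[0] * n]                                   # start from 0^n (the empty set)
    for _ in range(T):
        x = random.choice(population)                        # line 3: uniform parent selection
        child = [b ^ (random.random() < 1 / n) for b in x]   # line 4: bit-wise mutation
        if not any(dominates(z, child) for z in population): # line 5
            population = [z for z in population              # line 6: drop weakly dominated
                          if not weakly_dominates(child, z)]
            population.append(child)
    feasible = [x for x in population if sum(x) <= k]
    return max(feasible, key=f)                              # line 10: best subset of size <= k
```

In practice one would cache objective values rather than recomputing them in every comparison; the sketch only reproduces the control flow of Algorithm 1.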
## The PORSS Algorithm

To reproduce new solutions in each iteration, POSS applies the mutation operator, which simulates the mutation phenomenon in DNA transformation. Recombination is another popular reproduction operator, which simulates the chromosome exchange phenomenon in zoogamy reproduction, and typically appears in various practical EAs, e.g., the popular algorithm NSGA-II (Deb et al. 2002). In this section, we propose a new Pareto Optimization algorithm with Recombination for Subset Selection, briefly called PORSS. As described in Algorithm 2, PORSS employs recombination and mutation together, rather than mutation only, to generate new solutions in each iteration.

Algorithm 2: PORSS
Input: $V = \{v_1, \ldots, v_n\}$; objective $f: 2^V \rightarrow \mathbb{R}$; budget $k$
Parameter: the number $T$ of iterations
Output: a subset of $V$ with at most $k$ items
Process:
1: Let $x = 0^n$, $P = \{x\}$ and $t = 0$;
2: while $t < T$ do
3: Select $x, y$ from $P$ randomly with replacement;
4: Apply recombination on $x, y$ to generate $x', y'$;
5: Apply bit-wise mutation on $x', y'$ to generate $x'', y''$;
6: for each $q \in \{x'', y''\}$ do
7: if $\nexists z \in P$ such that $z \succ q$ then
8: $P = (P \setminus \{z \in P \mid q \succeq z\}) \cup \{q\}$
9: end if
10: end for
11: $t = t + 1$
12: end while
13: return $\arg\max_{x \in P, |x| \le k} f(x)$

In line 3, two solutions are selected randomly from the population $P$ with replacement, and then recombined in line 4 to generate new solutions. We consider two commonly used recombination operators:

- One-point recombination: select $i \in \{1, 2, \ldots, n\}$ uniformly at random, and exchange the first $i$ bits of the two solutions;
- Uniform recombination: exchange each bit of the two solutions independently with probability 1/2.

For example, for the solutions $0^n$ and $1^n$, the two new solutions $0^{n/2}1^{n/2}$ and $1^{n/2}0^{n/2}$ can be generated by one-point recombination with probability $1/n$, and by uniform recombination with probability $(1/2^n) \cdot 2$, where the factor 2 is included due to symmetry. In line 5, the two solutions generated by recombination are further mutated to generate another two solutions, which are used to update the population $P$ in lines 6-10.
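A minimal Python sketch of the two recombination operators and the subsequent mutation step, as used in lines 4-5 of Algorithm 2 (list-of-bits representation; the function names are illustrative assumptions):

```python
import random

def one_point_recombination(x, y):
    """Pick i uniformly from {1, ..., n} and exchange the first i bits of x and y."""
    i = random.randint(1, len(x))
    return y[:i] + x[i:], x[:i] + y[i:]

def uniform_recombination(x, y):
    """Exchange each bit of x and y independently with probability 1/2."""
    swap = [random.random() < 0.5 for _ in x]
    return ([b if s else a for a, b, s in zip(x, y, swap)],
            [a if s else b for a, b, s in zip(x, y, swap)])

def bitwise_mutation(x):
    """Flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ (random.random() < 1 / n) for b in x]

# One reproduction step of PORSS (lines 3-5 of Algorithm 2) for two parents x and y:
x, y = [0, 0, 1, 1], [1, 1, 0, 0]
xp, yp = uniform_recombination(x, y)
print(bitwise_mutation(xp), bitwise_mutation(yp))
```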
### Influence of Recombination

To understand the influence of recombination intuitively, we compare the distribution of the number of bits flipped with and without recombination. Suppose the two solutions $x, y$ selected in line 3 of Algorithm 2 have Hamming distance $d$, denoted as $H(x,y) = d$. Let $x'', y''$ denote the two solutions generated by recombination and mutation in line 5. We analyze the probability for $x''$ or $y''$ to have Hamming distance $j$ from $x$, denoted as $Q(d,j)$. That is,
$$Q(d,j) = \Pr\big(H(x,x'')=j \vee H(x,y'')=j \mid H(x,y)=d\big).$$
For one-point and uniform recombination, we use the notations $Q_o(d,j)$ and $Q_u(d,j)$, respectively. Let $z^m$ denote the solution generated from a solution $z$ by bit-wise mutation. By turning off recombination, i.e., deleting line 4 of Algorithm 2, we analyze the corresponding probability
$$Q_m(d,j) = \Pr\big(H(x,x^m)=j \vee H(x,y^m)=j \mid H(x,y)=d\big).$$
In the following, we compare $Q_o(d,j)$ and $Q_u(d,j)$ with $Q_m(d,j)$ to examine the influence of recombination.

Given a solution $z$ with Hamming distance $i$ from $x$, let $q_{i,j}$ denote the probability for the Hamming distance to become $j$ after bit-wise mutation, i.e., $q_{i,j} = \Pr(H(x,z^m)=j \mid H(x,z)=i)$. For $Q_m(d,j)$, as $H(x,x)=0$ and $H(x,y)=d$, we have
$$Q_m(d,j) = q_{0,j} + q_{d,j} - q_{0,j}\, q_{d,j}.$$
By uniform recombination, $x$ and $y$ exchange $i$ of their differing bits with probability $\binom{d}{i}(1/2)^i(1/2)^{d-i} = \binom{d}{i}(1/2)^d$, generating two solutions $x', y'$ with $H(x,x')=i$ and $H(x,y')=d-i$. Note that $x$ and $y$ differ in $d$ bits in total. Considering the mutation behavior on $x', y'$, we have
$$Q_u(d,j) = \sum_{i=0}^{d} \frac{\binom{d}{i}}{2^d}\,\big(q_{i,j} + q_{d-i,j} - q_{i,j}\, q_{d-i,j}\big).$$
Now consider one-point recombination. For any $1 \le i \le d$, there exists $l \ge i$ such that exchanging the first $l$ bits of $x, y$ generates two solutions $x', y'$ with $H(x,x')=i$ and $H(x,y')=d-i$. Thus, we have
$$Q_o(d,j) \ge \frac{1}{n}\sum_{i=1}^{d}\big(q_{i,j} + q_{d-i,j} - q_{i,j}\, q_{d-i,j}\big).$$
Because it is sufficient for mutation to keep all bits unchanged, $q_{j,j} \ge (1-1/n)^n \ge 1/(2e)$. Thus, we have

(a) $\forall j \le d: Q_o(d,j) \ge \frac{1}{n}\, q_{j,j} = \Omega(1/n)$;

(b) $\forall j \le d: Q_u(d,j) \ge \frac{\binom{d}{j}}{2^d}\, q_{j,j} = \Omega\big(\binom{d}{j}/2^d\big)$.

By analyzing $q_{0,j}$ and $q_{d,j}$, we can derive that there exists $0 < j_0 \le d$ such that

(c.1) for $j < j_0$: $\Omega\big((1/j)^j\big) \le Q_m(d,j) \le O\big((e/j)^j\big)$;

(c.2) for $j \ge j_0$: $\Omega\big((1/(d-j))^{d-j}\big) \le Q_m(d,j) \le O\big((e/(d-j))^{d-j}\big)$.

The detailed analysis of $Q_m(d,j)$ is provided in the supplementary material due to space limitations.

According to (c.1) and (c.2), the number of bits flipped by mutation only is strongly concentrated around the two extreme values, 0 and $d$. When $j$ increases from 0 towards $j_0$ or decreases from $d$ towards $j_0$, $Q_m(d,j)$ decays super-exponentially. Particularly, for $j = d/2$,
$$q_{0,d/2} \le \binom{n}{d/2}(1/n)^{d/2} \le \frac{1}{(d/2)!}, \qquad q_{d,d/2} \le \binom{d}{d/2}(1/n)^{d/2} \le \frac{1}{(d/2)!},$$
and thus,
$$Q_m(d,d/2) \le \frac{2}{(d/2)!} \le \frac{2}{e\,(d/(2e))^{d/2}}, \qquad (3)$$
where the second inequality holds by Stirling's formula. According to (b), the number of bits flipped by uniform recombination followed by mutation is concentrated around $d/2$, but $Q_u(d,j)$ is always lower bounded by $\Omega(1/2^d)$, which is much greater than $Q_m(d,d/2)$ in Eq. (3) when $d$ is large. According to (a), $\forall j \le d: Q_o(d,j) = \Omega(1/n)$, implying that the number of bits flipped by one-point recombination followed by mutation is relatively uniformly distributed. Therefore, starting from diverse solutions, i.e., when $d$ is large, recombination can ease flipping any number of bits, and may lead to better exploration and thus a better ability to escape from local optima. The advantage of recombination will be verified by theoretical analysis and empirical study.
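As a quick empirical check on these distributions (complementing the bounds above), $Q_m$, $Q_u$ and $Q_o$ can be estimated by simulation; the following Python sketch does so for one arbitrary choice of $n$ and $d$ (all names and parameters are our illustrative assumptions).

```python
import random

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def mutate(z):
    n = len(z)
    return [b ^ (random.random() < 1 / n) for b in z]

def one_point(x, y):
    i = random.randint(1, len(x))
    return y[:i] + x[i:], x[:i] + y[i:]

def uniform(x, y):
    s = [random.random() < 0.5 for _ in x]
    return ([b if t else a for a, b, t in zip(x, y, s)],
            [a if t else b for a, b, t in zip(x, y, s)])

def estimate_Q(n=30, d=20, trials=20000, recombine=None):
    """Estimate Q(d, j) = P(H(x, x'') = j or H(x, y'') = j) with x = 0^n and H(x, y) = d."""
    x, y = [0] * n, [1] * d + [0] * (n - d)
    hits = [0] * (n + 1)
    for _ in range(trials):
        a, b = recombine(x, y) if recombine else (x, y)
        for j in {hamming(x, mutate(a)), hamming(x, mutate(b))}:
            hits[j] += 1
    return [h / trials for h in hits]

Qm, Qu, Qo = estimate_Q(), estimate_Q(recombine=uniform), estimate_Q(recombine=one_point)
print(Qm[10], Qu[10], Qo[10])  # Q(d, d/2): essentially 0 for mutation only, clearly positive otherwise
```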
## Theoretical Analysis

As introduced before, the greedy algorithm and POSS can achieve the optimal polynomial-time approximation guarantee for subset selection with monotone objective functions. A natural question is whether PORSS can keep this optimal approximation. We give a positive answer by proving Theorem 1, i.e., PORSS achieves the approximation guarantee of $(1 - e^{-\gamma_{\min}})$ in polynomial expected running time. Let OPT denote the optimal function value. The proof is inspired by the analysis of POSS (Qian, Yu, and Zhou 2015).

**Lemma 1** (Qian et al. 2016). Let $f: \{0,1\}^n \rightarrow \mathbb{R}^+$ be a monotone function. For any $x \in \{0,1\}^n$, there exists one item $v \notin x$ such that
$$f(x \cup \{v\}) - f(x) \ge \frac{\gamma_{x,k}}{k}\,(\mathrm{OPT} - f(x)).$$

**Theorem 1.** For subset selection with any monotone $f$, the expected number of iterations until PORSS with one-point or uniform recombination finds a solution $x$ with $|x| \le k$ and $f(x) \ge (1 - e^{-\gamma_{\min}}) \cdot \mathrm{OPT}$ is polynomial, where $\gamma_{\min} = \min_{x: |x| = k-1} \gamma_{x,k}$.

**Proof.** Let $J_{\max}$ be the maximum value of $j \in \{0, 1, \ldots, k\}$ such that the population $P$ contains a solution $x$ with $|x| \le j$ and $f(x) \ge (1 - (1 - \gamma_{\min}/k)^j) \cdot \mathrm{OPT}$. That is,
$$J_{\max} = \max\{\, j \in \{0,1,\ldots,k\} \mid \exists x \in P: |x| \le j \wedge f(x) \ge (1 - (1 - \gamma_{\min}/k)^j) \cdot \mathrm{OPT} \,\}.$$
We only need to analyze the expected number of iterations until $J_{\max} = k$, which implies that there exists a solution $x \in P$ satisfying $|x| \le k$ and $f(x) \ge (1 - (1 - \gamma_{\min}/k)^k) \cdot \mathrm{OPT} \ge (1 - e^{-\gamma_{\min}}) \cdot \mathrm{OPT}$.

As PORSS starts from $0^n$, $J_{\max}$ is initially 0. Assume that currently $J_{\max} = i < k$, and let $x$ denote a solution corresponding to $J_{\max} = i$, i.e., $|x| \le i$ and $f(x) \ge (1 - (1 - \gamma_{\min}/k)^i) \cdot \mathrm{OPT}$. First, $J_{\max}$ will not decrease. This is because deleting $x$ from $P$ in line 8 of Algorithm 2 implies that $x$ is weakly dominated by the newly included solution $q$, which satisfies $|q| \le |x| \le i$ and $f(q) \ge f(x) \ge (1 - (1 - \gamma_{\min}/k)^i) \cdot \mathrm{OPT}$.

Next, we analyze the probability of increasing $J_{\max}$ in one iteration. Consider the case that the two solutions selected in line 3 of Algorithm 2 are both $x$, which occurs with probability $(1/|P|) \cdot (1/|P|)$ due to uniform selection with replacement. For two identical solutions, either one-point or uniform recombination in line 4 makes no change. Thus, in line 5, $x$ is used to generate a new solution by bit-wise mutation, and this process is performed twice independently.

For bit-wise mutation on $x$, according to Lemma 1, a new solution $x'$ satisfying $f(x') - f(x) \ge (\gamma_{x,k}/k)(\mathrm{OPT} - f(x))$ can be generated by flipping only one specific 0-bit of $x$ (i.e., adding one specific item into $x$), which occurs with probability $(1/n)(1-1/n)^{n-1} \ge 1/(en)$. As $f(x) \ge (1 - (1 - \gamma_{\min}/k)^i) \cdot \mathrm{OPT}$, we have
$$f(x') \ge (1 - \gamma_{x,k}/k)\, f(x) + (\gamma_{x,k}/k) \cdot \mathrm{OPT} \ge \big(1 - (1 - \gamma_{x,k}/k)(1 - \gamma_{\min}/k)^i\big) \cdot \mathrm{OPT} \ge \big(1 - (1 - \gamma_{\min}/k)^{i+1}\big) \cdot \mathrm{OPT}.$$
Note that the last inequality holds by $\gamma_{x,k} \ge \gamma_{\min}$, because $|x| < k$ and $\gamma_{x,k}$ decreases with $x$. As $x$ is mutated twice independently in line 5, such a new solution $x'$ can be generated with probability at least $1 - (1 - 1/(en))^2 = 2/(en) - 1/(en)^2$. It is clear that $|x'| = |x| + 1 \le i + 1$. Then, $x'$ will be added into $P$; otherwise, $x'$ must be dominated by one archived solution in line 7 of Algorithm 2, which would imply that $J_{\max}$ is already larger than $i$, contradicting the assumption $J_{\max} = i$. After adding $x'$ into $P$, $J_{\max} \ge i + 1$. Thus, $J_{\max}$ can increase by at least 1 in one iteration with probability at least $(1/|P|^2) \cdot (2/(en) - 1/(en)^2)$.

By the procedure of updating the population $P$ in lines 6-10, the solutions in $P$ must be incomparable. Thus, each value of one objective can correspond to at most one solution in $P$. Because the solutions with $|x| \ge 2k$ have value $+\infty$ on the first objective, they must be excluded from $P$; thus, $|x| \in \{0, 1, \ldots, 2k-1\}$, implying $|P| \le 2k$. We can now conclude that the probability of increasing $J_{\max}$ in one iteration is at least $(1/(4k^2)) \cdot (2/(en) - 1/(en)^2) = \Omega(1/(k^2 n))$.

The above analysis shows that $J_{\max}$ will not decrease, but can increase with probability $\Omega(1/(k^2 n))$ in one iteration. Thus, the expected number of iterations until $J_{\max}$ increases by at least 1 is $O(k^2 n)$. For $J_{\max} = k$, it requires increasing $J_{\max}$ at most $k$ times, implying that the expected number of iterations until finding a solution with the desired approximation guarantee is $O(k^3 n)$, which is polynomial.

Next, by an illustrative example of subset selection, we prove that PORSS can perform much better than the greedy algorithm and POSS. As presented in Definition 5, the best subset of size $i+1$ can be generated from the best subset of size $i$ by adding one specific item; the only exception is the best subset of size $k$, i.e., the optimal solution, which differs greatly from the best subsets of the other sizes. This example represents subset selection problems where decisions have to be made in sequence to some extent.

**Definition 5.** The objective function $f$ satisfies that (1) $\forall\, 0 \le i \le n-1: f(x^i) < f(x^{i+1})$; (2) if $|x| = i \ne k$, then $\forall x \ne x^i: f(x) < f(x^i)$; (3) if $|x| = k$, then $\forall x \notin \{x^*, x^k\}: f(x)$