# evolving_solutions_to_communitystructured_satisfiability_formulas__2a3e9ddb.pdf

The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Evolving Solutions to Community-Structured Satisﬁability Formulas

Frank Neumann Optimisation and Logistics School of Computer Science The University of Adelaide Adelaide, Australia

Andrew M. Sutton Department of Computer Science University of Minnesota Duluth Duluth, MN, USA

We study the ability of a simple mutation-only evolutionary algorithm to solve propositional satisﬁability formulas with inherent community structure. We show that the community structure translates to good ﬁtness-distance correlation properties, which implies that the objective function provides a strong signal in the search space for evolutionary algorithms to locate a satisfying assignment efﬁciently. We prove that when the formula clusters into communities of size s ω(log n) O(nε/(2ε+2)) for some constant 0 < ε < 1, and there is a nonuniform distribution over communities, a simple evolutionary algorithm called the (1+1) EA ﬁnds a satisfying assignment in polynomial time on a 1 o(1) fraction of formulas with at least constant constraint density. This is a signiﬁcant improvement over recent results on uniform random formulas, on which the same algorithm has only been proven to be efﬁcient on uniform formulas of at least logarithmic density.

Introduction

Evolutionary algorithms are a broad class of incomplete randomized search heuristics that are sometimes applied to tackling difﬁcult optimization problems that arise in practice. We study the performance of a simple evolutionary algorithm tasked with ﬁnding a satisfying assignment to structured (non-uniform) propositional formulas expressed as the conjunction of m Boolean disjunctive clauses of size exactly k over n variables. A series of recent papers (Sutton and Neumann 2014; Doerr, Neumann, and Sutton 2017; Buzdalov and Doerr 2017) have derived rigorous running time bounds on a variety of evolutionary algorithms applied to solving propositional formulas generated uniformly at random. One criticism of the uniform random model is that it inadequately characterizes propositional formulas that arise from practical applications. Industrial SAT instances are far more complex, and contain complicated structural characteristics. Among these are modularity (Gir aldez-Cru and Levy 2016), heterogeneity (Ans otegui, Bonet, and Levy 2009), self-similarity (Ans otegui et al. 2014), and locality (Gir aldez-Cru and Levy 2017). Several non-uniform for-

Copyright c 2019, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

mula distributions have been proposed to model the above characteristics better. In this paper, we are interested in extending the analysis to structured formulas that exhibit features that are statistically closer to industrial satisﬁability problems. We conduct an analysis of the (1+1) EA over a generalization of the community attachment model of Gir aldez-Cru and Levy (2016) for modular SAT formulas. Figure 1 illustrates the constraint graph structure for an industrial formula (from bounded model checking), a formula generated by the community attachment model, and a formula generated uniformly at random. We show that the (1+1) EA can solve in polynomial time a 1 o(1) fraction of satisﬁable formulas drawn from the community attachment model with constraint density (average degree) that is at least a sufﬁciently large constant, provided that the density within communities is nonuniform. This bound covers a larger region of the constraint density spectrum than previous results on evolutionary algorithms on the uniform random model. In particular, analogous results were proved for the (1+1) EA on the uniform random model, but only for formulas of density at least Ω(log n) (Doerr, Neumann, and Sutton 2017). For polylogarithmic densities, a more complicated evolutionary algorithm called the (1+(λ,λ)) GA yields a log n speed up on formulas with density ω(log2 n), and even further improvements on asymptotically larger densities when the population size is adapted (Buzdalov and Doerr 2017). Randomized search heuristics such as Walk SAT (Selman, Kautz, and Cohen 1994), Sch oning s algorithm (Sch oning 1999) and evolutionary algorithms are incomplete in the sense that they can only locate a satisfying assignment if it exists, and not decide an unsatisﬁable formula. A performance guarantee on satisﬁable formulas for such algorithms allows for the construction of a Monte Carlo decision algorithm with one-sided error (Sutton and Neumann 2014). It therefore sufﬁces to derive time bounds only on satisﬁable formulas. Ideally, given an instance distribution, one would like to derive performance guarantees over the distribution conditioned on satisﬁability. This is sometimes referred to as ﬁltering (Kautz et al. 2001). Unfortunately, ﬁltering is difﬁcult in practice (as it requires a complete solver to decide satisﬁability), and creates complicated dependencies

(a) Industrial formula

(b) Community attachment

(c) Uniform random

Figure 1: Constraint graphs for different formulas partitioned into community components by the Gephi tool (Bastian, Heymann, and Jacomy 2009). The industrial formula is a bounded model checking instance. The synthetic formulas both have n = 1162, m = 3738 and k = 3. The community attachment formula has s = 83 and p = 0.696.

that are difﬁcult to control in analysis. To address this, previous work analyzing the running time of evolutionary algorithms on the random propositional satisﬁability model (Sutton and Neumann 2014; Doerr, Neumann, and Sutton 2017; Buzdalov and Doerr 2017) investigate the ﬁtness-distance correlation on the random planted model, in which formulas are forced to be satisﬁable by hiding some satisfying assignment within the formula. For reasonably dense formulas, this corresponds to the uniform random ﬁltered model, i.e., the uniform model conditioned on satisﬁability (Doerr, Neumann, and Sutton 2017). We point out that random planted formulas are easy to solve for classical algorithms (Krivelevich and Vilenchik 2006), but our goal is to advance the theoretical analysis of incomplete search heuristics on random satisﬁability models. To derive performance guarantees on satisﬁable community-structured formulas, we investigate a planted modular satisﬁability model. In this model, we choose uniformly at random an assignment x from the set of all assignments. For each k-set of Boolean variables, we only allow clauses from the 2k 1 unique clauses on those variables that are satisﬁed by x (rather than the original 2k). Similar to the uniform model, for reasonably dense formulas the planted modular SAT model converges to the ﬁltered modular SAT model. The proof of this claim is a straightforward adaptation of the analogous proof for the uniform model (Doerr, Neumann, and Sutton 2017, Theorem 4), and we omit it for space. Formally, we consider propositional k-CNF formulas F over n Boolean variables:

i=1 (ℓi,1 ℓi,2 ℓi,k)

where ℓi,j is one of the n variables or its negation. The set of n variables corresponds to the set [n] := {1, 2, . . . , n}, and we will refer to these sets interchangeably. Let s := s(n) be

a divisor of n. The community attachment model (Gir aldez Cru and Levy 2016) is deﬁned as the set of all propositional k-CNF formulas F(n, m, k, s, p) that are constructed as follows. Partition [n] into t = n/s disjoint subsets Γ1, . . . , Γt where Γℓ= {s(ℓ 1) + 1, s(ℓ 1) + 2, . . . , sℓ}. We call a k-CNF clause localized if it contains only k literals that involve variables from the same community. We call a clause separated if it does not contain any pairs of literals that involve variables from the same community. Each clause is generated by drawing a localized clause with probability p uniformly at random, and a separated clause uniformly at random with probability 1 p. Note that, except when k = 2, the model does not contain uniform random k-SAT as a special case. When k = 2, setting p = (s 1)/(n 1) recovers the uniform random 2-SAT model. Empirically, the model correlates better to industrial formulas in the sense that solvers that specialize on industrial instances perform better on highly modular instances than on formulas that are closer to uniform (Gir aldez-Cru and Levy 2016). In real world data, the density of communities is not uniform. For example, the density of each community for the industrial bounded model checking formula shown on the left of Figure 1 is charted in Figure 2. To address this phenomenon, we generalize the community attachment model of Gir aldez-Cru and Levy (2016). In particular we deﬁne a distribution 0 πℓ 1 for ℓ [t] such that Pt ℓ=1 πℓ= 1. A localized clause is chosen from community ℓwith probability πℓ. The original community attachment model is recovered with the uniform distribution πℓ= 1/t for all ℓ [t]. Let F be a propositional formula on n vertices expressed as the conjunction of m disjunctive clauses. The constraint graph of F is an undirected graph G = (V, E) whose n vertices V represent the variables of F, and (u, v) E if

Figure 2: Relative density of communities in industrial bounded model checking instance in Figure 1a. For a community of size s, the community density is the fraction of s 2 pairs that are connected by an edge.

and only if u and v appear as literals together in a clause of F. The constraint graph of a formula is an important tool to understanding the structure of the formula, and how that structure might be exploited by SAT solvers. The modularity of G (Newman and Girvan 2004) measures the strength of division into communities. Given a partition Γ of V into communities Γ1, Γ2, . . ., the modularity of G with respect to Γ is the fraction of edges within a community minus the expected fraction if the edges were randomly distributed. Given a community Γℓ Γ, denote as E(Γℓ) E the set of edges within the community. The modularity of G with respect to Γ is

Q(G, Γ) = X

The modularity of G is Q(G) = maxΓ Q(G, Γ). For a community-structured SAT formula F F(n, m, k, s, p), the expected modularity of the underlying constraint graph of F is bounded by E(Q(G)) p s/n (Gir aldez-Cru and Levy 2016, Theorem 2). The (1+1) EA is a basic randomized search heuristic that functions as a kind of degenerate case for more complicated evolutionary techniques. It maintains a population size of one, and produces a single offspring via a mutation operation in each generation. Selection is performed by comparing the offspring to its parent. The parent is replaced only if the offspring is at least as ﬁt as the parent. The mutation operation changes the current individual in a minimal way. In the case of length-n binary strings, the state of each variable is negated with probability 1/n. In this way, the (1+1) EA, illustrated in Algorithm 1 operates similarly to a simple local search hill-climber (such as GSAT), with the distinction that in each step, larger jumps in the search space are possible. The probability that mutation produces a jump of Hamming distance d falls off as a degree-d polynomial in n. The distinction between the (1+1) EA and a simple hill-climber is small, but important. Current running time bounds for local search-like heuristics on logarithmically dense formulas of the uniform model depend on the ability of the algorithm to change more than one variable at a time (Doerr, Neumann, and Sutton 2017).

Algorithm 1: (1+1) EA

1 Choose x uniformly at random from {0, 1}n;

2 while stopping criterion not met do

4 foreach i {1, . . . , n} do

5 With probability 1/n, yi (1 yi);

6 if f(y) f(x) then x y;

Search space of planted modular formulas As is usual with evolving assignments to propositional satisﬁability formulas, we consider each full assignment to n Boolean variables as strings over a binary alphabet. This allows us to identify each assignment with a ﬁtness function f : {0, 1}n {0, 1, . . . , m} such that f(x) counts the number of clauses satisﬁed by the assignment corresponding to the binary string x. Similar to previous analyses on propositional satisﬁability (Sutton and Neumann 2014; Doerr, Neumann, and Sutton 2017; Buzdalov and Doerr 2017), we will rigorously characterize the ﬁtness-distance correlation in typical ﬁtness landscapes induced by f together with the Hamming metric on {0, 1}n. Fitness-distance correlation was introduced decades ago as a measure of difﬁculty for evolutionary algorithms (Jones and Forrest 1995), and describes the relatedness of ﬁtness values to the distance to an optimal solution. In the context of planted formulas, we prove that the ﬁtness of a point is strongly correlated to the distance from the planted solution (although a search algorithm may discover a different satisfying assignment before reaching the planted solution). In the community attachment model, it will be useful to partition the ﬁtness function by contribution from different communities. We take fℓ(x) to be the count of localized clauses from community ℓthat are satisﬁed by x, and fsep(x) to be the count of separated clauses satisﬁed by x. Thus

f(x) = fsep(x) +

ℓ=1 fℓ(x). (1)

Furthermore, when I [t], we take f I(x) to mean the contribution P

ℓ I fℓ(x) to f from the localized clauses in the communities indexed by I. In the following section, we introduce a number of deﬁnitions and preliminary lemmas that support the main result.

Preliminaries We deﬁne the following distance notation, keeping in mind that x corresponds to a ﬁxed but arbitrary planted assignment. Deﬁnition 1. For x, y {0, 1}n, we denote the Hamming distance of x and y as d(x, y). For a community Γℓ [n], we denote

d(x, y; Γℓ) = |{i Γℓ: xi = yi}|

to be the Hamming distance of x and y restricted to community Γℓ. Finally, when working with a planted solution x ,

we often simply use d(x) and d(x; Γℓ) to denote d(x, x ) and d(x, x ; Γℓ), respectively. We require the following deﬁnition that characterizes binary strings that are in some sense not too far from the planted solution when restricted to any community. Deﬁnition 2. For a formula in Fx (n, m, k, s, p), we say a point x {0, 1}n is ϵ-balanced with respect to the formula if d(x; Γℓ) s(1/2 + ϵ) for every ℓ {1, 2, . . . , n/s}. As expected, so long as s is not too small, most strings in {0, 1}n are ϵ-balanced. Moreover, we will also implicitly use the straightforward fact that the ϵ-balance property is closed under steps toward the planted solution. Lemma 1. For every ﬁxed 0 < ϵ < 1/2, if F is a formula in Fx (n, m, k, s, p) with each s = ω(log n), then a uniformly chosen string from {0, 1}n is ϵ-balanced with respect to F with probability 1 e Ω(s).

Proof. Let x {0, 1}n be chosen uniformly at random. Thus for any i [n], Pr(xi = x i ) = 1/2. For an arbitrary partition Γℓ, the probability that d(x; Γℓ) > (1 + 2ϵ) s

2 is at most exp( 2sϵ2/3). Applying a union bound over all t < n communities, x violates ϵ-balance in at least one community with probability at most exp( 2sϵ2/3+ln t) = e Ω(s).

We also deﬁne the following expression to count the clauses in a particular clause set that have a different satisﬁability state between two assignments. Deﬁnition 3. Let C be a set of clauses. For any x, y {0, 1}n, denote as xy(C) the number of clauses in C that are unsatisﬁed by x but satisﬁed by y. Finally, we will make use of the following simple proposition to count types of clauses in a formula. Proposition 1. Let d k s be integers. Then

Proof. Since r d r = d d 1 r 1 , we can rewrite the sum on the LHS as

s d k (r + 1)

= d s 1 k 1

where we have applied the Chu-Vandermonde identity (Gould 1956).

Fitness signal from localized clauses For a particular community Γℓ, denote as Cℓthe set of all legal k-CNF clauses formed from Γℓthat are satisﬁed by x . For all ℓ {1, 2, . . . , t}, |Cℓ| = (2k 1) s k since there are 2k ways to negate variables chosen from any k-set of Γℓ, but exactly one of them is unsatisﬁed by x . We count the potential contribution to the increase in ﬁtness of any legal clause in Cℓfor all the points that are closer to the planted solution.

Lemma 2. Consider a community Γℓand let x {0, 1}n be arbitrary. Let Cℓdenote the set of localized clauses for Γℓthat are satisﬁed by a planted assignment x . Then

y:d(x) d(y)=1 xy(Cℓ) = d(x; Γℓ)k

Proof. Let I Γℓbe the index set corresponding to the positions in x that differ from x in community Γℓ. A clause in Cℓis unsatisﬁed by x but satisﬁed by some y with d(x) d(y) = 1 if (1) all its literals are false under x, and (2) it contains a literal l such that the variable associated with l appears in I. If it contains r |I| literals with associated variables indexed by elements of I, then that clause contributes exactly r times to the sum of xy(Cℓ) over the Hamming neighbors of x that are closer to x . Such a k-clause can be constructed by choosing the r variables from I and the remaining k r variables from Γℓ\ I. The negation of the k variables in the clause is the unique negation pattern on the k chosen variables so that the clause is not satisﬁed by x. There are |I| r s |I| k r ways to choose a k-set that corresponds to a clause that contributes r to the sum, and the total contribution of all such clauses to the sum is given by Proposition 1. Since |I| = d(x; Γℓ), the RHS of the claim follows from the proposition. To ensure that we have not overor under-counted, note that each of these clauses contain r > 0 literals, each of which must be true under some y with d(y) = d(x) 1. For any one of these literals, there is a unique i |I| that corresponds to the variable that must be ﬂipped to move from x to y. Since d(x; Γℓ) d(y; Γℓ) = 1, yi = x i , and so that literal is also true under x . It must therefore be the case that each of the clauses we have counted above are also satisﬁed by x , and are hence all valid clauses in Cℓ.

We also need to count the potential contribution to the decrease in ﬁtness of any legal clause in Cℓfor all the points that are one step closer to the planted solution. This is captured by the following lemma.

Lemma 3. Consider a community Γℓand let x {0, 1}n be arbitrary. Let Cℓdenote the localized clauses for Γℓthat are satisﬁed by a planted assignment x . Then X

y:d(x) d(y)=1 yx(Cℓ) = d(x; Γℓ) s 1 k 1 s d(x;Γℓ) k 1 .

Proof. For each k-set of variables from Γℓ, there is only one way to negate the variables to produce a unique clause (up to commutativity) that is not satisﬁed by an assignment y. There are d(x; Γℓ) bitstrings y with d(x) d(y) = 1 that differ from x in exactly one position i Γℓ, and for each of these, s 1 k 1 ways to choose the remaining k 1 variables to appear in a clause. However, s d(x;Γℓ) k 1 of these clauses are not allowed because all k corresponding positions differ from x , and hence such a clause would not be satisﬁed by x , and is therefore not a legal clause of Cℓ. Subtracting the count of these clauses yields the RHS of the claim.

Lemma 4. Let F be a planted modular formula from Fx (n, m, k, s, p) with k, ϵ = Θ(1), and let I [t]. With probability

ℓ I πℓd(x; Γℓ)

the following bound holds at every ϵ-balanced point x.

y:d(x) d(y)=1 f I(y) f I(x) = Θ

ℓ I πℓd(x, Γℓ)

Proof. Fix x to be an arbitrary ϵ-balanced point. Let N = {y : d(x) d(y) = 1}. Note that x , x and N are ﬁxed for the remainder of the proof. We deﬁne two sequences of m random variables over the probability space of randomly constructed formulas F from Fx (n, m, k, s, p). Let αi denote the i-th clause of F and let {Y1, Y2, . . . , Ym} be the sequence

Yi = 0 if αi is separated or true under x, |{y N : αi is true under y}| otherwise.

If αi is localized, Yi counts the total number of times that αi switches from false to true in the set of Hamming neighbors of x that are strictly closer to the planted solution. Similarly, we deﬁne the sequence {Z1, Z2, . . . , Zm} as

Zi = 0 if αi is separated or false under x, |{y N : αi is false under y}| otherwise.

Since a clause can be false under only one conﬁguration of its k variables, Zi {0, 1}. When αi is localized, Zi indicates the presence of an element of N for which αi switches from true to false. As each of the m clauses is chosen independently, both {Yi : i [m]} and {Zi : i [m]} are sequences of independent random variables. For any ℓ I, we choose clause αi from Cℓwith probability pπℓ. Furthermore, the probability that αi is chosen from the set of clauses that change from false to true when moving from any x to y is exactly xy(Cℓ)/|Cℓ|, and we calculate the expectation as follows.

E(Yi) = p X

y:d(x;Γℓ) d(y;Γℓ)=1 xy(Cℓ)

= p (2k 1) s k X

πℓd(x; Γℓ)k

, by Lemma 2

= pk (2k 1)s

ℓ I πℓd(x; Γℓ).

E(Zi) = p X

y:d(x;Γℓ) d(y;Γℓ)=1 yx(Cℓ)

P ℓ I πℓd(x; Γℓ) s 1 k 1 s d(x;Γℓ) k 1 ,

which holds by Lemma 3. Furthermore, since x is supposed to be ϵbalanced, d(x; Γℓ) s(1/2 + ϵ), and thus

P ℓ I πℓd(x; Γℓ) s 1 k 1 s(1/2 ϵ) k 1

= p P ℓ I πℓd(x;Γℓ)

k s k(1/2 ϵ)k 1

= pk P ℓ I πℓd(x;Γℓ)

(2k 1)s (1 (1/2 ϵ)k 1 + O(1/s)).

Note that the total ﬁtness change in localized clauses indexed by I moving from x to all y N is equivalent to the difference in the sums of the random variables that count localized clauses that switch state between x and all y N. In other words, X

y:d(x) d(y)=1 f I(y) f I(x) =

i=1 (Yi Zi) .

As 0 Yi k and 0 Zi 1, setting Y = Pm i=1 Yi/k and Z = Pm i=1 Zi, for any arbitrary constant 0 < δ < 1 the probability Y [(1 δ)E(Y/k), (1 + δ)E(Y )] and Z [0, (1 + δ)E(Z)] is bounded as claimed by applying a multiplicative Chernoff bound, completing the proof.

Efﬁciently solving constant-density formulas The result of Lemma 4 estimates the probability that a small local mutation can detect a move in the direction of the planted solution. In the original community attachment model, πℓ= 1/t, that is, a localized clause is chosen uniformly over all communities. We consider a slight generalization in which communities are no longer required to be uniform in density. This generalization is arguably more realistic, as it more closely matches the variability in community density observed in real-world data (see Figure 2). For any small constant 0 < ε < 1, we partition the communities so that a nonempty set I [t] of communities are dense in the sense that πℓ= Ω(1/t1 ε) for each ℓ I and the remaining are sparse so that πℓ= O(1/t1+ε) for ℓ [t] \ I. Theorem 1. Let 0 < ε < 1 be any small constant. Let s ω(log n) O(nε/(2ε+2)), set p = 1 O(1/n1+ε) and k = Θ(1). Let I [t] with

πℓ= Ω 1 t1 ε for ℓ I, O 1 t1+ε for ℓ [t] \ I.

For constraint densities m/n c where c is a sufﬁciently large constant, all but a vanishing fraction of planted modular formulas F in Fx (n, m, k, s, p) are solved in polynomial expected time by the (1+1) EA.

Proof. We ﬁrst argue that all but a fast-vanishing fraction of formulas in Fx (n, m, k, s, p) have a good drift condition on f I. Let x {0, 1}n be an arbitrary ϵ-balanced string. Deﬁne

N = {y : d(x) d(y) = 1 xi = yi = ℓ I, i Γℓ}

to be the set of P

ℓ I d(x; Γℓ) neighbors that are closer to x in some dense community indexed by I. Applying Lemma 4, the bound

y N f I(y) f I(x) = Θ

ℓ I d(x, Γℓ)

holds with probability

ℓ I d(x; Γℓ)

Since f I is independent of any bit not in S

ℓ I Γℓ, applying a union bound over all ϵ-bounded length-|I|t substrings indexed by S

ℓ I Γℓlying at distance d from the substring of x indexed by S ℓ I Γℓ, we ﬁnd this strong drift condition holds with probability at least

exp Ω(dnε2) 1 exp Ω(nε2) .

Here we have used the fact that m/(st1 ε) = Ω(nε2). We now show that the ﬁtness contribution from other clauses is small. The expected number of separated clauses in the entire formula is bounded as cn(1 p) = O(1/nε) which exceeds a constant only with probability o(1) by Markov s inequality. Therefore, with probability 1 o(1), at every ϵ-balanced point, X

y N f(y) f(x) X

y N f I(y) f I(x) O(1)

ℓ I d(x, Γℓ)

ℓ I d(x; Γℓ). Recall that N is the set of all Hamming neighbors of x that are closer to x by some variable in a dense community (those indexed by I). A simple proof by induction on distance in dense communities from the planted solution (see Lemma 6 of (Doerr, Neumann, and Sutton 2017)) yields that f I(x) + fsep(x) = O(d) since m/n = Θ(1). As any Hamming neighbor of x is generated by standard mutation with probability 1 1

n e 1/n, we can bound the drift of f I + fsep at every ϵ-balanced point x by

y N max {0, f(y) f(x)} = Ω

= Ω f I(x) + fsep(x)

As long as the the (1+1) EA never generates a point that is not ϵ-balanced with respect to the dense communities, there is sufﬁcient drift toward the planted solution in the length- |I|t substring induced by the dense communities. The initial point is drawn uniformly at random, so Lemma 1 guarantees an ϵ-balanced string with probability at least 1 e Ω(nε). Finally, as long the current point x of the (1+1) EA is ϵbalanced, the probability of reaching a non-ϵ-balanced point within any polynomial number of iterations is superpolynomially close to zero by using Lemma 4 together with the negative drift theorem (Oliveto and Witt 2011; 2012) at every dense community. Applying the multiplicative drift theorem (Doerr, Johannsen, and Winzen 2012), the dense communities are solved in expected time O(n1 ε2 log n).

It remains to bound the time until the remaining variables are solved. Fix ℓ [t] \ I to be the index to an arbitrary sparse community, and let {Xi : 1 i m} be the indicator random variable for localized clauses from Γℓ. The expected number of localized clauses in the community is E (Pm i=1 Xi) = cnπℓ= O 1 nε/2 , since for sparse communities we have πℓ= O(t (1+ϵ)) and we have assumed the community-size bound of s = O(nε/(2ε+2)). Finally, we apply a standard Chernoff bound to the sum of indicator random variables to ensure that the sum only exceeds a constant b > 1 with probability O(1/nbε). Thus each of the O(t) sparse communities contribute at most b localized clauses to f with O(1/nε) probability. Since we have assumed there are only a constant number of separated clauses in total, the total number of clauses associated with bits in Γℓis a constant, and each of the remaining community substrings can be optimized in O(tn O(1)) = n O(1) time, concluding the proof.

Experiment Results In order to characterize the leading constants and demonstrate the tightness of the bound, we perform a number of numerical experiments to measure the run time of the (1+1) EA on the community attachment model. Since our proofs only require the constraint density to be a sufﬁciently large but unspeciﬁed constant, we generate run length distribution (RLD) curves of the (1+1) EA as a function of constraint density to observe its behavior. In Figure 3, we ﬁx n = 1000, s = 100 and p = 3/4 and vary m to plot the empirical RLD curves. On the standard uniform planted model there is a critical constant density near which the performance of the (1+1) EA degrades signiﬁcantly (Doerr, Neumann, and Sutton 2017). There is likely to be a similar effect in the community attachment model, and we conjecture that the performance degradation as density decreases is an expression of this effect. According to the RLD curves of Figure 3, the density of m/n = 5e is sufﬁciently large to escape this deterioration. We therefore choose this constant to be large enough of a density for our run time experiments. For each s = {100, 110, 120, . . . , 1000}, we generate 10 modular formulas using the community attachment model with n = s3/2 and m/n = 5e. On each formula, we measure the run time of the (1+1) EA for 10 trials. To establish a segmentation into dense and sparse communities, we identify t(1 ε) communities as dense and choose localized clauses uniformly from dense communities with probability 1 t ε. The remaining communities are chosen uniformly with probability t ε. Thus a localized clause from a particular dense community is generated with probability (1 t ε)/ t(1 ε) = Ω(1/t(1 ε)) and from a particular sparse community with probability t ε/(t t(1 ε) ) = O(1/t1+ε) as required by the proof of Theorem 1. We conjecture, however, that the asymptotic behavior of the (1+1) EA does not depend strongly on this condition. Nor do we believe that the requirement for vanishing separated clauses is necessary. In Figure 4, we plot the median run time divided by n ln n

104.5 105 105.5 10 2

generations

proportion solved

m/n = 2e m/n = 3e m/n = 4e m/n = 5e

Figure 3: Empirical run length distributions on community attachment model, controlling for density. The curves are created by 100 runs of the (1+1) EA at each constraint density value m/n {2e, 3e, 4e, 5e}. Each formula contains n = 1000 variables, community size s = 100, and localized clause probability p = 3/4.

measured in the experiments for p {3/4, 1/4} and ε = 1/5. We also compare the run time of the (1+1) EA solving formulas generated with the community attachment model over uniform communities. Empirically, the median run time converges to roughly 3.1 n ln n, thus our time bound for solving the dense communities appears to be closer to the truth, even when the count of separated clauses is non-vanishing. We also ﬁnd support for our conjecture that the uniformity of communities does not change the asymptotic behavior dramatically, and overall, the running time seems to be largely invariant to localized clause probability.

Conclusions The uniform random SAT model has been studied intensely for decades. Recently, several works have introduced nonuniform models that produce formulas exhibiting structural features that correspond better to real-world industrial instances. In this paper we push the ﬁeld of run time analysis of randomized search heuristics toward these non-uniform models by conducting a rigorous analysis of the running time of the (1+1) EA on the recently proposed community attachment random SAT model, which has tunable modularity. We prove that, as long as the the community size is polynomially smaller than the number of variables, the EA can be efﬁcient down to constant constraint densities. This can be contrasted to the uniform random model, on which the best known results require a density of Ω(log n). Our analysis leverages the fact that in the community-structured

0.5 1 1.5 2 2.5 3 104

# iterations /n ln n

p = 3/4, ε = 1/5

p = 1/4, ε = 1/5

p = 3/4, uniform communities

p = 1/4, uniform communities

Figure 4: Median run time of the (1+1) EA divided by n ln n as a function of n on formulas sampled from the community attachment model, p {3/4, 1/4}, and m = 5en. Both non-uniform communities (ε = 1/5) and uniform communities are plotted. The error bars denote the interquartile range. The statistics are taken from 10 runs each on 10 random formulas generated for each value of n = s3/2 varying community size s from 100 to 1000.

formulas the ﬁtness signal uncovered by mutations concentrates around the frequency of dense localized clauses, rather than at the overall constraint density. Nevertheless, we conjecture that the asymptotic behavior of the EA is similar on constant-density uniform formulas.

References Ans otegui, C.; Bonet, M. L.; Gir aldez-Cru, J.; and Levy, J. 2014. The Fractal Dimension of SAT Formulas. In Proceedings of the Seventh International Joint Conference on Automated Reasoning (IJCAR), 107 121. Springer. Ans otegui, C.; Bonet, M. L.; and Levy, J. 2009. On the structure of industrial SAT instances. In Proceedings of the Fifteenth Conference on Constraint Programming (CP), 127 141. Bastian, M.; Heymann, S.; and Jacomy, M. 2009. Gephi: An open source software for exploring and manipulating networks. In Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM). Buzdalov, M., and Doerr, B. 2017. Runtime analysis of the (1 + (λ, λ)) genetic algorithm on random satisﬁable 3-CNF formulas. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 1343 1350. ACM. Doerr, B.; Johannsen, D.; and Winzen, C. 2012. Multiplicative drift analysis. Algorithmica 64:673 697. Doerr, B.; Neumann, F.; and Sutton, A. M. 2017. Time

complexity analysis of evolutionary algorithms on random satisﬁable k-CNF formulas. Algorithmica 78(2):561 586. Gir aldez-Cru, J., and Levy, J. 2016. Generating SAT instances with community structure. Artiﬁcial Intelligence 238:119 134. Gir aldez-Cru, J., and Levy, J. 2017. Locality in random SAT instances. In Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI), 638 644. Gould, H. W. 1956. Some generalizations of Vandermonde s convolution. The American Mathematical Monthly 63(2):84 91. Jones, T., and Forrest, S. 1995. Fitness distance correlation as a measure of problem difﬁculty for genetic algorithms. In Proceedings of the Sixth International Conference on Genetic Algorithms (ICGA), 184 192. Morgan Kaufmann. Kautz, H. A.; Ruan, Y.; Achlioptas, D.; Gomes, C. P.; Selman, B.; and Stickel, M. E. 2001. Balance and ﬁltering in structured satisﬁable problems (preliminary report). Electronic Notes in Discrete Mathematics 9:2 18. Krivelevich, M., and Vilenchik, D. 2006. Solving random satisﬁable 3CNF formulas in expected polynomial time. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 454 463. ACM Press. Newman, M. E. J., and Girvan, M. 2004. Finding and evaluating community structure in networks. Phys. Rev. E 69:026113. Oliveto, P. S., and Witt, C. 2011. Simpliﬁed drift analysis for proving lower bounds in evolutionary computation. Algorithmica 59(3):369 386. Oliveto, P. S., and Witt, C. 2012. Erratum: Simpliﬁed drift analysis for proving lower bounds in evolutionary computation. ar Xiv:1211.7184 [cs.NE]. Sch oning, U. 1999. A probabilistic algorithm for k-SAT and constraint satisfaction problems. In Proceedings of the Fortieth Annual Symposium on Foundations of Computer Science (FOCS), 410 414. IEEE Computer Society. Selman, B.; Kautz, H. A.; and Cohen, B. 1994. Noise strategies for improving local search. In Proceedings of the Twelfth National Conference on Artiﬁcial Intelligence (AAAI), 337 343. Sutton, A. M., and Neumann, F. 2014. Runtime analysis of evolutionary algorithms on randomly constructed highdensity satisﬁable 3-CNF formulas. In Proceedings of the Thirteenth International Conference on Parallel Problem Solving from Nature (PPSN), volume 8672 of Lecture Notes in Computer Science, 942 951. Springer.