# Sweep-Based Propagation for String Constraint Solving

Roberto Amadini, Graeme Gange, Peter J. Stuckey
Department of Computing and Information Systems
The University of Melbourne, Australia
{roberto.amadini,gkgange,pstuckey}@unimelb.edu.au

Solving constraints over strings is an emerging and important field. Recently, a Constraint Programming approach based on dashed strings has been proposed to enable a compact domain representation for potentially large bounded-length string variables. In this paper, we present a more efficient algorithm for propagating equality (and related constraints) over dashed strings. We call this propagation sweep-based. Experimental evidence shows that sweep-based propagation significantly outperforms state-of-the-art approaches for string constraint solving.

Introduction

Constraint solving over strings is an important field, given the ubiquity of strings in different domains such as, e.g., software verification and testing (Emmi, Majumdar, and Sen 2007; Bjørner, Tillmann, and Voronkov 2009), model checking (Gange et al. 2013), and web security (Bisht et al. 2011; Thomé et al. 2017). Various approaches of disparate nature have been proposed, based for instance on bit-vectors (Kiezun et al. 2012), automata (Hooimeijer and Weimer 2012; Li and Ghosh 2013; Tateishi, Pistoia, and Tripp 2013), SMT extensions (Berzish, Zheng, and Ganesh 2017; Liang et al. 2014; Abdulla et al. 2015; Trinh, Chu, and Jaffar 2014; Yu, Alkhalaf, and Bultan 2010; Saxena et al. 2010), and constraint programming (Scott et al. 2017; Amadini et al. 2017).

Typically, we aim to solve problems with bounded-length strings, i.e., strings having a maximum length. These problems are mainly addressed with some variant of unfolding: a string s of length (up to) k is translated into a sequence of single-character variables s_1, ..., s_k, and string constraints are imposed in terms of these variables. However, methods which unfold eagerly face scalability problems when the upper bound on string length becomes large, and committing to absolute character positions makes it difficult to reason precisely when combining strings of non-fixed length.

Dashed strings, an alternative representation for modelling the domain of string variables through sub-sequences of uncertain length, were described in (Amadini et al. 2017). In a nutshell, given an alphabet Σ, a dashed string is a concatenation $S_1^{l_1,u_1} S_2^{l_2,u_2} \cdots S_k^{l_k,u_k}$ of k blocks, where each block $S_i^{l_i,u_i}$ represents the set of all the strings having length in $[l_i, u_i]$ and characters in $S_i \subseteq \Sigma$. Unfortunately, the dynamic programming algorithm used in (Amadini et al. 2017) for propagating equality, here referred to as COVER, suffers its own scalability troubles: its worst-case behaviour is cubic in the number of blocks, so performance degrades rapidly when dealing with long fixed strings.

In this paper, we introduce the SWEEP algorithm for propagating equality between dashed strings. In essence, the algorithm performs linear-time sweeps across a dashed string, collecting the earliest and latest positions at which a given block could be matched. The SWEEP algorithm provides weaker propagation than COVER, but has much lower worst-case complexity: as we shall see, this can dramatically improve solving performance when the number of blocks becomes large.
We implemented the SWEEP algorithm by extending G-STRINGS, a string solver introduced in (Amadini et al. 2017) which represents string variables with dashed strings, and we compared its performance against the COVER algorithm (already implemented in G-STRINGS) and other state-of-the-art approaches, namely: the CP-based string solver GECODE+S (Scott et al. 2017); the CP solvers CHUFFED, GECODE, and IZPLUS (by mapping strings into arrays of integers); and the SMT-based string solvers Z3STR3 (Berzish, Zheng, and Ganesh 2017) and CVC4 (Liang et al. 2014). To make the empirical evaluation more meaningful, we extended the benchmarks used in (Amadini et al. 2017) by generating new instances of the SQL problem introduced in (Amadini et al. 2016). Empirical results indicate that sweep-based propagation allows us to considerably outperform all the aforementioned approaches.

Dashed Strings

In this section we give some preliminary notions about dashed strings, including the COVER algorithm introduced in (Amadini et al. 2017) for equating dashed strings,¹ and the CP solver G-STRINGS, which implements COVER.

¹ In (Amadini et al. 2017) the COVER algorithm is actually called EQUATE. Here we call it COVER to differentiate it from the SWEEP algorithm, since both COVER and SWEEP are used for equating dashed strings.

Figure 1: Representation of {B,b}^{1,1}{o}^{2,4}{m}^{1,1}{!}^{0,3}.

Let us fix an alphabet Σ, a maximum string length $\ell \in \mathbb{N}$, and the universe $\mathcal{S} = \bigcup_{i=0}^{\ell} \Sigma^i$. A dashed string of length k is defined by a concatenation of $0 < k \le \ell$ blocks $S_1^{l_1,u_1} S_2^{l_2,u_2} \cdots S_k^{l_k,u_k}$, where $S_i \subseteq \Sigma$ and $0 \le l_i \le u_i \le \ell$ for $i = 1, \dots, k$, and $\sum_{i=1}^{k} l_i \le \ell$. For each block $S_i^{l_i,u_i}$, we call $S_i$ the base and $(l_i, u_i)$ the cardinality. For brevity we will sometimes write a block with singleton base $\{a\}^{l,u}$ as $a^{l,u}$. The i-th block of a dashed string X is denoted by X[i], and |X| is the number of blocks of X. We do not distinguish between blocks and dashed strings consisting of a single block. For each pair of the form C = (l, u), we define lb(C) = l and ub(C) = u.

Let $\gamma(S^{l,u}) = \{x \in S^* \mid l \le |x| \le u\}$ be the language denoted by block $S^{l,u}$. In particular, the null element $\emptyset^{0,0}$ is such that $\gamma(\emptyset^{0,0}) = \{\epsilon\}$. We extend γ to dashed strings: $\gamma(S_1^{l_1,u_1} \cdots S_k^{l_k,u_k}) = (\gamma(S_1^{l_1,u_1}) \cdots \gamma(S_k^{l_k,u_k})) \cap \mathcal{S}$. A dashed string X is known if |γ(X)| = 1, i.e., it represents a single string. A dashed string X is nullable if $\epsilon \in \gamma(X)$ or, equivalently, if each of its blocks has minimum cardinality 0. We assume normalised dashed strings, where $\emptyset^{0,0}$ can occur at most once and adjacent blocks always have distinct bases.

The size $\|S^{l,u}\|$ of a block is the number of concrete strings it denotes, i.e., $\|S^{l,u}\| = |\gamma(S^{l,u})|$. The size of a dashed string $X = S_1^{l_1,u_1} S_2^{l_2,u_2} \cdots S_k^{l_k,u_k}$ is instead an overestimate of |γ(X)|, given by $\|X\| = \prod_{i=1}^{k} \|S_i^{l_i,u_i}\|$.

Given two dashed strings X and Y we define the relation $X \sqsubseteq Y \iff \gamma(X) \subseteq \gamma(Y)$. Intuitively, ⊑ models the relation "is more precise than" between dashed strings. Unfortunately, the set of dashed strings does not form a lattice according to ⊑. This implies that some workarounds have to be used to determine a reasonable lower/upper bound of two dashed strings according to ⊑. For example, it is sometimes useful to over-approximate a dashed string $X = S_1^{l_1,u_1} S_2^{l_2,u_2} \cdots S_k^{l_k,u_k}$ with a single block. We define an upper closure operator CRUSH such that $\mathrm{CRUSH}(X) = S^{l,u}$ where $S = \bigcup_{i=1}^{k} S_i$, $l = \sum_{i=1}^{k} l_i$, and $u = \sum_{i=1}^{k} u_i$.
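To make these notions concrete, the following is a minimal Python sketch of blocks, dashed strings, block size, and the CRUSH operator as defined above. The class and function names (Block, DashedString, crush) are our own illustration, not the actual G-STRINGS data structures.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Block:
    base: FrozenSet[str]  # S: the set of admissible characters
    lb: int               # l: minimum number of characters
    ub: int               # u: maximum number of characters

    def size(self) -> int:
        # ||S^{l,u}|| = |gamma(S^{l,u})|: number of concrete strings the block denotes
        return sum(len(self.base) ** n for n in range(self.lb, self.ub + 1))

# A dashed string is a (normalised) sequence of blocks.
DashedString = List[Block]

def crush(x: DashedString) -> Block:
    """CRUSH: over-approximate a dashed string with a single block."""
    base = frozenset().union(*(b.base for b in x)) if x else frozenset()
    return Block(base, sum(b.lb for b in x), sum(b.ub for b in x))

# The dashed string of Figure 1: {B,b}^{1,1} {o}^{2,4} {m}^{1,1} {!}^{0,3}
fig1 = [Block(frozenset("Bb"), 1, 1), Block(frozenset("o"), 2, 4),
        Block(frozenset("m"), 1, 1), Block(frozenset("!"), 0, 3)]
print(crush(fig1).lb, crush(fig1).ub)  # 4 9: every denoted string has 4 to 9 characters
```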
Moreover, we also consider a restricted version of CRUSH that only crushes the blocks of X that are compatible with a set of characters T, that is, $\mathrm{CRUSH}_T(X) = \mathrm{CRUSH}([\,(S_i \cap T)^{l_i,u_i} \mid i \in \{1, \dots, k\},\ S_i \cap T \neq \emptyset\,])$.

Intuitively, we can imagine each block $S_i^{l_i,u_i}$ of $X = S_1^{l_1,u_1} S_2^{l_2,u_2} \cdots S_k^{l_k,u_k}$ as a continuous segment of length $l_i$ followed by a dashed segment of length $u_i - l_i$. The continuous segment indicates that exactly $l_i$ characters of $S_i$ must occur in each concrete string of γ(X); the dashed segment indicates that $k$ characters of $S_i$, with $0 \le k \le u_i - l_i$, may occur. Consider Fig. 1, illustrating the dashed string X = {B,b}^{1,1}{o}^{2,4}{m}^{1,1}{!}^{0,3}. Each string of γ(X) starts with B or b, followed by 2 to 4 o's, one m, then 0 to 3 !'s.

COVER Algorithm

Equating dashed strings X and Y requires determining, if possible, two dashed strings X′ and Y′ such that: (i) $X' \sqsubseteq X$ and $Y' \sqsubseteq Y$; (ii) $\gamma(X') \cap \gamma(Y') = \gamma(X) \cap \gamma(Y)$. Informally, we can see this problem as a semantic unification where we want to find a refinement of X and Y including all the strings of $\gamma(X) \cap \gamma(Y)$ and removing as many values not belonging to $\gamma(X) \cap \gamma(Y)$ as possible (note that there may not exist a greatest lower bound for X, Y according to ⊑).

In (Amadini et al. 2017) the authors propose a multi-phase algorithm for equating dashed strings. We now call it COVER because it is based on the notion of coverable blocks, where block $S^{l,u}$ is coverable by $T^{l',u'}$ if some string in $\gamma(S^{l,u})$ is a prefix of a string in $\gamma(T^{l',u'})$ (formally, if $l = 0 \lor (l \le u' \land S \cap T \neq \emptyset)$). If block B is coverable by B′, or B′ is coverable by B, then B and B′ are compatible; otherwise, they are incompatible.

The COVER algorithm is difficult to summarise: we refer the reader to (Amadini et al. 2017) for a full explanation. In a nutshell, COVER first checks if dashed strings X and Y are equatable. This is performed by a top-down dynamic programming approach, recursively matching pairs of compatible blocks. This phase returns a directed acyclic graph encoding a set of matchings, i.e., computations that successfully consume all the blocks of X and Y. Then, all the matchings are processed and merged together (often by means of the CRUSH operation) to possibly update X and Y with refined dashed strings X′ and Y′. As an example, consider X = {a..c}^{0,30}{d}^{5,5}{c..f}^{0,2} and Y = {b..d}^{26,26}{f}^{1,1}. The COVER algorithm refines X into X′ = {b,c}^{20,21}{d}^{5,5}{c,d}^{0,1}{f}^{1,1}, while Y′ = Y. Note that COVER does not aim to find an optimal refinement, since this requirement is too expensive to meet.

The main argument against COVER is its worst-case complexity $O(nm \max(n, m))$, where n = |X| and m = |Y|. Indeed, despite having good results when the number of blocks is small (regardless of the maximum length a string may have), COVER can be expensive when a string must be very long (this naturally implies a large number of blocks).
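Continuing the Python sketch above, the coverability test and the restricted crush CRUSH_T can be written as follows; again, these are illustrative helpers under our assumed Block representation, not the solver's actual code.

```python
def coverable(b: "Block", t: "Block") -> bool:
    """b is coverable by t: some string in gamma(b) is a prefix of a string in gamma(t),
       i.e. l = 0 or (l <= u' and the bases S and T intersect)."""
    return b.lb == 0 or (b.lb <= t.ub and bool(b.base & t.base))

def compatible(b1: "Block", b2: "Block") -> bool:
    return coverable(b1, b2) or coverable(b2, b1)

def crush_t(x: "DashedString", t: frozenset) -> "Block":
    """CRUSH_T: crush only the parts of x whose base intersects the character set t."""
    kept = [Block(b.base & t, b.lb, b.ub) for b in x if b.base & t]
    return crush(kept)
```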
G-STRINGS Solver

The COVER algorithm was implemented in G-STRINGS, an extension of the GECODE solver (Gecode 2017). G-STRINGS is a copying solver, i.e., during the search the domains are copied and possibly restored. In (Amadini et al. 2017) the authors demonstrate that, despite being a prototype, G-STRINGS already shows promising performance.

G-STRINGS implements the domain of every string variable x with a dashed string D(x), and defines a propagator for each string constraint. Propagators take advantage of COVER for refining the representations of the involved variables. For example, string equality x = y is simply propagated by equating D(x) and D(y) with COVER. Dashed strings naturally support concatenation: e.g., the propagator for z = x · y is implemented by equating D(z) and D(x) · D(y), where D(x) · D(y) is the concatenation of the blocks of D(x) and D(y), taking care of properly projecting the narrowing of D(x) · D(y) onto D(x) and D(y).

G-STRINGS implements the following constraints: string (dis-)equality, (half-)reified equality, (iterated) concatenation, string domain, length, reverse, and substring selection. Since propagation is in general not complete, G-STRINGS also defines a search strategy that first chooses the string variable x with the smallest domain size ‖D(x)‖ and then, if the length of x is unknown, branches on the first unknown-length block $S_i^{l_i,u_i}$ being equal to its minimal length or not (i.e., $S_i^{l_i,l_i}$ or $S_i^{l_i+1,u_i}$). Otherwise, if the first non-nullable block $S_i^{l_i,l_i}$ has length $l_i > 1$, it splits it into two fixed-length blocks $S_i^{1,1} S_i^{l_i-1,l_i-1}$. If the first non-nullable block $S_i^{l_i,l_i}$ has length 1, it branches on setting the block to its least value $a = \min(S_i)$ or not (i.e., $\{a\}$ or $S_i \setminus \{a\}$).

Sweep-based Propagation

As noted in (Amadini et al. 2017), the dynamic programming method for equality propagation performs poorly when long fixed strings are interleaved with unknown regions. In this section we present SWEEP: a more robust, though weaker, approach for enforcing equality.

The core idea of SWEEP is similar to timetable reasoning for CUMULATIVE (Aggoun and Beldiceanu 1993): to equate X and Y, for each block X[i], we wish to find the earliest and latest positions in Y where X[i] could be matched. Once these positions are computed, they are used to refine the block: roughly, X[i] may only contain content between its earliest start and latest end, and any content between the latest start and earliest end must be included in X[i]. This process is repeated symmetrically to refine each block Y[j].

Given a dashed string X, we shall refer to positions (i, o), where i refers to block X[i] and o is its offset, indicating how many characters of X[i] have already been consumed. Positive offsets denote positions relative to the beginning of X[i], and negative offsets are relative to the end of X[i].

From an initial position $p_i$ in Y, we try to match X[i] against the blocks of Y immediately following $p_i$. If we succeed, $p_i$ is the earliest start time of X[i], denoted est(X[i]), and the end of this match identifies the earliest end time of X[i], denoted eet(X[i]). However, if while doing so we reach a block Y[j] incompatible with X[i], then X[i] must necessarily be matched somewhere after Y[j].

SWEEP uses two (pairs of) complementary operations, PUSH and STRETCH. PUSH⁺(B, Y, (i, o)) attempts to find the earliest possible match of B after o characters into Y[i] (the i-th block of Y), while STRETCH⁺(B, Y, (i, o)) finds the latest position B could finish, assuming B begins at most o characters into Y[i]. There are analogous versions PUSH⁻ and STRETCH⁻ which work backwards across the blocks: they simply invert the blocks of Y and then convert the returned results back to the original offset descriptions. The algorithm for PUSH⁺ is given in Figure 2, and explained in Example 1 with a running example.

```
function PUSH⁺(S^{l,u}, Y, (i, o))
    (is, os) ← (i, o)
    k ← l
    while k > 0 ∧ i ≤ |Y| do
        T^{l′,u′} ← Y[i]
        if S ∩ T = ∅ then                    ▷ Incompatible blocks
            (i, o) ← (i + 1, 0)
            if l′ > o then                   ▷ If not nullable, restart from here
                (is, os) ← (i, o)
                k ← l
            end if
        else if k ≤ u′ - o then              ▷ Remainder fits in Y[i]
            return (is, os), (i, o + k)
        else                                 ▷ Fill Y[i] and continue
            k ← k - (u′ - o)
            (i, o) ← (i + 1, 0)
        end if
    end while
    return (is, os), (i, o)
end function
```

Figure 2: PUSH⁺ searches for the earliest feasible location to match S^{l,u} in Y, assuming it starts no earlier than (i, o). The pair (is, os) tracks the candidate start position, while (i, o) is the corresponding end.
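A direct Python transcription of PUSH⁺ may help make the position bookkeeping concrete. It reuses the illustrative Block/DashedString types from the earlier sketches, uses 0-based block indices with non-negative offsets, and tests skippability against the offset held before advancing; it is a sketch of the idea, not the G-STRINGS implementation.

```python
def push_fwd(b: "Block", y: "DashedString", pos):
    """PUSH+: earliest feasible match of block b in y, starting no earlier than pos = (i, o).
       Returns (start, end) positions, both as (block index, offset) with 0-based indices."""
    i, o = pos
    start = (i, o)
    k = b.lb                                 # mandatory characters of b still to place
    while k > 0 and i < len(y):
        t = y[i]
        if not (b.base & t.base):            # incompatible block
            skippable = t.lb <= o            # its remaining mandatory part can be nulled
            i, o = i + 1, 0
            if not skippable:                # b cannot cross y[i]: restart after it
                start = (i, o)
                k = b.lb
        elif k <= t.ub - o:                  # remainder of b fits in y[i]
            return start, (i, o + k)
        else:                                # fill y[i] and continue
            k -= t.ub - o
            i, o = i + 1, 0
    return start, (i, o)

# Example 1: matching {a,b}^{3,4} in {a}^{2,3}{c}^{1,2}{b}^{1,1}{c}^{0,2}{a}^{3,4}
# from offset 1 into the first block yields start (2, 0) and end (4, 2),
# i.e. (3, +0) and (5, +2) in the paper's 1-based numbering.
y_ex = [Block(frozenset("a"), 2, 3), Block(frozenset("c"), 1, 2), Block(frozenset("b"), 1, 1),
        Block(frozenset("c"), 0, 2), Block(frozenset("a"), 3, 4)]
print(push_fwd(Block(frozenset("ab"), 3, 4), y_ex, (0, 1)))  # ((2, 0), (4, 2))
```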
Example 1. Consider finding the earliest matching of block B = {a,b}^{3,4} into the dashed string Y = Y₁Y₂Y₃Y₄Y₅ = {a}^{2,3}{c}^{1,2}{b}^{1,1}{c}^{0,2}{a}^{3,4}, given an initial offset of 1 into Y₁, as illustrated in Figure 3. We begin trying to fit as much as possible of B into the remainder of {a}^{2,3}. However, the current block only has 2 capacity remaining, which leaves 1 character to be matched in the following blocks. However, B is incompatible with Y₂. Since B cannot be placed before Y₂, it must be placed after; so we restart at the beginning of Y₃. After consuming Y₃, we reach Y₄, another incompatible block, but one with a lower bound of 0. In this case, B may still cross Y₄ (by setting Y₄ to null), so we skip it and continue matching in Y₅. Block Y₅ consumes the remaining characters, and we terminate. From this, we conclude the earliest start for B is (3, +0), and the earliest start of the following block is at least (5, +2), corresponding to the earliest end of B. The earliest start of B also gives us the earliest end of the predecessor of B: namely (3, -1), or equivalently (2, -0).

Figure 3: Applying PUSH⁺ to find the earliest match for {a,b}^{3,4}, given an initial offset of 1 into Y₁. End-point markers show the current offset and the characters yet to be consumed; incompatible blocks are shown in grey.

As the matching may not contain gaps, the earliest start of B also determines the earliest end of its predecessor in X. So, for successive blocks in X, we may run PUSH⁺ from the earliest end of its predecessor. However, we can do better by observing that a start position is infeasible if its suffix cannot reach the end of Y. Thus, we use the STRETCH⁺ procedure to identify the earliest possible matching for the suffix. STRETCH⁺ is in some sense a dual of PUSH⁺: where PUSH⁺ attempts to squeeze as much of B into each block as possible, STRETCH⁺ consumes only the minimum amount before moving on to the next block of Y. STRETCH⁺ computes a bound on the latest end time of block B, denoted let(B), from which we may also infer the latest start time of its successor. Analogously, we also define the latest start time of block B as lst(B). Pseudo-code for STRETCH⁺ is given in Figure 4, and explained in Example 2 with a running example.

```
function STRETCH⁺(S^{l,u}, Y, (i, o))
    k ← u
    while i ≤ |Y| do
        T^{l′,u′} ← Y[i]
        if l′ ≤ o then                       ▷ Nothing to consume
            (i, o) ← (i + 1, 0)
        else if S ∩ T = ∅ then               ▷ Incompatible blocks
            return (i, o)
        else if k < l′ - o then              ▷ No more characters
            return (i, o + k)
        else                                 ▷ l′ - o characters consumed
            k ← k - (l′ - o)
            (i, o) ← (i + 1, 0)
        end if
    end while
    return (i, o)
end function
```

Figure 4: STRETCH⁺ attempts to find the latest end position of S^{l,u}, assuming it starts no later than (i, o). It consumes the minimum possible of each successive block, and stops when it finds an incompatible non-nullable block, has consumed all possible characters, or reaches the end.
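As before, a short Python transcription of STRETCH⁺ may be useful. It follows the same illustrative conventions as the PUSH⁺ sketch; here we stop as soon as the budget is exhausted, so that an exactly exhausted budget ends inside the current block, as in Example 2(b) below.

```python
def stretch_fwd(b: "Block", y: "DashedString", pos):
    """STRETCH+: latest position b could end, assuming it starts no later than pos = (i, o).
       Only the mandatory part of each block of y is consumed."""
    i, o = pos
    k = b.ub                                 # characters of b still available
    while i < len(y):
        t = y[i]
        if t.lb <= o:                        # nothing mandatory left to consume: cross it
            i, o = i + 1, 0
        elif not (b.base & t.base):          # incompatible, non-nullable: b must end here
            return (i, o)
        elif k <= t.lb - o:                  # budget exhausted inside y[i]
            return (i, o + k)
        else:                                # consume the mandatory t.lb - o characters
            k -= t.lb - o
            i, o = i + 1, 0
    return (i, o)

# Example 2(a): starting {a,b}^{3,4} at (0, 1) stops at (1, 0), the beginning of {c}^{1,2}.
# Example 2(b): starting at (1, 1), inside {c}^{1,2} but past its lower bound, reaches (4, 3).
print(stretch_fwd(Block(frozenset("ab"), 3, 4), y_ex, (0, 1)))  # (1, 0)
print(stretch_fwd(Block(frozenset("ab"), 3, 4), y_ex, (1, 1)))  # (4, 3)
```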
Example 2. Recall the dashed string Y from Example 1. To find initial bounds on the latest ends of blocks in X, we run STRETCH⁺ progressively from the beginning of Y. Consider case (a) shown in Figure 5, where the predecessors of the current block B = {a,b}^{3,4} can only reach position (1, +1). We have 4 characters available, and start walking along successive blocks of Y, decreasing our budget by the lower bound (i.e., the mandatory characters) of each block we cross. As we start with offset +1 into Y₁, its effective lower bound is 2 - 1 = 1, so we reach the beginning of Y₂ with 3 characters remaining. However, Y₂ is incompatible with {a, b} and has a non-zero lower bound. In this case we terminate immediately, concluding that B must end before the beginning of Y₂ and, accordingly, the successor of B must begin no later than (2, +0). For case (b), we start inside an incompatible block. However, our starting offset is already at the lower bound; thus its effective lower bound is 0, and we may continue unopposed. We continue as before, reaching Y₄ with 3 characters available. Y₄ is again incompatible with {a, b}, but it is nullable and can be skipped. Our remaining budget is consumed in Y₅, yielding a final position of (5, +3). Note that while (5, +3) remains inside Y, it nevertheless denotes a feasible end position, as the remainder of the dashed string is potentially null.

Figure 5: Applying STRETCH⁺ on {a,b}^{3,4} from two possible start positions in Y. End-point markers show the current offset and the characters still available.

Running STRETCH⁻ from the end of Y allows us to compute initial (lower) bounds for the earliest starts of the X-blocks. If the corresponding earliest start of the first block X[1] is strictly later than the first position (1, +0), then we have detected infeasibility: X and Y cannot be equated. The overall algorithm for computing the earliest starts of the X-blocks in Y is given in Figure 6. PROP-EST runs in O(m + n) time (assuming set operations take constant time). The algorithm for computing the latest ends is analogous, swapping the sign (i.e., direction) of STRETCH and PUSH.

```
function PROP-EST([X₁, …, Xₙ], Y = [Y₁, …, Yₘ])
    end ← (m, +ub(Yₘ))
    for i ∈ {n, n-1, …, 1} do                ▷ Scanning backwards
        end ← STRETCH⁻(Xᵢ, Y, end)
        est(Xᵢ) ← end
    end for
    if est(X₁) > (1, +0) then
        return FAIL                          ▷ Suffix cannot reach
    end if
    end ← (1, +0)
    for i ∈ {1, …, n} do                     ▷ Scanning forwards
        start ← MAX(end, est(Xᵢ))
        start, end ← PUSH⁺(Xᵢ, Y, start)
        if start > est(Xᵢ) then
            est(Xᵢ) ← start
        end if
    end for
    if est(Xₙ) > (m, +ub(Yₘ)) then
        return FAIL                          ▷ Prefix cannot fit
    end if
    return est                               ▷ est is the array of earliest starts
end function
```

Figure 6: Algorithm for computing the earliest start position in Y of each X-block. We run STRETCH⁻ backwards (from the last block) to compute initial starting positions, then run PUSH⁺ forward to eliminate infeasible positions.
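As a simplified illustration of how the matchings are chained, the sketch below runs only the forward pass of the PROP-EST algorithm of Figure 6: it applies PUSH⁺ to each block of X starting from the earliest end of its predecessor. The backward STRETCH⁻ initialisation (and hence the early suffix-feasibility check) is omitted; it reuses push_fwd from the earlier sketch.

```python
def prop_est_forward(x: "DashedString", y: "DashedString"):
    """Forward pass only: earliest start of each block of x when matched in y.
       Returns the list of earliest starts, or None if the match runs past the end of y."""
    est = []
    end = (0, 0)                       # earliest end of the (empty) predecessor
    for b in x:
        start, end = push_fwd(b, y, end)
        est.append(start)
    if est and end[0] >= len(y):       # last block pushed past the end: prefix cannot fit
        return None
    return est

# e.g. prop_est_forward([Block(frozenset("ab"), 3, 4)], y_ex) == [(0, 0)]
```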
Filtering from matches

Given the earliest and latest matchings for a block B = S^{l,u}, we then attempt to tighten the base S and the cardinality (l, u). The range between est(B) and let(B) is the feasible region: B must be fully contained in this region. Similarly, the content between lst(B) and eet(B) forms a sub-region of the feasible region that we call the mandatory region. From the feasible region, we can tighten the upper bound on the cardinality and the base characters. From the mandatory region, we can update the lower bound on the block cardinality. We cannot generally refine feasible values using the mandatory region, as a base expresses only possible, rather than necessary, characters.

Consider the ranges in Figure 7, showing the earliest and latest matchings of a block S^{l,u} into a dashed string $Y = Y_1^{C_1} Y_2^{C_2} Y_3^{C_3} Y_4^{C_4} Y_5^{C_5}$. The mandatory region starts one character into Y₂ and reaches the end of Y₃: the minimum cardinality l is at least $l' = \max(0, lb(C_2) - 1) + lb(C_3)$. We can use similar reasoning over the feasible region to prune the upper bound of the cardinality, but we need to consider only those blocks which are compatible with S^{l,u}. Similarly, we may intersect S with the union S′ of all the bases of the blocks Yⱼ in the feasible region.

Figure 7: Filtering a block using matching bounds. Feasible bounds (est and let) are indicated with dashed lines, labelled with the offset into the specified block. Mandatory bounds (lst and eet) are indicated by solid lines.

But where the dashed strings contain many nullable blocks, the feasible region of each Xᵢ may cover all of Y (see Fig. 8). In such cases, performing a complete sweep of the feasible region for each block is too expensive. Instead, we maintain a cache of previously processed feasible regions; before scanning each region, we first check the cache and, if present, re-use the previous result.

Figure 8: Matching two long sequences of possibly null blocks $Y_1^{0,1} Y_2^{0,1} \cdots Y_{n-1}^{0,1} Y_n^{0,1}$. As all X- and Y-blocks are nullable, the est and let of the X-blocks are never forced to shift. As such, processing the feasible region of Xᵢ requires scanning all Y-blocks.

Rather than merely pruning a block B, we may instead replace it with the sequence of blocks constituting the feasible region. However, as explained in Example 3, we must take care when doing so not to lose any cardinality information; otherwise, our propagation may inadvertently increase the domain size.

Example 3. Consider matching B = {a,b,c}^{0,4} against a dashed string Y, and finding the corresponding match to be Z = {a,c}^{0,2}{b,c}^{3,4} (with mandatory and feasible regions coinciding). Replacing B with Z is tempting, as it provides more precise information about the sequencing of characters. However, the maximum cardinality of Z is greater than that of B, so there are strings in γ(Z) which were not in γ(B). In fact, in this case the search space would be larger after the replacement (e.g., ccccc ∈ γ(Z) \ γ(B)). Even if this were not the case, non-monotonicity may cause problems: e.g., constraints which were found to be satisfied (and thus safe to deactivate) may no longer be so after this update.

We adopt a simple strategy to avoid this behaviour when refining B = S^{l,u}. Let [l′, u′] be the lower and upper cardinality bounds, derived respectively from the mandatory and feasible regions, and S′ the union of the bases of the blocks of the feasible region. If the derived bounds would lose cardinality information (i.e., l′ < l or u′ > u), or the mandatory region is empty (that is, eet(B) < lst(B)), we restrict ourselves to refining B, as we shall see later, without any replacement. Otherwise, it is safe to replace B with the matching range from Y (setting the lower bound of optional blocks to 0). However, this is not necessarily beneficial: replacing S^{0,10} with S^{0,1} ⋯ S^{0,1} preserves the set of models, but increases the search space by introducing symmetries.
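A minimal sketch of the pruning step just described, assuming the mandatory and feasible regions of B have already been extracted from the est/eet/lst/let bounds and clipped at their end offsets (so a block only partially inside a region appears with reduced cardinality); the names and the region representation are our own illustration.

```python
def refine_block(b: "Block", mandatory: "DashedString", feasible: "DashedString"):
    """Tighten the cardinality and base of b from its mandatory and feasible regions.
       Returns the refined block, or None if the bounds become inconsistent."""
    l_lo = sum(t.lb for t in mandatory)                        # characters b must cover
    u_hi = sum(t.ub for t in feasible if b.base & t.base)      # only compatible blocks count
    union = frozenset().union(*(t.base for t in feasible)) if feasible else frozenset()
    base = b.base & union
    new_lb = max(b.lb, l_lo)
    new_ub = min(b.ub, u_hi)
    if not base:
        new_ub = 0                                             # no admissible characters left
    if new_lb > new_ub:
        return None                                            # X and Y cannot be equated
    return Block(base, new_lb, new_ub)
```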
We therefore split the feasible region into three parts: the mandatory region M = M₁ ⋯ Mₖ, and a (possibly empty) prefix and suffix on either side. From M we discard all blocks incompatible with B, yielding M_B; we then use CRUSH_S to make the prefix and the suffix into single blocks P and Q (compatible with B), finally replacing B with P · M_B · Q.

Figure 9: Refining B = {a,b}^{3,4}, given several possible matchings (showing, for each case, the match positions, the result after CRUSH, and the result after CRUSH_{a,b}).

Example 4. Recall the earliest matching computed in Example 1. Figure 9 illustrates several possible completions of this matching. In each case, we show two versions of the possible matching: using CRUSH, and the more precise CRUSH_{a,b}. In case (a), the computed matches are too weak to perform any filtering. In case (b), computing the match cardinality without considering B yields an upper bound of 5, which is too weak to allow filtering. However, discarding blocks incompatible with B refines this to {b}^{0,1}{a}^{1,2}, which allows us to reduce the maximum cardinality of B to 3. In case (c), the cardinality of the matching is at least as precise as that of the original block. In this case, we replace B with the corresponding matching {b}^{1,1}{a}^{2,3}.

Note that this procedure does not exploit all possibilities for propagation. In case (b), since the maximum cardinality is now 3, we could replace block {a,b}^{3,4} by {b}^{1,1}{a}^{2,2}.

We implemented the SWEEP algorithm as an extension of the G-STRINGS solver. To distinguish this version of G-STRINGS from that introduced in (Amadini et al. 2017), which uses the COVER algorithm for propagating equality, we use the notation G-SWEEP and G-COVER respectively.

We first evaluated the performance of the G-COVER and G-SWEEP approaches by using the same benchmarks as (Amadini et al. 2017), consisting of six problems, namely: anbn, ChunkSplit, HammingDistance, Levenshtein, SQLInjection, and StringReplace. For each of these problems we derived 5 instances by varying the maximum string length ℓ ∈ {250, 500, 1000, 5000, 10000}. In addition to G-COVER and G-SWEEP, we also evaluated state-of-the-art string solvers, namely the SMT solvers CVC4 (Liang et al. 2014) and Z3STR3 (Berzish, Zheng, and Ganesh 2017),² and the CP solver GECODE+S (Scott et al. 2017). Furthermore, thanks to the MiniZinc translation to integers defined in (Amadini et al. 2016), which statically maps string variables into arrays of integer variables, we were able to evaluate state-of-the-art constraint solvers over integers, namely: GECODE (Gecode 2017), a CP solver on finite domains; CHUFFED (Chu 2011), a CP solver with lazy clause generation (Ohrimenko, Stuckey, and Codish 2009); and IZPLUS (Fujiwara 2016), a CP solver that also exploits local search.

² We used the last stable releases, Z3STR3 1.0.0 and CVC4 1.5.

Average solving times over the (Amadini et al. 2017) benchmark are shown in Table 1 (the solving time is set to the timeout T = 600 seconds if a solver cannot solve a problem).³ We ignore the model construction time, which for CP solvers using MiniZinc translations (the first three rows in the table) can be too expensive. On the left side of the table, we report the average times by varying the maximum string length ℓ. Clearly G-SWEEP has the best performance, giving an instantaneous answer for all the problems regardless of ℓ. G-COVER still provides good results, although performance deteriorates as ℓ grows. The CP solver GECODE+S follows the same trend, but with worse performance.

³ We ran the experiments on Ubuntu 15.10 machines with 16 GB of RAM and a 2.60 GHz Intel i7 CPU. The source code is publicly available at (G-Strings 2017).
Although the performance of the SMT solvers CVC4 and Z3STR3 is rather independent of ℓ, their effectiveness seems to be worse than that of the CP string approaches. Unsurprisingly, CP solvers that statically translate into arrays of integer variables are not very effective. This is particularly true when ℓ becomes big. The right side of the table, where results are averaged for each problem class, better explains the reasons for such performance.

Table 1: Average results in seconds. Unsatisfiable problems are marked with *.

| Solver | ℓ=250 | ℓ=500 | ℓ=1000 | ℓ=5000 | ℓ=10000 | anbn* | Chunk. | Leven. | Hamm.* | SQL. | St.Rep. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CHUFFED | 1.76 | 94.46 | 208.01 | 312.47 | 415.04 | 117.62 | 361.59 | 17.8 | 18.8 | 471.77 | 250.51 |
| GECODE | 0.63 | 21.04 | 89.07 | 403.14 | 406.38 | 269.57 | 243.14 | 6.21 | 5.59 | 335.25 | 244.55 |
| IZPLUS | 22.55 | 104.74 | 106.8 | 440.22 | 460.57 | 244.34 | 243.49 | 17.81 | 186.64 | 505.9 | 163.68 |
| Z3STR3 | 200.83 | 202.54 | 200.78 | 200.65 | 204.29 | 600.0 | 0.28 | 3.24 | 0.01 | 600.0 | 7.38 |
| CVC4 | 100.05 | 100.04 | 100.05 | 100.04 | 100.04 | 0.02 | 0.2 | 0.01 | 0.01 | 600.0 | 0.01 |
| GECODE+S | 0.29 | 2.9 | 35.66 | 269.22 | 371.24 | 246.23 | 279.78 | 28.28 | 0.0 | 66.1 | 194.79 |
| G-COVER | 0.01 | 1.94 | 5.63 | 5.22 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 135.35 | 0.0 |
| G-SWEEP | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 |

Both G-COVER and G-SWEEP are instantaneous on all the problems from the NORN (Abdulla et al. 2015) benchmarks, i.e., all the problems except SQLInjection. This problem was introduced in (Amadini et al. 2016) and basically consists in solving the string equation ω = p · e · b₁ · "=" · b₂ · e · s, where:

- ω is a randomly generated fixed string of parametric length ℓ, meant to be an input string;
- p, s are string variables representing the prefix and the suffix of the right-hand side of the equation;
- "=" is a fixed string of length 1, representing the character =;
- b₁, b₂ are string variables representing (possibly empty) sequences of white spaces;
- e is a string variable representing a non-empty expression.

As the name suggests, solving this problem means finding, if one exists, a potentially dangerous pattern in an SQL query (e.g., the classic string "OR 0=0" appended at the end of the query). Note that this problem cannot be solved with regular expressions, since the language denoted by p · e · b₁ · "=" · b₂ · e · s is not even context-free.
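To illustrate what the equation asks for, here is a hypothetical brute-force check of the SQLInjection pattern on a concrete query string; for simplicity it takes b₁ and b₂ to be the maximal whitespace runs around the "=" character. The CP model instead poses the equation directly over string variables, so this snippet is only meant to clarify the problem, not to describe the solving approach.

```python
def has_injection_pattern(omega: str) -> bool:
    """Does omega decompose as p + e + b1 + "=" + b2 + e + s,
       with e non-empty and b1, b2 (possibly empty) runs of whitespace?"""
    for k, ch in enumerate(omega):
        if ch != '=':
            continue
        i = k                                   # left end of b1
        while i > 0 and omega[i - 1].isspace():
            i -= 1
        j = k + 1                               # right end of b2
        while j < len(omega) and omega[j].isspace():
            j += 1
        for start in range(i):                  # try every non-empty e ending at i
            if omega.startswith(omega[start:i], j):
                return True                     # the same expression e reappears after "="
    return False

print(has_injection_pattern("SELECT * FROM t WHERE x = 1 OR 0=0"))  # True (e.g. e = "0")
```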
The main difficulty of the SQLInjection problem consists in dealing with a potentially very long fixed string ω, which means that ω is represented by a long sequence of known blocks. As previously mentioned, this is the worst-case scenario for COVER, and indeed its performance can severely degrade when ℓ = |ω| grows. Consequently, for this problem COVER performs worse than both SWEEP and GECODE+S. This behaviour was also observed in (Amadini et al. 2017). Conversely, SWEEP performs very well even for this problem, since sweep-based propagation is able to scan the blocks of the dashed strings in linear time.

Unfortunately, as also pointed out in (Amadini et al. 2017; 2016; Scott et al. 2017), there is a lack of challenging string benchmarks. As can be seen from Table 1, the most difficult problem is SQLInjection. We therefore randomly generated 200 harder instances of this problem; in particular, for each ℓ ∈ {250, 500, 1000, 5000, 10000} we generated 20 satisfiable and 20 unsatisfiable instances. Results are shown in Table 2 (average solving time, with timeout T = 600 seconds) and Table 3 (percentage of solved instances), where we only compare the three best solvers from the first experiment.

Table 2: Average solving time, in seconds, on the new SQL benchmarks, split by satisfiable (SAT), unsatisfiable (UNS), and all (ALL) instances.

| Solver | Instances | ℓ=250 | ℓ=500 | ℓ=1000 | ℓ=5000 | ℓ=10000 |
|---|---|---|---|---|---|---|
| GECODE+S | SAT | 270.09 | 251.75 | 356.96 | 600.0 | 600.0 |
| GECODE+S | UNS | 0.17 | 3.41 | 130.69 | 600.0 | 600.0 |
| GECODE+S | ALL | 135.13 | 127.58 | 243.82 | 600.0 | 600.0 |
| G-COVER | SAT | 61.28 | 285.34 | 469.96 | 600.0 | 600.0 |
| G-COVER | UNS | 306.32 | 600.0 | 600.0 | 600.0 | 600.0 |
| G-COVER | ALL | 183.8 | 442.67 | 534.98 | 600.0 | 600.0 |
| G-SWEEP | SAT | 0.01 | 0.08 | 0.05 | 0.52 | 1.05 |
| G-SWEEP | UNS | 0.04 | 0.35 | 1.84 | 69.58 | 327.92 |
| G-SWEEP | ALL | 0.03 | 0.21 | 0.95 | 35.05 | 164.49 |

Table 3: Percentage of solved instances on the new SQL benchmarks, split by satisfiable (SAT), unsatisfiable (UNS), and all (ALL) instances.

| Solver | Instances | ℓ=250 | ℓ=500 | ℓ=1000 | ℓ=5000 | ℓ=10000 |
|---|---|---|---|---|---|---|
| GECODE+S | SAT | 55.0 | 60.0 | 45.0 | 0.0 | 0.0 |
| GECODE+S | UNS | 100.0 | 100.0 | 90.0 | 0.0 | 0.0 |
| GECODE+S | ALL | 77.5 | 80.0 | 67.5 | 0.0 | 0.0 |
| G-COVER | SAT | 90.0 | 55.0 | 25.0 | 0.0 | 0.0 |
| G-COVER | UNS | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| G-COVER | ALL | 70.0 | 27.5 | 12.5 | 0.0 | 0.0 |
| G-SWEEP | SAT | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| G-SWEEP | UNS | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| G-SWEEP | ALL | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |

These harder problems confirm the results of the initial experiments. Since these problems involve long strings, GECODE+S is actually better than G-COVER, particularly on unsatisfiable instances. But G-SWEEP is clearly superior to both, particularly on satisfiable instances.

Conclusions

We examined the problem of string constraint solving over bounded-length string variables. We started from a recently introduced Constraint Programming approach, based on a restricted class of regular expressions called dashed strings. We improved on previous work by defining SWEEP, a new sweep-based approach for propagating equality between dashed strings. The strength of SWEEP does not reside in the amount of pruning it performs. Its effectiveness stems from the fact that it can check satisfiability in O(n + m) time and, while filtering is more complex, in practice it is also linear. This allows the approach to scale to dashed strings that consist of a large number of blocks. The SWEEP algorithm, which we implemented as an extension of the G-STRINGS solver, substantially outperforms other state-of-the-art approaches for string constraint solving, whether based on dashed strings or not.

There are many interesting directions for future research. Extending the set of constraints supported by the string solver to include constraints such as regular, global cardinality, and lexicographic constraints is clearly of interest. Exploring new kinds of search strategies is also important because of the potentially huge search spaces that arise in string problems. Also challenging is to add explanations to dashed string constraint solving in order to make use of nogood learning approaches.

Acknowledgments

This work is supported by the Australian Research Council (ARC) through Linkage Project Grant LP140100437 and Discovery Early Career Researcher Award DE160100568.

References

Abdulla, P. A.; Atig, M. F.; Chen, Y.; Holík, L.; Rezine, A.; Rümmer, P.; and Stenman, J. 2015. Norn: An SMT solver for string constraints. In CAV, volume 9206 of LNCS, 462–469. Springer.

Aggoun, A., and Beldiceanu, N. 1993. Extending CHIP in order to solve complex scheduling and placement problems. Mathematical and Computer Modelling 17(7):57–73.

Amadini, R.; Flener, P.; Pearson, J.; Scott, J. D.; Stuckey, P. J.; and Tack, G. 2016. MiniZinc with strings. In LOPSTR 2016, 41–57.

Amadini, R.; Gange, G.; Stuckey, P. J.; and Tack, G. 2017. A novel approach to string constraint solving. In CP, Lecture Notes in Computer Science, 3–20. Springer.
Berzish, M.; Zheng, Y.; and Ganesh, V. 2017. Z3str3: A string solver with theory-aware branching. CoRR abs/1704.07935.

Bisht, P.; Hinrichs, T. L.; Skrupsky, N.; and Venkatakrishnan, V. N. 2011. WAPTEC: Whitebox analysis of web applications for parameter tampering exploit construction. In CCS, 575–586. ACM.

Bjørner, N.; Tillmann, N.; and Voronkov, A. 2009. Path feasibility analysis for string-manipulating programs. In TACAS, volume 5505 of LNCS, 307–321. Springer.

Chu, G. 2011. Improving Combinatorial Optimization. Ph.D. Dissertation, Department of Computing and Information Systems, University of Melbourne.

Emmi, M.; Majumdar, R.; and Sen, K. 2007. Dynamic test input generation for database applications. In ISSTA, 151–162. ACM.

Fujiwara, T. 2016. iZplus. http://www.minizinc.org/challenge2016/description_izplus.txt. Accessed Sept 2017.

G-Strings. 2017. Source code. https://bitbucket.org/robama/g-strings. Accessed November 2017.

Gange, G.; Navas, J. A.; Stuckey, P. J.; Søndergaard, H.; and Schachte, P. 2013. Unbounded model-checking with interpolation for regular language constraints. In TACAS, volume 7795 of LNCS, 277–291. Springer.

Gecode. 2017. Source code. www.gecode.org. Accessed Sept 2017.

Hooimeijer, P., and Weimer, W. 2012. StrSolve: Solving string constraints lazily. Automated Software Engineering 19(4):531–559.

Kiezun, A.; Ganesh, V.; Artzi, S.; Guo, P. J.; Hooimeijer, P.; and Ernst, M. D. 2012. HAMPI: A solver for word equations over strings, regular expressions, and context-free grammars. ACM Trans. Softw. Eng. Methodol. 21(4):25:1–25:28.

Li, G., and Ghosh, I. 2013. PASS: String solving with parameterized array and interval automaton. In HVC, volume 8244 of LNCS, 15–31. Springer.

Liang, T.; Reynolds, A.; Tinelli, C.; Barrett, C.; and Deters, M. 2014. A DPLL(T) theory solver for a theory of strings and regular expressions. In CAV, volume 8559 of LNCS, 646–662. Springer.

Ohrimenko, O.; Stuckey, P. J.; and Codish, M. 2009. Propagation via lazy clause generation. Constraints 14(3):357–391.

Saxena, P.; Akhawe, D.; Hanna, S.; Mao, F.; McCamant, S.; and Song, D. 2010. A symbolic execution framework for JavaScript. In S&P, 513–528. IEEE Computer Society.

Scott, J. D.; Flener, P.; Pearson, J.; and Schulte, C. 2017. Design and implementation of bounded-length sequence variables. In CPAIOR, volume 10335 of LNCS, 51–67. Springer.

Tateishi, T.; Pistoia, M.; and Tripp, O. 2013. Path- and index-sensitive string analysis based on monadic second-order logic. ACM Trans. Softw. Eng. Methodol. 22(4):33:1–33:33.

Thomé, J.; Shar, L. K.; Bianculli, D.; and Briand, L. C. 2017. Search-driven string constraint solving for vulnerability detection. In ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, 198–208.

Trinh, M.; Chu, D.; and Jaffar, J. 2014. S3: A symbolic string solver for vulnerability detection in web applications. In SIGSAC, 1232–1243. ACM.

Yu, F.; Alkhalaf, M.; and Bultan, T. 2010. Stranger: An automata-based string analysis tool for PHP. In TACAS, volume 6015 of LNCS, 154–157. Springer.