# Compiling Combinatorial Prediction Games

Frederic Koriche (CRIL, CNRS UMR 8188, Université d'Artois, France). Correspondence to: Frederic Koriche. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

In online optimization, the goal is to iteratively choose solutions from a decision space, so as to minimize the average cost over time. As long as this decision space is described by combinatorial constraints, the problem is generally intractable. In this paper, we consider the paradigm of compiling the set of combinatorial constraints into a deterministic and Decomposable Negation Normal Form (d-DNNF) circuit, for which the tasks of linear optimization and solution sampling take linear time. Based on this framework, we provide efficient characterizations of existing combinatorial prediction strategies, with a particular attention to mirror descent techniques. These strategies are compared on several real-world benchmarks for which the set of Boolean constraints is preliminarily compiled into a d-DNNF circuit.

1. Introduction

Combinatorial optimization is an important topic of computer science and discrete mathematics, with a wide spectrum of applications ranging from resource allocation and job scheduling to automated planning and configuration software. A common problem is to minimize a modular loss function $\ell$ over a discrete space $S \subseteq \{0,1\}^d$ of feasible solutions, represented in a concise manner by a set of combinatorial constraints. In the offline version of this problem, all information necessary to define the optimization task is available beforehand, and the challenge is to develop algorithms which are provably or practically better than enumerating all feasible solutions. Contrastingly, in the online version of this problem (Audibert et al., 2014), the objective function $\ell$ is subject to change over time. The challenge here is more acute, since the optimization algorithm is required to perform repeated choices on $S$ so as to minimize their average cost in the long run.

Conceptually, an online combinatorial optimization problem can be cast as a repeated prediction game between a learning algorithm and its environment (Audibert et al., 2011; 2014). During each trial $t$, the learner chooses a feasible solution $s_t$ from its decision set $S$ and, simultaneously, the environment selects a loss vector $\ell_t \in [0,1]^d$. Then, the learner incurs the loss $\langle \ell_t, s_t\rangle = \sum_{i=1}^d \ell_t(i)\,s_t(i)$ and, in light of the feedback provided by its environment, updates its strategy in order to improve the chance of selecting better solutions on subsequent trials.

Several classes of combinatorial prediction games can be distinguished, depending on the type of decision set and the type of observed feedback. In this paper, we focus on full-information games, in which it is assumed that the feedback supplied at trial $t$ by the environment is the entire vector $\ell_t$. On the other hand, we make very few assumptions about the decision set: $S$ may be described by an arbitrary SAT formula, that is, any set of combinatorial constraints representable by Boolean clauses. As SAT encodings of discrete solution spaces are frequently used in academic and industrial applications (Biere et al., 2009), our setting covers an important class of combinatorial prediction games.

As usual, the performance of an online learning algorithm is measured according to two metrics.
The first, called regret, measures the difference in cumulative loss between the algorithm and the best solution in hindsight. In this study, we make no assumption about the sequence of loss vectors; in particular, $\ell_t$ may depend on the previous decisions $s_1, \ldots, s_{t-1}$ made by the learner. In such non-oblivious or adversarial environments, the learner is generally allowed to make decisions in a randomized way, and its predictive performance is measured by the expected regret:

$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \langle \ell_t, s_t\rangle\right] - \min_{s \in S} \sum_{t=1}^{T} \langle \ell_t, s\rangle$$

The second metric is computational complexity, i.e. the amount of resources required to compute $s_t$ at each round $t$, given the sequence of feedbacks observed so far.

Related Work. In the literature of combinatorial prediction games, three main strategies have been proposed to attain an expected regret that is sublinear in the game horizon $T$ and polynomial in the input dimension $d$. The first, and arguably simplest, strategy is to Follow the Perturbed Leader (FPL): on each trial $t$, the learner draws at random a perturbation vector $z_t \in \mathbb{R}^d$, and then selects in $S$ a minimizer of $\eta L_t + z_t$, where $\eta \in (0,1]$ is a step-size parameter and $L_t$ is the cumulative loss $L_t = \ell_1 + \cdots + \ell_{t-1}$. Based on the pioneering work of Hannan (1957), refined in (Hutter & Poland, 2005; Kalai & Vempala, 2005), the FPL algorithm achieves an expected regret of $O(d^{3/2}\sqrt{T})$.

The second strategy is based on the popular exponentially weighted average forecaster in the framework of prediction with expert advice (Cesa-Bianchi & Lugosi, 2006). The overall idea is to maintain a weight for each feasible solution $s \in S$, which decays exponentially according to the estimated cumulative loss of $s$. Specifically, on each trial $t$, the learner draws a solution $s_t \in S$ at random from the exponential family $p_t(s) \propto \exp(-\eta\,\langle L_t, s\rangle)$. This strategy, referred to as Expanded Hedge (EH) in (Koolen et al., 2010), attains an expected regret of $O(d^{3/2}\sqrt{T})$.

Finally, the third strategy is to Follow the Regularized Leader, a paradigm often advocated in online convex optimization (Hazan, 2016). Here, the learner operates on the convex hull of $S$, denoted $\mathrm{conv}(S)$. On each trial $t$, the learner starts by choosing a point $p_t \in \mathrm{conv}(S)$ that minimizes $\eta\,\langle \hat{L}_t, p\rangle + F(p)$, where $F$ is a regularization function. Next, $p_t$ is decomposed as a convex combination of feasible solutions in $S$, and then a decision $s_t$ is picked at random according to the resulting distribution. For modular loss functions, this strategy is equivalent to the Online Stochastic Mirror Descent (OSMD) algorithm (Audibert et al., 2014; Rajkumar & Agarwal, 2014), which iteratively performs a gradient descent in the dual space of $\mathrm{conv}(S)$ under $F$, and projects back to the primal space according to the Bregman divergence defined from $F$. Notably, when $F$ is the Euclidean regularizer, OSMD coincides with the popular stochastic gradient descent (SGD) algorithm (Robbins & Monro, 1951). Alternatively, when $F$ is the entropic regularizer, OSMD corresponds to the Component Hedge (CH) algorithm (Koolen et al., 2010), which achieves an optimal expected regret of $O(d\sqrt{T})$.

From the viewpoint of regret, the results outlined above indicate that few improvements remain to be made in full-information games. However, we get a different picture if computational considerations are taken into account: all aforementioned algorithms rely on powerful oracles for making decisions in spaces $S$ represented by combinatorial constraints.
Namely, the EH algorithm is required, at each iteration, to sample a solution according to an exponential family over $S$, a problem which is generally #P-hard (Dyer et al., 2009). Similarly, the FPL strategy has to repeatedly solve a linear optimization task over $S$, which is generally NP-hard (Creignou et al., 2001). For the OSMD algorithm, and its specializations SGD and CH, the computational issue is exacerbated by the fact that, even if the learner has access to a linear optimization oracle, it still has to perform, at each trial, a Bregman projection step for which the best known algorithms run in $O(d^6)$ time (Suehiro et al., 2012).

Although combinatorial prediction games are generally intractable, efficient implementations of sampling and optimization oracles may be obtained for several decision sets $S$. For example, when the feasible solutions in $S$ coincide with the bases of a binary matroid, or the perfect matchings of a bipartite graph, linear optimization can be performed in polynomial time, and tractable forms of FPL and OSMD may be derived (Helmbold & Warmuth, 2009; Koolen et al., 2010; Takimoto & Hatano, 2013; Rajkumar & Agarwal, 2014). On the other hand, when the feasible solutions in $S$ correspond to the paths or multi-paths of a rooted Directed Acyclic Graph (DAG), the sampling oracle may be implemented by the weight-pushing technique (Mohri, 1998), which recursively evaluates the partition function of an exponential family over the edges of the input DAG. Based on this technique, tractable forms of EH can be derived (Takimoto & Warmuth, 2003; Rahmanian & Warmuth, 2017).

Our Results. Viewing feasible solutions as paths in a DAG is only one of many abstractions that have been proposed in the literature of circuit complexity for representing combinatorial spaces. In the related field of knowledge compilation (Darwiche & Marquis, 2002), various classes of Boolean circuits have been identified, each associated with a set of inference tasks which can be performed in polynomial time. These theoretical results naturally motivate the following question: can we compile a set of constraints representing a combinatorial space $S$ into a compact Boolean circuit for which both solution sampling and linear optimization are tractable? By viewing the compilation process as a pre-processing step, we may get for free efficient implementations of sampling and optimization oracles, provided that the size of the resulting circuit is not too large.

The present study aims at solving combinatorial prediction games by compiling decision sets into deterministic Decomposable Negation Normal Form (d-DNNF) circuits (Darwiche, 2001). This class comes with generic compilers which take as input a SAT formula representing a decision set $S$, and return a d-DNNF circuit $C$ that encodes $S$ (Darwiche, 2002; Lagniez & Marquis, 2017). Although the size of $C$ may grow exponentially in the treewidth of the input formula, it is usually much smaller in practice; existing compilers are able to compress combinatorial spaces defined over thousands of variables and constraints.
With these compilation tools in hand, our contributions are threefold: (i) we show that, for d-DNNF circuits, the sampling oracle in EH and the linear optimization oracle in FPL run in linear time, using a simple variant of the weight-pushing technique; (ii) for the SGD and CH strategies, we develop a Bregman projection-decomposition method that uses $O(d^2 \ln(dT))$ calls to the linear optimization oracle; (iii) we show experimentally, on online configuration and planning tasks, that EH and FPL are fast, but our variants of SGD and CH are more efficient at minimizing the empirical regret.

Before proceeding to the core of the paper, we emphasize that the compilation approach to online optimization is not entirely new. Recently, Sakaue et al. (2018) used the class of Ordered Binary Decision Diagrams (OBDDs) (Bryant, 1986) for implementing the EWA forecaster in combinatorial bandits. In their setting, $S$ is described by a graph over $d$ edges, together with a constraint specifying the type of objects we desire (e.g. paths or cliques). By contrast, our study assumes that $S$ is described by an arbitrary set of Boolean constraints. So, both studies are targeting different classes of combinatorial prediction games. Moreover, it is known that d-DNNF is strictly more succinct than OBDD (Darwiche & Marquis, 2002). Namely, any OBDD can be transformed in linear time and space into an equivalent d-DNNF circuit, but the converse is not true: d-DNNF includes simple circuits which require an exponential-size representation in OBDD. In fact, the key point of compiling combinatorial prediction games is to use languages that are both tractable and succinct, allowing prediction strategies to be efficient on a wide variety of combinatorial domains.

2. Tractable Inference via Compilation

For the combinatorial prediction games considered in this paper, we assume that the input decision space $S$ is defined from a set of $n$ binary-valued attributes, and we use $X = \{x_1, \ldots, x_d\}$, where $d = 2n$, to denote the set of all attribute-value pairs, called literals. A solution is a vector $s \in \{0,1\}^d$ such that $s(i) + s(j) = 1$ for every pair of distinct literals $x_i, x_j \in X$ defined on the same attribute. Thus, $\|s\|_1 = n$ for any feasible solution $s \in S$.

An NNF circuit over $X$ is a rooted DAG whose internal nodes are labeled by $\vee$ (or-node) or $\wedge$ (and-node), and whose leaves are labeled by either a literal in $X$ or a constant in $\{0,1\}$. The size of $C$, denoted $|C|$, is given by the number of its edges. The set of attributes occurring in the subgraph of $C$ rooted at some node $c$ is denoted $\mathrm{att}(c)$. For the sake of clarity, we assume that any NNF circuit $C$ satisfies two basic properties, namely (i) any internal node $c$ in $C$ has exactly two children, denoted $c_l$ and $c_r$, and (ii) $\mathrm{att}(c_l) = \mathrm{att}(c_r)$ for any or-node $c$ of $C$. An NNF circuit satisfying both conditions is called smooth. As shown in (Darwiche, 2001), any Boolean circuit $C$ can be transformed into an equivalent smooth NNF circuit of size linear in $|C|$.

By viewing literals as input gates, and nodes as output gates, we may specify various inference tasks on Boolean circuits, depending on the type of input values and the semantics of nodes. As suggested by Friesen & Domingos (2016) for sum-product functions, inference tasks can be captured through semiring operations.

| $Q$ | $R$ | $\oplus$ | $\otimes$ | $\bar{1}$ | $\bar{0}$ |
|---|---|---|---|---|---|
| maxmin | $\{0,1\}$ | $\max$ | $\min$ | $1$ | $0$ |
| minsum | $\mathbb{R} \cup \{+\infty\}$ | $\min$ | $+$ | $0$ | $+\infty$ |
| sumprod | $\mathbb{R}_+$ | $+$ | $\times$ | $1$ | $0$ |

Table 1: Commutative semirings
To this point, recall that a commutative semiring is a tuple $(R, \oplus, \otimes, \bar{0}, \bar{1})$ such that $R$ is a set including the elements $\bar{0}$ and $\bar{1}$, $\oplus$ is an associative and commutative binary operation on $R$ with identity element $\bar{0}$, $\otimes$ is an associative binary operation on $R$ with identity element $\bar{1}$ and absorbing element $\bar{0}$, and the operator $\otimes$ left- and right-distributes over the operator $\oplus$.

Inference tasks on an NNF circuit $C$ are defined using a commutative semiring $Q = (R, \oplus, \otimes, \bar{0}, \bar{1})$ and an input vector $w \in R^d$. The output of a node $c$ in $C$ for $Q$ given $w$ is denoted $Q(c\,|\,w)$, and recursively defined by:

$$Q(c\,|\,w) = \begin{cases} w(i) & \text{if } c \text{ is the literal } x_i,\\ \bar{1} & \text{if } c \text{ is the constant } 1,\\ \bar{0} & \text{if } c \text{ is the constant } 0,\\ Q(c_l\,|\,w) \oplus Q(c_r\,|\,w) & \text{if } c \text{ is an or-node},\\ Q(c_l\,|\,w) \otimes Q(c_r\,|\,w) & \text{if } c \text{ is an and-node.} \end{cases}$$

By $Q(C\,|\,w)$, we denote the output of the root of $C$ for $Q$ given $w$. Of particular interest in this study are the semirings described in Table 1, namely maxmin, minsum, and sumprod, which are used to capture the inference tasks of model checking, linear optimization, and model sampling, respectively.

2.1. Model Checking

Given an NNF circuit $C$ over $X$, the task of model checking is to decide whether a Boolean input $s \in \{0,1\}^d$ is true in $C$ according to the propositional semantics of nodes. Obviously, $s$ is a model of $C$ iff $\mathrm{maxmin}(C\,|\,s) = 1$, which can be determined in $O(|C|)$ time. An NNF circuit $C$ is called a representation of a set of feasible solutions $S \subseteq \{0,1\}^d$ if $\mathrm{sol}(C) = S$, where $\mathrm{sol}(C)$ is the set of models of $C$. Apart from model checking, virtually all inference tasks in NNF circuits are NP-hard. Indeed, the NNF language covers the class of SAT formulas. So, we need to refine this class in order to get tractable forms of optimization and sampling.

2.2. Decomposability and Optimization

A Boolean circuit $C$ is decomposable if, for every and-node $c$ of $C$, we have $\mathrm{att}(c_l) \cap \mathrm{att}(c_r) = \emptyset$. The class of decomposable NNF circuits is denoted DNNF. For such circuits, which are similar to Boolean sum-product networks (Poon & Domingos, 2011), we can get an efficient implementation of the linear optimization oracle.

Proposition 1. Let $S \subseteq \{0,1\}^d$ be a (nonempty) decision set represented by a DNNF circuit $C$, and let $w \in \mathbb{R}^d$ be a modular objective. Then, finding a minimizer of $w$ in $S$ can be done in $O(|C|)$ time.

Proof. Based on the minsum semiring, we have

$$\min_{s \in S} \langle w, s\rangle = \min_{s \in \mathrm{sol}(C)} \langle w, s\rangle = \mathrm{minsum}(C\,|\,w)$$

This observation suggests a two-pass weight-pushing method for finding a minimizer $s$ of $w$ in $S$ in $O(|C|)$ time. Given a topological ordering of $C$, the first pass stores the value $Q(c\,|\,w)$ of each node $c \in C$, using $Q = \mathrm{minsum}$. The second pass performs a top-down search over $C$, by selecting all children of a visited and-node, and by selecting exactly one child $c' \in \{c_l, c_r\}$ of a visited or-node $c$ such that $Q(c'\,|\,w) = Q(c\,|\,w)$. Let $T$ be the corresponding search tree, and let $s \in \{0,1\}^d$ be the indicator vector of the set of literals occurring in $T$. By construction, we have $Q(T\,|\,w) = Q(C\,|\,w)$, which implies that $s$ is a minimizer of $w$. Since $S$ is not empty, we know that $Q(C\,|\,w) < +\infty$. This, together with the fact that $\mathrm{maxmin}(T\,|\,s) = 1$ whenever $Q(T\,|\,w) < +\infty$, implies that $s \in S$.

2.3. Determinism and Sampling

As the problem of counting the number of models in a DNNF circuit is #P-hard (Darwiche & Marquis, 2002), we need to refine this class in order to get an efficient implementation of the sampling oracle. To this end, an NNF circuit $C$ is called deterministic if $\mathrm{maxmin}(c_l\,|\,s) + \mathrm{maxmin}(c_r\,|\,s) \leq 1$ for every or-node $c \in C$ and every feasible solution $s$. The class of deterministic DNNF circuits is denoted d-DNNF.
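Before stating the corresponding sampling result, the following Python sketch makes the two-pass weight-pushing method used in the proof of Proposition 1 concrete. The small `Node` class is a hypothetical in-memory representation introduced purely for illustration (it is not the output format of an actual d-DNNF compiler such as D4), and recursion is used for brevity where a real implementation would iterate over a topological ordering.

```python
# Minimal sketch of the two-pass weight-pushing minimizer of Proposition 1.
# The Node representation below is a simplifying assumption, not the format
# produced by any particular compiler.
import math

class Node:
    """A smooth (D)NNF node: kind is 'and', 'or', 'lit', 'true' or 'false';
    internal nodes have exactly two children; leaves carry a literal index."""
    def __init__(self, kind, children=(), literal=None):
        self.kind, self.children, self.literal = kind, tuple(children), literal

def minsum(node, w, cache):
    """Bottom-up pass: evaluate the circuit in the (min, +) semiring, i.e.
    the cost of the cheapest model of the sub-circuit rooted at each node."""
    if id(node) in cache:
        return cache[id(node)]
    if node.kind == 'lit':
        val = w[node.literal]
    elif node.kind == 'true':
        val = 0.0                      # identity element of +
    elif node.kind == 'false':
        val = math.inf                 # identity element of min
    elif node.kind == 'and':
        val = sum(minsum(c, w, cache) for c in node.children)
    else:                              # or-node
        val = min(minsum(c, w, cache) for c in node.children)
    cache[id(node)] = val
    return val

def argmin_solution(root, w, d):
    """Top-down pass: keep every child of an and-node, keep one optimal child
    of each or-node, and return the indicator vector of the collected literals
    (a minimizer of <w, s> over sol(C))."""
    cache = {}
    if minsum(root, w, cache) == math.inf:
        raise ValueError("the circuit is unsatisfiable")
    s = [0] * d
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == 'lit':
            s[node.literal] = 1
        elif node.kind == 'and':
            stack.extend(node.children)
        elif node.kind == 'or':
            stack.append(min(node.children, key=lambda c: cache[id(c)]))
    return s
```

On a smooth DNNF representation of $S$, a call such as `argmin_solution(root, w, d)` would return a feasible minimizer of $\langle w, s\rangle$ in time linear in the circuit size, exactly as in the proof above.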
Proposition 2. Let $S \subseteq \{0,1\}^d$ be a decision set represented by a d-DNNF circuit $C$, and for a vector $w \in \mathbb{R}^d$, let $P_w$ be the exponential family on $S$ given by:

$$P_w(s) = \frac{\exp\langle w, s\rangle}{\sum_{s' \in S} \exp\langle w, s'\rangle}$$

Then, sampling $s \sim P_w$ can be done in $O(|C|)$ time.

Proof. Based on the sumprod semiring, we have

$$P_w(s) = \frac{\exp\langle w, s\rangle}{\sum_{s' \in \mathrm{sol}(C)} \exp\langle w, s'\rangle} = \frac{\exp\langle w, s\rangle}{\mathrm{sumprod}(C\,|\,w')}$$

where $w' = (e^{w(1)}, \ldots, e^{w(d)})$. Again, such an equivalence suggests a two-pass weight-pushing method for sampling a solution $s$ according to $P_w$ in $O(|C|)$ time. Using a topological ordering of $C$, the first pass stores the values $Q(c\,|\,w')$, where $Q = \mathrm{sumprod}$. The second pass performs a top-down randomized search over $C$, by selecting all children of a visited and-node, and by drawing at random one of the children of a visited or-node $c$ according to the distribution $p(c_l) = Q(c_l\,|\,w')/Q(c\,|\,w')$ and $p(c_r) = 1 - p(c_l)$. Let $T$ be the tree of visited nodes, and $s$ be the indicator vector of the literals in $T$. Since $S \neq \emptyset$, we must have $Q(C\,|\,w') > 0$. Thus, each Bernoulli test performed in $T$ is valid, and hence, $s \in S$. For any literal $x_i$ occurring in $T$, let $p(x_i)$ denote the probability of the (unique) path connecting the root to $x_i$. By a telescoping product of Bernoulli distributions, we get that $p(x_i) = e^{w(i)}/Q(C\,|\,w')$. Therefore, $p(s) = \prod_{i : s(i)=1} p(x_i) = P_w(s)$, as desired.

[Figure 1: A smooth DNNF circuit $C$ (left) and a d-DNNF circuit $C'$ (right). Here, $x_i$ and $\bar{x}_i$ are distinct literals sharing the same attribute. Note that $C'$ can be smoothed by simply replacing $x_3$ with $x_3 \wedge (x_4 \vee \bar{x}_4)$, and using a symmetric substitution for $x_4$.]

We close this section by highlighting some interesting subclasses of d-DNNF. A decision node is an or-node of the form $(x_i \wedge c'_l) \vee (\bar{x}_i \wedge c'_r)$, where $x_i$ and $\bar{x}_i$ are opposite literals, and $c'_l$ and $c'_r$ are arbitrary nodes. The class of Free Binary Decision Diagrams (FBDD) is the subset of d-DNNF in which every or-node is a decision node, and at least one child of any and-node is a literal (Wegener, 2000). For example, if in the d-DNNF circuit of Figure 1 we replace the or-node (in blue) by a simple literal, say $x_3$, then we get an FBDD circuit. The family of Ordered Binary Decision Diagrams (OBDD) is the subclass of FBDD obtained by imposing a fixed ordering on the decision variables. Alternatively, the well-known family of (Binary) Decision Trees (DT) is the subclass of FBDD circuits for which the primal graph is cycle-free. Since all these classes are (strict) subsets of d-DNNF, they admit linear-time algorithms for linear optimization and model sampling.

3. Tractable Prediction via Compilation

After an excursion into compilation languages, we are now ready to provide efficient characterizations of combinatorial prediction strategies. Our results are summarized in Table 2.

| Algorithm | Regret | Runtime |
|---|---|---|
| EH | $O(d^{3/2}\sqrt{T})$ | $O(|C|)$ |
| FPL | $O(d^{3/2}\sqrt{T})$ | $O(|C|)$ |
| SGD+PCG | $O(d(\sqrt{T} + \ln T))$ | $O(d^2|C|\ln(dT))$ |
| δ-CH+PCG | $O(d(\sqrt{T} + \ln T))$ | $O\big((d^2|C|/\delta)\ln(dT/\delta)\big)$ |

Table 2: Expected regrets and per-round running times of combinatorial prediction strategies, implemented on a d-DNNF circuit $C$.

Notably, using the fact that $\|s\|_1 = d/2$, the regret bounds for EH and FPL can easily be derived from (Audibert et al., 2011) and (Hutter & Poland, 2005), respectively. Both strategies are straightforward to implement on d-DNNF circuits. Indeed, recall that EH draws, at each trial $t$, a feasible solution $s_t \in S$ at random according to the distribution $P_{-\eta L_t}$, where $L_t = \ell_1 + \cdots + \ell_{t-1}$.
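This sampling step is exactly the oracle of Proposition 2. As an illustration, here is a minimal Python sketch of the corresponding weight-pushing pass, reusing the hypothetical `Node` representation introduced earlier; a practical implementation would work with log-weights (or per-node normalization) to avoid overflow when the entries of $\eta L_t$ are large.

```python
# Minimal sketch of the sampling pass of Proposition 2: draw s ~ P_w with
# P_w(s) proportional to exp<w, s>, by one bottom-up sumprod pass followed
# by one randomized top-down pass. Reuses the hypothetical Node class above.
import math
import random

def sumprod(node, expw, cache):
    """Bottom-up pass in the (+, x) semiring on the transformed weights
    expw[i] = exp(w[i]); on a d-DNNF circuit, the root value is the
    partition function of the exponential family over sol(C)."""
    if id(node) in cache:
        return cache[id(node)]
    if node.kind == 'lit':
        val = expw[node.literal]
    elif node.kind == 'true':
        val = 1.0
    elif node.kind == 'false':
        val = 0.0
    elif node.kind == 'and':
        val = math.prod(sumprod(c, expw, cache) for c in node.children)
    else:                              # or-node
        val = sum(sumprod(c, expw, cache) for c in node.children)
    cache[id(node)] = val
    return val

def sample_solution(root, w, d, rng=random):
    """Top-down pass: keep all children of an and-node and branch at each
    or-node with probability proportional to the children's sumprod values."""
    expw = [math.exp(v) for v in w]
    cache = {}
    if sumprod(root, expw, cache) == 0.0:
        raise ValueError("the circuit is unsatisfiable")
    s = [0] * d
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == 'lit':
            s[node.literal] = 1
        elif node.kind == 'and':
            stack.extend(node.children)
        elif node.kind == 'or':
            left, right = node.children
            p_left = cache[id(left)] / cache[id(node)]
            stack.append(left if rng.random() < p_left else right)
    return s
```

For EH, one would call something like `sample_solution(root, [-eta * L[i] for i in range(d)], d)` at the start of round $t$, where `L` holds the cumulative loss vector $L_t$.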
By direct application of Proposition 2, the EH strategy thus runs in $O(|C|)$ time per round, using a d-DNNF representation $C$ of the decision set $S$. For the FPL strategy, each round $t$ is performed by choosing a minimizer $s_t \in S$ of the objective function $\eta L_t - z_t$, where $z_t \in \mathbb{R}^d$ is a perturbation vector whose components are independent, exponentially distributed random variables. By Proposition 1, the FPL strategy also runs in $O(|C|)$ time per round, using a d-DNNF encoding $C$ of $S$ and the fact that $|C|$ is in $\Omega(d)$. However, the OSMD strategy and its specializations, SGD and CH, require more attention, due to the projection-decomposition step involved at each iteration.

3.1. Online Stochastic Mirror Descent

The overall idea of Online Mirror Descent (OMD) is to follow the regularized leader through a primal-dual approach (Nemirovski & Yudin, 1983; Beck & Teboulle, 2003). Let $K$ be a convex set, and let $\mathrm{int}(K)$ denote its interior. Given a regularization function $F$ defined on $K$, OMD iteratively performs a gradient descent in the interior of the dual space $K^*$, and projects the dual point back into the primal space $K$. The connection between $K$ and $K^*$ is ensured by the gradients $\nabla F$ and $\nabla F^*$, where $F^*$ is the convex conjugate of $F$, defined on $K^*$. The projection step is captured by the Bregman divergence of $F$, which is a function $B_F : K \times \mathrm{int}(K) \to \mathbb{R}$ given by:

$$B_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q\rangle$$

In the stochastic variant of OMD, introduced by Audibert et al. (2011; 2014) and specified in Algorithm 1, each projection is performed onto the subset $\mathrm{conv}(S)$ of $K$, and the resulting point $p_t$ is decomposed into a convex combination of feasible solutions in $S$, from which one is picked at random for the prediction task.

Algorithm 1: OSMD
  Input: decision set $S \subseteq \{0,1\}^d$, horizon $T \in \mathbb{Z}_+$
  Parameters: regularizer $F$ on $K \supseteq \mathrm{conv}(S)$, step-size $\eta \in (0,1]$
  set $u_1 = 0$
  for $t = 1$ to $T$ do
    set $p_t \in \mathrm{Argmin}_{p \in \mathrm{conv}(S)}\, B_F(p, \nabla F^*(u_t))$
    play $s_t \sim p_t$ and observe $\ell_t$
    set $u_{t+1} = \nabla F(p_t) - \eta \ell_t$
  end for

For common regularizers, the gradient $\nabla F(p_t)$ and its dual $\nabla F^*(u_t)$ are easily calculable, and we shall assume that the time spent on their construction is negligible compared with the running time of the linear optimization oracle. In fact, the computational bottleneck of OSMD is to find a minimizer $p_t$ of $B_F(p, \nabla F^*(u_t))$ in the convex hull of $S$, and to decompose $p_t$ into a convex combination of solutions in $S$. Fortunately, under reasonable assumptions about the curvature of $B_F$, this projection-decomposition step can be efficiently computed, using recent results on projection-free convex optimization algorithms.

To this end, we need additional definitions. For a convex set $K$, a differentiable function $f : K \to \mathbb{R}$ is called α-strongly convex with respect to a norm $\|\cdot\|$ if

$$f(p') \geq f(p) + \langle \nabla f(p), p' - p\rangle + \frac{\alpha}{2}\|p' - p\|^2$$

Furthermore, $f$ is called β-smooth with respect to $\|\cdot\|$ (this notion of geometric smoothness should not be confused with the structural smoothness of NNF circuits in Section 2) if

$$f(p') \leq f(p) + \langle \nabla f(p), p' - p\rangle + \frac{\beta}{2}\|p' - p\|^2$$

Based on these notions, we say that a Bregman divergence $B_F$ has condition number $\beta/\alpha$ if $B_F$ is both α-strongly convex and β-smooth with respect to the Euclidean norm $\|\cdot\|_2$ in its first argument. For such regularizers, the next result states that the projection-decomposition step can be approximated in low polynomial time, by exploiting the Pairwise Conditional Gradient (PCG) method, a variant of the Frank-Wolfe convex optimization algorithm, whose convergence rate has been analyzed in (Lacoste-Julien & Jaggi, 2015; Garber & Meshi, 2016; Bashiri & Zhang, 2017).

Lemma 1. Let $S \subseteq \{0,1\}^d$ be a decision set represented by a d-DNNF circuit $C$, and let $F$ be a regularizer on $K \supseteq \mathrm{conv}(S)$ such that $B_F$ has condition number $\beta/\alpha$.
Then, for any $q \in \mathrm{int}(K)$ and $\epsilon \in (0,1)$, one can find in $O\!\big(\tfrac{\beta}{\alpha} d^2 |C| \ln\tfrac{\beta d}{\epsilon}\big)$ time a convex decomposition of a point $p \in \mathrm{conv}(S)$ such that

$$B_F(p, q) - \min_{p' \in \mathrm{conv}(S)} B_F(p', q) \leq \epsilon$$

Algorithm 2: PCG
  Input: $S \subseteq \{0,1\}^d$, $f : K \to \mathbb{R}$, $m \in \mathbb{Z}_+$
  Parameters: step-sizes $\{\eta_j\}_{j=1}^m$
  let $p_1$ be some point in $S$
  for $j = 1$ to $m$ do
    let $\sum_{i=1}^{j} \alpha_i s_i$ be the convex decomposition of $p_j$
    set $s_j^+ \in \mathrm{Argmin}_{p \in \mathrm{conv}(S)} \langle \nabla f(p_j), p\rangle$
    set $s_j^- \in \mathrm{Argmax}_{s \in \{s_1, \ldots, s_j\}} \langle \nabla f(p_j), s\rangle$
    set $p_{j+1} = p_j + \eta_j (s_j^+ - s_j^-)$
  end for

Proof. Observe that $\mathrm{conv}(S)$ is a simplex-like polytope (Bashiri & Zhang, 2017), defined by the linear constraints $p \geq 0$, $\sum_{i=1}^{N} \alpha_i s_i = p$, $\alpha \geq 0$, and $\sum_{i=1}^{N} \alpha_i = 1$, where $N = |S|$. So, $\mathrm{conv}(S)$ and $B_F$ satisfy the conditions of Theorem 1 in (Garber & Meshi, 2016), and using the step-sizes advocated by the authors, we get that

$$B_F(p_m, q) - B_F(p^*, q) \leq \frac{\beta d}{2} \exp\!\left(-\frac{\alpha}{8\beta d^2}\, m\right)$$

where $p_m$ is the point obtained at the last iteration of PCG, and $p^*$ is the (unique) minimizer of $B_F(p, q)$ on $\mathrm{conv}(S)$. Therefore, after $m \geq (8 d^2 \beta/\alpha) \ln(\beta d/(2\epsilon))$ iterations, we have $B_F(p_m, q) - B_F(p^*, q) \leq \epsilon$. Finally, since each iteration of PCG makes one call to the linear optimization oracle, the runtime complexity follows from Proposition 1.

By OSMD+PCG, we denote the refined version of the OSMD algorithm that uses the PCG method at each trial $t$ in order to approximate the Bregman projection-decomposition step. In addition to a regularizer $F$ and a step-size $\eta$, OSMD+PCG takes as parameters a sequence $\{\epsilon_t\}_{t=1}^T$ such that

$$B_F(p_t, q_t) - B_F(p^*_t, q_t) \leq \epsilon_t$$

where $p_t$ is the point returned by PCG, $q_t = \nabla F^*(u_t)$, and $p^*_t$ is the minimizer of $B_F(p, q_t)$ over $\mathrm{conv}(S)$.

Theorem 1. Suppose that OSMD+PCG takes as input a d-DNNF representation $C$ of a decision set $S \subseteq \{0,1\}^d$ and a horizon $T$, and uses a regularizer $F$ on $K \supseteq \mathrm{conv}(S)$ such that $B_F$ has condition number $\beta/\alpha$, together with a step-size $\eta \in (0,1]$ and a sequence $\{\epsilon_t\}_{t=1}^T$ such that $\epsilon_t = \gamma/t^2$ for $\gamma > 0$. Then, OSMD+PCG attains the expected regret

$$\sqrt{\frac{2d\gamma}{\alpha}}\,(\ln T + 1) + \frac{1}{\eta} \max_{s \in S} B_F(s, p^*_1) + \frac{1}{\eta} \sum_{t=1}^{T} B_{F^*}\!\big(\nabla F(p^*_t) - \eta \ell_t,\, \nabla F(p^*_t)\big) \qquad (1)$$

with a per-round running time in $O\!\big(\tfrac{\beta}{\alpha} d^2 |C| \ln(\beta d T)\big)$.

Proof. Let $s^* \in S$ be the optimal solution chosen with the benefit of hindsight. By decomposing the regret, we have

$$R_T \leq \sum_{t=1}^{T} \langle \ell_t, p^*_t - s^*\rangle + \sum_{t=1}^{T} \mathbb{E}\,\langle \ell_t, s_t - p^*_t\rangle \qquad (2)$$

By Theorem 2 in (Audibert et al., 2014), the first term in (2) is bounded by the last two terms in (1). For the second term in (2), we get from the Cauchy-Schwarz inequality that

$$\mathbb{E}\,\langle \ell_t, s_t - p^*_t\rangle \leq \|\ell_t\|_2\, \|p_t - p^*_t\|_2$$

Moreover, by applying the Generalized Pythagorean Theorem (Cesa-Bianchi & Lugosi, 2006), we know that $B_F(p, q_t) \geq B_F(p, p^*_t) + B_F(p^*_t, q_t)$ for any $p \in \mathrm{conv}(S)$. Using $p = p_t$ and rearranging,

$$B_F(p_t, p^*_t) \leq B_F(p_t, q_t) - B_F(p^*_t, q_t) \leq \epsilon_t \qquad (3)$$

Since $B_F$ is α-strongly convex with respect to $\|\cdot\|_2$ in its first argument, we also have $\frac{\alpha}{2}\|p_t - p^*_t\|_2^2 \leq B_F(p_t, p^*_t)$. Thus, by plugging this inequality into (3), we get that $\mathbb{E}\,\langle \ell_t, s_t - p^*_t\rangle \leq \sqrt{2d\epsilon_t/\alpha}$. Finally, by substituting $\epsilon_t$ with $\gamma/t^2$, summing over $t \leq T$, and applying the logarithmic bound on harmonic series, we obtain the desired result.

3.2. Stochastic Gradient Descent

The (online) SGD algorithm is derived from OSMD using the Euclidean regularizer $F(p) = \frac{1}{2}\|p\|_2^2$. In this simple framework, the primal and dual spaces coincide with $\mathbb{R}^d$, and hence $F^*(u) = \frac{1}{2}\|u\|_2^2$, $\nabla F(p) = p$, and $\nabla F^*(u) = u$. Furthermore, $B_F$ has condition number $1/1 = 1$, since $B_F(p, q) = \frac{1}{2}\|p - q\|_2^2$. We denote by SGD+PCG the instance of OSMD+PCG defined on the Euclidean regularizer.

Proposition 3.
The SGD+PCG algorithm achieves an expected regret bounded by $d(\sqrt{T} + \ln T + 1)$, with a per-round runtime complexity in $O(d^2|C|\ln(dT))$, using $\eta = 1/\sqrt{T}$ and $\gamma = d/2$.

Proof. This simply follows from Theorem 1, together with the facts that $\max_{s \in S} B_F(s, p^*_1) \leq d/2$ and $\|\ell_t\|_2^2 \leq d$.

3.3. Component Hedge

The CH algorithm is derived from OSMD using the entropic regularizer $F(p) = \sum_{i=1}^{d} p(i)(\ln p(i) - 1)$, for which the conjugate is $F^*(u) = \sum_{i=1}^{d} \exp(u(i))$. Here, we cannot find a finite condition number for the associated divergence

$$B_F(p, q) = \sum_{i=1}^{d} \Big( p(i) \ln\frac{p(i)}{q(i)} - \big(p(i) - q(i)\big) \Big)$$

since its gradient is unbounded. This issue may, however, be circumvented using a simple trick advocated in (Krichene et al., 2015), which consists in replacing the entropic regularizer with the function $F_\delta(p) = F(p + \boldsymbol{\delta})$, where $\delta \in (0,1)$ and $\boldsymbol{\delta} = (\delta, \ldots, \delta)$. For this function, the primal space is $(-\delta, +\infty)^d$, and since $F^*_\delta(u) = F^*(u) - \langle u, \boldsymbol{\delta}\rangle$, the dual space is $\mathbb{R}^d$. It is easy to show that

$$\big[\nabla F_\delta(p)\big](i) = \ln(p(i) + \delta), \qquad \big[\nabla F^*_\delta(u)\big](i) = e^{u(i)} - \delta,$$
$$B_{F_\delta}(p, q) = B_F(p + \boldsymbol{\delta}, q + \boldsymbol{\delta}), \qquad B_{F^*_\delta}(u, v) = B_{F^*}(u, v),$$

where $B_{F^*}(u, v) = \sum_{i=1}^{d} e^{v(i)}\big(e^{u(i) - v(i)} + v(i) - u(i) - 1\big)$. Furthermore, since the first- and second-order partial derivatives of $B_{F_\delta}(p, q)$ at the coordinate $p(i)$ are

$$\frac{\partial B_{F_\delta}(p, q)}{\partial p(i)} = \ln\frac{p(i) + \delta}{q(i) + \delta} \qquad \text{and} \qquad \frac{\partial^2 B_{F_\delta}(p, q)}{\partial p(i)^2} = \frac{1}{p(i) + \delta},$$

it follows that $B_{F_\delta}$ has condition number $(1 + \delta)/\delta$. Indeed, given an arbitrary point $q \in (-\delta, +\infty)^d$, let $H_q(p)$ denote the Hessian matrix of $B_{F_\delta}(p, q)$ at $p \in \mathrm{conv}(S)$. Then, for any $z \in \mathbb{R}^d$, the diagonal entries of $H_q(p)$ satisfy

$$\frac{1}{1 + \delta}\, z(i)^2 \;\leq\; \frac{\partial^2 B_{F_\delta}(p, q)}{\partial p(i)^2}\, z(i)^2 \;\leq\; \frac{1}{\delta}\, z(i)^2,$$

using the fact that $p(i) \in [0,1]$. Thus, $\alpha I \preceq H_q(p) \preceq \beta I$ for $\alpha = 1/(1+\delta)$ and $\beta = 1/\delta$. In what follows, the instance of OSMD+PCG that uses $F_\delta$ as regularizer is called δ-CH+PCG.

Proposition 4. The δ-CH+PCG algorithm achieves an expected regret bounded by $d(1 + 2\delta)(\sqrt{T} + \ln T + 1)$, with a per-round runtime complexity in $O\big((d^2|C|/\delta)\ln(dT/\delta)\big)$, using $\eta = 1/\sqrt{T}$ and $\gamma = 2d(1/2 + \delta)^2/(1 + \delta)$.

Proof. The runtime complexity simply follows from Theorem 1. The regret bound is obtained by bounding the second and third terms of (1), and using the above values for $\eta$ and $\gamma$. Using $s^*_1$ as a maximizer of the second term of (1), we have $B_{F_\delta}(s^*_1, p^*_1) = F_\delta(s^*_1) - F_\delta(p^*_1)$. Using the notation $p_1 = p^*_1 + \boldsymbol{\delta}$ and $r = d(1/2 + \delta)$, we get that

$$F_\delta(s^*_1) - F_\delta(p^*_1) \;\leq\; \sum_{i=1}^{d} p_1(i) \ln\frac{1}{p_1(i)} \;\leq\; r \ln\frac{d}{r},$$

which is bounded by $r$. For the third term of (1), observe that $F_\delta$ is $\frac{1}{(1+\delta)d}$-strongly convex with respect to the norm $\|\cdot\|_1$, since $\|p - p'\|_1^2 \leq d\,\|p - p'\|_2^2$. By Theorem 3 in (Kakade et al., 2012), it follows that $F^*_\delta$ is $(1+\delta)d$-smooth with respect to the norm $\|\cdot\|_\infty$. Therefore,

$$\frac{1}{\eta}\, B_{F^*}\!\big(\nabla F(p^*_t) - \eta\ell_t,\, \nabla F(p^*_t)\big) \;\leq\; \frac{\eta}{2}\, d(1+\delta)\, \|\ell_t\|_\infty^2,$$

which is bounded by $\eta r$.

4. Experiments

In order to evaluate the performance of the different online combinatorial optimization strategies examined in Section 3, we have considered 16 instances of the SAT Library (www.cs.ubc.ca/~hoos/SATLIB/), described in Table 3. Namely, the first six rows of the table are (car) configuration tasks, while the remaining rows are planning problems. The first four columns of the table report the name of the SAT instance, the number of attributes ($d/2$), the number of constraints ($|\mathrm{SAT}|$), and the number $|S|$ of feasible solutions. We have used the recent D4 compiler (www.cril.univ-artois.fr/KC/d4.html) (Lagniez & Marquis, 2017) for transforming SAT instances into d-DNNF circuits. The size $|C|$ of the compiled circuit is reported in the fifth column. In order to simulate combinatorial prediction games, we have used the following protocol.
Suppose that the set $X = \{x_1, \ldots, x_d\}$ of literals is sorted in a lexicographic way, so that for each odd integer $i$, the pair $(x_i, x_{i+1})$ encodes both configurations of the same binary attribute. First, we construct a vector $\mu_0 \in [0,1]^{d/2}$ of independent Bernoulli parameters. At each round $t \in \{1, \ldots, T\}$, $\mu_t$ is set to $\mu_{t-1}$ with probability 0.9, or picked uniformly at random from $[0,1]^{d/2}$ with probability 0.1. Then, the feedback supplied to the learner is a vector $\ell_t \in \{0,1\}^d$ such that $\ell_t(i) + \ell_t(i+1) = 1$, and $\ell_t(i) = 1$ with probability $\mu_t\big(\tfrac{i+1}{2}\big)$ for each odd integer $i$. So, $\ell_t(i+1) = 1$ with probability $1 - \mu_t\big(\tfrac{i+1}{2}\big)$. Although this protocol is essentially stochastic, the environment secretly resets $\mu_t$ with probability 0.1 at each round to foil the learner.

The combinatorial prediction strategies were implemented in C++ (www.github.com/frederic-koriche/ccpg.git) and tested on a six-core Intel i7-5930K with 32 GiB RAM. For the FPL and EH algorithms, we used the step-size $\eta$ reported in (Audibert et al., 2011) and (Hutter & Poland, 2005), respectively. Concerning the SGD+PCG and δ-CH+PCG algorithms, we used for $\eta$ and $\gamma$ the values determined by our theoretical analysis; the step-sizes $\{\eta_j\}$ of PCG were computed by binary search, as advocated by Garber & Meshi (2016) in their experiments, and the value of $\delta$ was fixed to $1/\ln d$ in order to keep a quadratic runtime complexity for δ-CH+PCG. Finally, the horizon $T$ was set to $10^3$, and a timeout of one day was fixed for learning.

In our experiments, the regret is measured by the difference in cumulative loss between the algorithm and the best feasible solution in hindsight, which is obtained using the linear optimization oracle at horizon $T$. This measure is averaged over 10 simulations, and divided by $T$ to yield an average empirical regret. Similarly, the per-round runtimes (in seconds) are averaged over 10 simulations. The corresponding results are reported in the last four columns of Table 3.

| Name | $d/2$ | \|SAT\| | \|S\| | \|C\| | EH | FPL | SGD+PCG | δ-CH+PCG |
|---|---|---|---|---|---|---|---|---|
| c140-fc | 1828 | 4267 | 5.74·10^151 | 6.94·10^5 | 164 ± 25 (22s) | 215 ± 48 (21s) | — | — |
| c163-fw | 1815 | 3580 | 2.97·10^140 | 8.93·10^5 | 98 ± 27 (24s) | 112 ± 44 (23s) | — | — |
| c169-fv | 1411 | 637 | 3.22·10^15 | 7.20·10^2 | 3 ± 1 (<1s) | 5 ± 2 (<1s) | 1 ± 0.2 (1s) | 1 ± 0.2 (1s) |
| c211-fs | 1635 | 2536 | 1.37·10^67 | 1.75·10^3 | 12 ± 3 (<1s) | 12 ± 5 (<1s) | 9 ± 2 (1s) | 8 ± 2 (2s) |
| c250-fv | 1465 | 1050 | 1.20·10^37 | 1.82·10^2 | 11 ± 3 (<1s) | 16 ± 4 (<1s) | 9 ± 2 (1s) | 9 ± 1 (1s) |
| c638-fvk | 1761 | 1893 | 8.83·10^121 | 5.45·10^3 | 104 ± 15 (<1s) | 127 ± 28 (<1s) | 68 ± 4 (2s) | 62 ± 4 (4s) |
| 4-step | 165 | 396 | 8.34·10^4 | 1.29·10^2 | 3 ± 2 (<1s) | 5 ± 3 (<1s) | 1 ± 0.8 (<1s) | 1 ± 0.8 (<1s) |
| 5-step | 177 | 459 | 8.13·10^4 | 5.80·10^1 | 3 ± 1 (<1s) | 3 ± 1 (<1s) | 1 ± 0.9 (<1s) | 1 ± 0.8 (<1s) |
| log-1 | 939 | 3742 | 5.64·10^20 | 9.43·10^2 | 46 ± 5 (<1s) | 51 ± 8 (<1s) | 7 ± 3 (<1s) | 7 ± 1 (<1s) |
| log-2 | 1337 | 24735 | 3.23·10^10 | 1.16·10^4 | 16 ± 3 (1s) | 19 ± 6 (1s) | 12 ± 3 (16s) | 11 ± 3 (22s) |
| log-3 | 1413 | 29445 | 2.79·10^11 | 4.96·10^3 | 19 ± 4 (<1s) | 21 ± 7 (<1s) | 15 ± 4 (5s) | 13 ± 4 (5s) |
| log-4 | 2303 | 20911 | 2.34·10^28 | 9.47·10^4 | 77 ± 11 (2s) | 94 ± 27 (2s) | 19 ± 4 (72s) | 17 ± 4 (81s) |
| tire-1 | 352 | 1022 | 8.29·10^36 | 1.37·10^3 | 3 ± 1 (<1s) | 4 ± 2 (<1s) | 1 ± 0.8 (<1s) | 1 ± 0.8 (<1s) |
| tire-2 | 550 | 1980 | 7.39·10^11 | 7.26·10^2 | 9 ± 2 (<1s) | 12 ± 4 (<1s) | 7 ± 2 (<1s) | 5 ± 1 (<1s) |
| tire-3 | 577 | 1984 | 2.23·10^11 | 6.31·10^3 | 9 ± 4 (<1s) | 14 ± 7 (<1s) | 7 ± 2 (1s) | 5 ± 2 (2s) |
| tire-4 | 812 | 3197 | 1.03·10^14 | 3.97·10^3 | 15 ± 3 (<1s) | 17 ± 5 (<1s) | 8 ± 3 (<1s) | 7 ± 3 (1s) |

Table 3: Experimental results for online combinatorial optimization strategies on SAT instances encoded into d-DNNF circuits. Here, the symbol — indicates that the learner was not able to perform the $T$ rounds in one day.
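For concreteness, here is a minimal Python sketch of the feedback-generation protocol described at the beginning of this section; the constants 0.9 and 0.1 come from the text, while the function and variable names are ours and are not taken from the released C++ code.

```python
# Minimal sketch of the simulation protocol of Section 4: each pair of
# opposite literals (x_i, x_{i+1}) shares a Bernoulli parameter that is kept
# with probability 0.9 and redrawn uniformly with probability 0.1.
import random

def loss_sequence(d, T, keep=0.9, rng=random):
    """Yield T loss vectors in {0,1}^d with l_t(i) + l_t(i+1) = 1 for each
    odd i (1-based in the paper; pairs (2j, 2j+1) in 0-based indexing)."""
    mu = [rng.random() for _ in range(d // 2)]          # mu_0
    for _ in range(T):
        if rng.random() >= keep:                         # reset w.p. 0.1
            mu = [rng.random() for _ in range(d // 2)]
        loss = [0] * d
        for j in range(d // 2):
            bit = 1 if rng.random() < mu[j] else 0
            loss[2 * j] = bit                            # literal x_{2j+1}
            loss[2 * j + 1] = 1 - bit                    # its opposite literal
        yield loss
```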
From the viewpoint of regret, SGD+PCG and δ-CH+PCG outperform EH and FPL, which confirms our theoretical results. We mention in passing that SGD+PCG and δ-CH+PCG are remarkably stable. Contrastingly, FPL exhibits a larger variance. Concerning runtimes, EH and FPL are unsurprisingly faster than SGD+PCG and δ-CH+PCG. Notably, for the hard-to-compile instances c140-fc and c163-fw, both EH and FPL were able to perform each trial in a few tens of seconds, while the OSMD+PCG algorithms took several minutes per round (and hence were unable to process $10^3$ rounds in one day), due to the time spent approximating the Bregman projection step. Yet, it is important to emphasize that the convergence rate of PCG is, in practice, much faster than the theoretical bound of $O(d^2|C|)$. Both SGD+PCG and δ-CH+PCG were able to process nearly all instances in a few seconds per round. For circuits of moderate size, all algorithms run in less than one second per trial. We also observed that SGD+PCG is slightly faster than δ-CH+PCG, especially for large domains where small values of $\delta$ have a significant impact on the runtime complexity. In essence, SGD+PCG offers the best compromise between predictive performance and running time; since all feasible solutions are dense ($\|s\|_1 = d/2$), there is no significant difference in accuracy between SGD+PCG and δ-CH+PCG.

5. Conclusions

We have proposed a general framework for compiling online combinatorial optimization problems whose space of feasible solutions is described by a set of Boolean constraints. Namely, we have focused on the class of d-DNNF circuits, which is endowed with fast inference algorithms for the linear optimization oracle and the sampling oracle. Based on this framework, we have shown that both EH and FPL admit fast implementations for tackling large-scale online combinatorial problems. Particular attention was devoted to the generic OSMD strategy, which involves a computationally expensive projection-decomposition step at each iteration. To this point, we made use of projection-free algorithms, and in particular the PCG method, for approximating this operation. The resulting algorithms, SGD+PCG and δ-CH+PCG, are inevitably slower than EH and FPL, but achieve a better regret performance, as corroborated by our experiments.

We conclude with a few remarks. In light of the current results, a natural perspective of research is to extend our framework to other classes of combinatorial prediction games. Notably, the semi-bandit setting seems within reach. Indeed, the semi-bandit variant of EH, often referred to as EXP2 (Audibert et al., 2014), uses importance weights for estimating the loss at each iteration. By a simple adaptation of Proposition 2, such weights can be computed in linear time. Similarly, the semi-bandit extension of FPL exploits the geometric sampling method for estimating loss vectors (Neu & Bartók, 2016). Again, this iterative method can be implemented in linear time (per iteration) using Proposition 1. Less obvious, however, is the extension of OSMD to semi-bandits: although the extension of CH achieves an optimal expected regret in this setting, its practical use remains limited due to the projection-decomposition step. An interesting open question is to determine whether a combination of CH with PCG is able, in the semi-bandit case, to achieve a quasi-optimal regret in low-polynomial time. Of course, the bandit setting is even more challenging.
To this point, Sakaue et al. (2018) have paved the way using OBDDs for an efficient implementation of the COMBBAND algorithm (Cesa-Bianchi & Lugosi, 2012). Extending their approach to d-DNNF, which is more succinct than OBDD, is a promising direction for future research.

References

Audibert, J-Y., Bubeck, S., and Lugosi, G. Minimax policies for combinatorial prediction games. In Proceedings of the 24th Annual Conference on Learning Theory (COLT 2011), pp. 107-132, 2011.

Audibert, J-Y., Bubeck, S., and Lugosi, G. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31-45, 2014.

Bashiri, M. A. and Zhang, X. Decomposition-invariant conditional gradient for general polytopes with line search. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 2687-2697, 2017.

Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

Biere, A., Heule, M., van Maaren, H., and Walsh, T. Handbook of Satisfiability. IOS Press, 2009.

Bryant, R. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(8):677-691, 1986.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

Creignou, N., Khanna, S., and Sudan, M. Complexity Classification of Boolean Constraint Satisfaction Problems. SIAM Monographs on Discrete Mathematics and Applications, 2001.

Darwiche, A. Decomposable negation normal form. Journal of the ACM, 48(4):608-647, 2001.

Darwiche, A. A compiler for deterministic, decomposable negation normal form. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI 2002), pp. 627-634, 2002.

Darwiche, A. and Marquis, P. A knowledge compilation map. Journal of Artificial Intelligence Research (JAIR), 17:229-264, 2002.

Dyer, M. E., Goldberg, L. A., and Jerrum, M. The complexity of weighted Boolean #CSP. SIAM Journal on Computing, 38(5):1970-1986, 2009.

Friesen, A. L. and Domingos, P. M. The sum-product theorem: A foundation for learning tractable models. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1909-1918, 2016.

Garber, D. and Meshi, O. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 1001-1009, 2016.

Hannan, J. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97-139, 1957.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Helmbold, D. P. and Warmuth, M. K. Learning permutations with exponential weights. Journal of Machine Learning Research, 10:1705-1736, 2009.

Hutter, M. and Poland, J. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639-660, 2005.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865-1890, 2012.

Kalai, A. T. and Vempala, S. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

Koolen, W. M., Warmuth, M. K., and Kivinen, J. Hedging structured concepts. In Proceedings of the 23rd Conference on Learning Theory (COLT 2010), pp. 93-105, 2010.
Krichene, W., Krichene, S., and Bayen, A. M. Efficient Bregman projections onto the simplex. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC 2015), pp. 3291-3298, 2015.

Lacoste-Julien, S. and Jaggi, M. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pp. 496-504, 2015.

Lagniez, J-M. and Marquis, P. An improved decision-DNNF compiler. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), pp. 667-673, 2017.

Mohri, M. General algebraic frameworks and algorithms for shortest-distance problems. Technical Report 981210-10TM, AT&T Labs-Research, 1998.

Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. J. Wiley and Sons, 1983.

Neu, G. and Bartók, G. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. Journal of Machine Learning Research, 17:154:1-154:21, 2016.

Poon, H. and Domingos, P. M. Sum-product networks: A new deep architecture. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI 2011), pp. 337-346, 2011.

Rahmanian, H. and Warmuth, M. K. Online dynamic programming. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 2824-2834, 2017.

Rajkumar, A. and Agarwal, S. Online decision-making in general combinatorial spaces. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pp. 3482-3490, 2014.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

Sakaue, S., Ishihata, M., and Minato, S-I. Efficient bandit combinatorial optimization algorithm with zero-suppressed binary decision diagrams. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), 2018.

Suehiro, D., Hatano, K., Kijima, S., Takimoto, E., and Nagano, K. Online prediction under submodular constraints. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT 2012), pp. 260-274, 2012.

Takimoto, E. and Hatano, K. Efficient algorithms for combinatorial online prediction. In Proceedings of the 24th International Conference on Algorithmic Learning Theory (ALT 2013), pp. 22-32, 2013.

Takimoto, E. and Warmuth, M. K. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773-818, 2003.

Wegener, I. Branching Programs and Binary Decision Diagrams: Theory and Applications. SIAM Monographs on Discrete Mathematics and Applications, 2000.