# Online Learning of k-CNF Boolean Functions

Joel Veness¹, Marcus Hutter², Laurent Orseau¹, Marc Bellemare¹
¹Google DeepMind, ²Australian National University
{veness,lorseau,bellemare}@google.com, marcus.hutter@anu.edu.au

This paper revisits the problem of learning a k-CNF Boolean function from examples, for fixed k, in the context of online learning under the logarithmic loss. We give a Bayesian interpretation to one of Valiant's classic PAC learning algorithms, which we then build upon to derive three efficient, online, probabilistic, supervised learning algorithms for predicting the output of an unknown k-CNF Boolean function. We analyze the loss of our methods, and show that the cumulative log-loss can be upper bounded by a polynomial function of the size of each example.

1 Introduction

In 1984, Leslie Valiant introduced the notion of Probably Approximately Correct (PAC) learnability, and gave three important examples of non-trivial concept classes that can be PAC learnt given nothing more than a sequence of positive examples drawn from an arbitrary IID distribution [Valiant, 1984]. One of these examples was the class of k-CNF Boolean functions, for fixed k. Valiant's approach relied on a polynomial time reduction of this problem to that of PAC learning the class of monotone conjunctions. In this paper, we revisit the problem of learning monotone conjunctions from the perspective of universal source coding, or equivalently, online learning under the logarithmic loss. In particular, we derive three new online, probabilistic prediction algorithms that: (i) learn from both positive and negative examples; (ii) avoid making IID assumptions; (iii) suffer low logarithmic loss for arbitrary sequences of examples; and (iv) run in polynomial time and space. This work is intended to complement previous work on concept identification [Valiant, 1984] and online learning under the 0/1 loss [Littlestone, 1988; Littlestone and Warmuth, 1994].

The main motivation for investigating online learning under the logarithmic loss is the fundamental role it plays within information theoretic applications. In particular, we are interested in prediction methods that satisfy the following "power" desiderata, i.e. methods which: (p) make probabilistic predictions; (o) are strongly online; (w) work well in practice; (e) are efficient; and (r) have well understood regret/loss properties. Methods satisfying these properties can be used in a number of principled and interesting ways: for example, data compression via arithmetic encoding [Witten et al., 1987], compression-based clustering [Cilibrasi and Vitányi, 2005] or classification [Frank et al., 2000; Bratko et al., 2006], and information theoretic reinforcement learning [Veness et al., 2011; 2015]. Furthermore, it is possible to combine such online, log-loss predictors using various ensemble methods [Veness et al., 2012b; Mattern, 2013]. The ability to rapidly exploit deterministic underlying structure such as k-CNF relations has the potential to improve all the aforementioned application areas, and brings the universal source coding literature in line with developments originating from the machine learning community.

Our contribution in this paper stems from noticing that Valiant's method can be interpreted as a kind of MAP model selection procedure with respect to a particular family of priors.
In particular, we show that given n positive examples and their associated d-dimensional binary input vectors, it is possible to perform exact Bayesian inference over the $2^d$ possible monotone conjunction hypotheses in time $O(nd)$ and space $O(d)$, without making IID assumptions. Unfortunately, these desirable computational properties do not extend to the case where both positive and negative examples are presented; we show that in this case exact inference is #P-complete. This result motivated us to develop a hybrid algorithm, which uses a combination of Bayesian inference and memorization to construct a polynomial time algorithm whose loss is bounded by $O(d^2)$ for the class of monotone conjunctions. Furthermore, we show how to trade constant loss for logarithmic cumulative loss to get a more practical algorithm, whose loss is bounded by $O(d \log n)$. We also give an alternative method, based on the WINNOW algorithm [Littlestone, 1988] for the 0/1 loss, which has better theoretical properties in cases where many of the d Boolean inputs are irrelevant. Finally, similarly to Valiant, we describe how to combine our algorithms with a reduction that (for fixed k) enables the efficient learning of k-CNF Boolean functions from examples.

2 Preliminaries

Notation. A Boolean variable $x$ is an element of $\mathbb{B} := \{\bot, \top\} = \{0, 1\}$. We identify false with 0 and true with 1, since this allows us to use Boolean functions as likelihood functions for deterministically generated data. We keep the Boolean operator notation whenever it is more suggestive. The unary not operator is denoted by $\neg$, and is defined by $\neg : 0 \mapsto 1;\ 1 \mapsto 0$ (so $\neg x = 1 - x$). The binary conjunction and disjunction operators are denoted by $\wedge$ and $\vee$ respectively, and are given by the maps $\wedge : (1,1) \mapsto 1$, and 0 otherwise ($x \wedge y = x \cdot y$), and $\vee : (0,0) \mapsto 0$, and 1 otherwise ($x \vee y = \max\{x, y\}$). A literal is a Boolean variable $x$ or its negation $\neg x$; a positive literal is a non-negated Boolean variable. A clause is a finite disjunction of literals. A monotone conjunction is a conjunction of zero or more positive literals. For example, $x_1 \wedge x_3 \wedge x_6$ is a monotone conjunction, while $x_1 \wedge \neg x_3$ is not. We adopt the usual convention with conjunctions of defining the zero-literal case to be vacuously true. The power set of a set $S$ is the set of all subsets of $S$, and will be denoted by $\mathcal{P}(S)$. For convenience, we further define $\mathcal{P}_d := \mathcal{P}(\{1, 2, \ldots, d\})$. We also use the Iverson bracket notation $\llbracket P \rrbracket$, which, given a predicate $P$, evaluates to 1 if $P$ is true and 0 otherwise. We also use the notation $x_{1:n}$ for the sequence $x_1 x_2 \cdots x_n$, and $x_{<n}$ for $x_{1:n-1}$.

... > 0 for each of the $2^d$ possible target strings, which implies that $L_{2^d}(\rho) \ge d$.

Memorization. As a further motivating example, it is instructive to compare the exact Bayesian predictor to that of a naive method for learning monotone conjunctions that simply memorizes the training instances, without exploiting the logical structure within the class. To this end, consider the sequential predictor that assigns a probability of $m_d(x_n \mid x_{<n}; a_{1:n})$ to each possible next target $x_n$.

Now consider a prior over the index set $S$ of the target conjunction under which each of the $d$ variables is included independently with probability $\alpha$, with the expected formula length being $\alpha d$. If we denote the prior over $S$ by $w_\alpha(S) := \alpha^{|S|}(1 - \alpha)^{d - |S|}$, we get the mixture $\xi^\alpha_d(x_{1:n}; a_{1:n}) = \sum_{S \in \mathcal{P}_d} w_\alpha(S)\, \nu_S(x_{1:n}; a_{1:n})$, where $\nu_S(x_{1:n}; a_{1:n}) := \prod_{t=1}^n \llbracket x_t = \bigwedge_{i \in S} a^i_t \rrbracket$ is the deterministic likelihood of the monotone conjunction over the variables indexed by $S$. From this we can read off the maximum a posteriori (MAP) model $S^*_n := \arg\max_{S \in \mathcal{P}_d} w_\alpha(S \mid x_{1:n}; a_{1:n}) = \arg\max_{S \in \mathcal{P}_d} w_\alpha(S)\, \nu_S(x_{1:n} \mid a_{1:n})$ under various choices of $\alpha$.
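To make the mixture concrete, the following minimal Python sketch (ours, for illustration; the function names and the brute-force check are not from the paper) computes $\xi^\alpha_d$ by direct summation over all $2^d$ subsets and, for the all-positive case, via the closed form $(1-\alpha)^{d - |C_n|}$, where $C_n$ denotes the set of variables that are true in every example seen so far. The closed form follows from the prior factorizing over variables, and is one way to see how prediction from positive examples can be carried out in $O(d)$ time per step.

```python
from itertools import product

def nu(S, xs, As):
    """Deterministic likelihood: 1 iff the monotone conjunction over the
    index set S is consistent with every (target, input) pair."""
    return all(x == all(a[i] for i in S) for x, a in zip(xs, As))

def xi_bruteforce(alpha, d, xs, As):
    """xi^alpha_d by summing over all 2^d subsets (exponential; tiny d only)."""
    total = 0.0
    for bits in product([0, 1], repeat=d):
        S = [i for i in range(d) if bits[i]]
        prior = alpha ** len(S) * (1 - alpha) ** (d - len(S))
        total += prior * nu(S, xs, As)
    return total

def xi_positive_closed_form(alpha, d, As):
    """For all-positive targets the mixture collapses to (1 - alpha)^(d - |C_n|),
    where C_n is the set of variables that are true in every example so far."""
    C = set(range(d))
    for a in As:
        C &= {i for i in range(d) if a[i]}
    return (1 - alpha) ** (d - len(C))

# Tiny consistency check (illustrative only).
d, alpha = 4, 0.5
As = [(1, 1, 0, 1), (1, 1, 1, 0)]
xs = [1, 1]
assert abs(xi_bruteforce(alpha, d, xs, As) - xi_positive_closed_form(alpha, d, As)) < 1e-12
```

The resulting one-step predictive probability that the next example is also positive is $(1-\alpha)^{|C_n| - |C_{n+1}|}$, i.e. $(1-\alpha)$ raised to the number of variables newly eliminated from $C_n$ by the latest input, which only requires maintaining $C_n$ in $O(d)$ space.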
For positive examples (i.e. $x_{1:n} = 1_{1:n}$), the MAP model can be rewritten as $S^*_n = \arg\max_{S \in \mathcal{P}_d} w_\alpha(S)\, \nu_S(1_{1:n}; a_{1:n})$. For $\alpha > 1/2$, the MAP model $S^*_n$ at time $n$ is unique, and is given by

$$S^*_n = \{\, i \in \{1, \ldots, d\} : a^i_t = 1 \text{ for all } 1 \le t \le n \,\}. \quad (4)$$

For $\alpha = 1/2$, a MAP model is any subset of $S^*_n$ as defined by Equation 4. For $\alpha < 1/2$, the MAP model is $\{\}$. Finally, we remark that the above results allow for a Bayesian interpretation of Valiant's algorithm for PAC learning monotone conjunctions. His method, described in Section 5 of [Valiant, 1984], after seeing $n$ positive examples, outputs the concept $\bigwedge_{i \in S^*_n} x_i$; in other words, his method can be interpreted as doing MAP model selection using a prior belonging to the above family with $\alpha > 1/2$.

A Heuristic Predictor. Next we discuss a heuristic prediction method that incorporates Proposition 6 to efficiently perform Bayesian learning on only the positive examples, via a probabilistic predictor $\xi^+_d(x_n \mid x_{<n}; a_{1:n})$.

The original WINNOW1 algorithm [Littlestone, 1988] is parametrized by $\alpha > 1$ and $\theta \ge 1/\alpha$. For example, by setting $\theta = d/2$ and $\alpha = 2$, one can bound the number of mistakes by $2|S^*| \log d + 2$. This particular parametrization is well suited to the case where $|S^*| \ll d$; that is, whenever many features are irrelevant. Here we describe an adaptation of this method to the logarithmic loss, along with a worst-case analysis. Although the resultant algorithm will not enjoy as strong a loss guarantee as Algorithm 2 in general, one would prefer its guarantees in situations where $|S^*| \ll d$. The main idea is to assign probability $t/(t+1)$ at time $t$ to the class predicted by WINNOW1 (as applied to monotone conjunctions). Pseudocode for this procedure is given in Algorithm 3.

```
Algorithm 3  ω_d(x_{1:n}; a_{1:n})
Require: α, θ ∈ ℝ such that α > 1, θ ≥ 1/α
 1: w_i ← 1 for 1 ≤ i ≤ d;  r ← 1
 2: for t = 1 to n do
 3:   Observe a_t
 4:   y_t ← ⟦ Σ_{i=1}^d w_i (1 − a_t^i) ≤ θ ⟧
 5:   p_t(y_t; a_t) := t/(t + 1)
 6:   p_t(1 − y_t; a_t) := 1/(t + 1)
 7:   Observe x_t and suffer a loss of −log p_t(x_t; a_t)
 8:   if y_t ≠ x_t then
 9:     if x_t = 1 then
10:       w_i ← 0 if a_t^i = 0, for 1 ≤ i ≤ d
11:     else
12:       w_i ← α·w_i if a_t^i = 0, for 1 ≤ i ≤ d
13:     end if
14:   end if
15:   r ← p_t(x_t; a_t) · r
16: end for
17: return r
```

The following theorem bounds the cumulative loss. Compared with Theorem 9, here we see that a multiplicative dependence on $O(|S^*| \log d)$ is introduced in place of the previous $O(d)$, which is preferred whenever $|S^*| \ll d$.

Theorem 11. If $x_{1:n}$ is generated by a hypothesis $h_{S^*}$ such that $S^* \in \mathcal{P}_d$, then for all $n \in \mathbb{N}$, for all $d \in \mathbb{N}$, for all $x_{1:n} \in \mathbb{B}^n$, for all $a_{1:n} \in \mathbb{B}^{n \times d}$, we have that

$$L_n(\omega_d) \le \left[ \alpha |S^*| (\log_\alpha \theta + 1) + \tfrac{d}{\theta} + 1 \right] \log(n + 1).$$

In particular, if $\theta = d/2$ and $\alpha = 2$ then $L_n(\omega_d) \le (2|S^*| \log d + 2) \log(n + 1)$.

Proof. Let $M_n$ denote the set of times where the original WINNOW1 algorithm would make a mistaken prediction, i.e. $M_n := \{t \in [1, n] : y_t \ne x_t\}$. Also let $\overline{M}_n := [1, n] \setminus M_n$, and write $k := |S^*|$. We can now bound the cumulative loss by

$$L_n(\omega_d) = -\log \omega_d(x_{1:n}; a_{1:n}) \stackrel{(a)}{=} -\log \prod_{t \in M_n} \frac{1}{t+1} \prod_{t \in \overline{M}_n} \frac{t}{t+1} \stackrel{(b)}{\le} -\log \left[ \left( \frac{1}{n+1} \right)^{\alpha k (\log_\alpha \theta + 1) + d/\theta} \prod_{t=1}^{\,n - \alpha k (\log_\alpha \theta + 1) - d/\theta} \frac{t}{t+1} \right]$$

$$= \left[ \alpha k (\log_\alpha \theta + 1) + d/\theta \right] \log(n + 1) + \log\!\left( n - \alpha k (\log_\alpha \theta + 1) - d/\theta + 1 \right) \le \left[ \alpha k (\log_\alpha \theta + 1) + d/\theta + 1 \right] \log(n + 1).$$

Step (a) follows from the definition of Algorithm 3 and $M_n$. Step (b) applies both Equation 14 and the fact that $1/(n+1) \le 1/(t+1)$ for all $1 \le t \le n$.
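As a running-code companion to the pseudocode above, here is a minimal Python sketch reflecting our reading of Algorithm 3: WINNOW1, applied to monotone conjunctions, supplies a deterministic guess $y_t$, and the wrapper assigns probability $t/(t+1)$ to that guess and $1/(t+1)$ to the alternative. The class name and interface are illustrative rather than taken from the paper.

```python
import math

class WinnowLogLoss:
    """Log-loss wrapper around WINNOW1 for monotone conjunctions (cf. Algorithm 3).

    The conjunction is guessed true iff the weighted count of false inputs is
    at most theta; the guess at round t is assigned probability t/(t+1).
    """

    def __init__(self, d, alpha=2.0, theta=None):
        assert alpha > 1.0
        self.theta = d / 2 if theta is None else theta
        assert self.theta >= 1.0 / alpha
        self.alpha = alpha
        self.w = [1.0] * d
        self.t = 0  # number of rounds seen so far

    def predict(self, a):
        """Return (guess y_t, probability assigned to x_t = 1) for a in {0,1}^d."""
        self.t += 1
        y = int(sum(wi * (1 - ai) for wi, ai in zip(self.w, a)) <= self.theta)
        p_guess = self.t / (self.t + 1)
        return y, (p_guess if y == 1 else 1 - p_guess)

    def update(self, a, y, x):
        """WINNOW1 update: eliminate on missed positives, promote on false alarms."""
        if y != x:
            for i, ai in enumerate(a):
                if ai == 0:
                    self.w[i] = 0.0 if x == 1 else self.alpha * self.w[i]

    @staticmethod
    def loss(p_one, x):
        """Log-loss suffered on outcome x, given the probability assigned to x = 1."""
        return -math.log(p_one if x == 1 else 1 - p_one)

# One illustrative round: predict, suffer log-loss, then update on the label.
model = WinnowLogLoss(d=5)
a, x = (1, 0, 1, 1, 1), 0
y, p_one = model.predict(a)
print(model.loss(p_one, x))
model.update(a, y, x)
```

Roughly speaking, each of WINNOW1's at most $\alpha |S^*| (\log_\alpha \theta + 1) + d/\theta$ mistakes costs at most $\log(n+1)$ under this probability assignment, while correct guesses are cheap, which is where the multiplicative $\log(n+1)$ factor in Theorem 11 comes from.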
5 Handling k-CNF Boolean functions

Finally, we describe how our techniques can be used to probabilistically predict the output of an unknown k-CNF function. Given a set of d variables $\{x_1, \ldots, x_d\}$, a k-CNF Boolean function is a conjunction of clauses $c_1 \wedge c_2 \wedge \cdots \wedge c_m$, where for $1 \le y \le m$, each clause $c_y$ is a disjunction of $k$ literals, with each literal being an element of $\{x_1, \ldots, x_d, \neg x_1, \ldots, \neg x_d\}$. The number of syntactically distinct clauses is therefore $(2d)^k$. We will use the notation $\mathcal{C}^k_d$ to denote the class of k-CNF Boolean formulas that can be formed from $d$ variables.

The task of probabilistically predicting a k-CNF Boolean function of $d$ variables can be reduced to that of probabilistically predicting a monotone conjunction over a larger space of input variables. We can directly use the same reduction used by Valiant [1984] to show that the class of k-CNF Boolean functions is PAC-learnable. The main idea is to first transform the given side information $a \in \mathbb{B}^d$ into a new Boolean vector $c \in \mathbb{B}^{(2d)^k}$, where each component of $c$ corresponds to the truth value of a distinct $k$-literal clause formed from the set of input variables $\{a^i\}_{i=1}^d$, and then run either Algorithm 1 or Algorithm 2 on this transformed input. In the case of Algorithm 1, this results in an online algorithm where each iteration takes $O(d^k)$ time; given $n$ examples, the algorithm runs in $O(nd^k)$ time and uses $O(nd^k)$ space. Furthermore, if we denote the above process using either Algorithm 1 or Algorithm 2 as $\text{ALG1}^k_d$ or $\text{ALG2}^k_d$ respectively, then Theorems 8 and 9 allow us to upper bound the loss of each approach.

Corollary 12. For all $n \in \mathbb{N}$, for all $k \in \mathbb{N}$, for any sequence of side information $a_{1:n} \in \mathbb{B}^{n \times d}$, if $x_{1:n}$ is generated from a hypothesis $h \in \mathcal{C}^k_d$, the loss of $\text{ALG1}^k_d$ and $\text{ALG2}^k_d$ with respect to $h$ satisfies the upper bounds $L_n(\text{ALG1}^k_d) \le 2^{2k+1} d^{2k}$ and $L_n(\text{ALG2}^k_d) \le (2^k d^k + 1) \log(n + 1)$ respectively.
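To illustrate the reduction, the following minimal Python sketch (ours; the clause encoding and names are illustrative) expands a side-information vector $a \in \mathbb{B}^d$ into one Boolean feature per syntactically distinct $k$-literal clause, recording whether that clause is satisfied by $a$. Since a k-CNF target is exactly a monotone conjunction over these clause features, the monotone-conjunction predictors discussed above can be run directly on the expanded inputs.

```python
from itertools import product

def clause_features(a, k):
    """Valiant-style expansion: one Boolean feature per syntactically distinct
    k-literal clause over the variables of a, i.e. (2d)^k features in total.

    A clause is represented as a k-tuple of literals (i, s), where i indexes a
    variable and s = 1 means the literal is negated; the feature is 1 iff the
    clause (a disjunction) is satisfied by the input a.
    """
    d = len(a)
    literals = [(i, s) for i in range(d) for s in (0, 1)]
    feats = []
    for clause in product(literals, repeat=k):
        satisfied = any((a[i] if s == 0 else 1 - a[i]) for i, s in clause)
        feats.append(int(satisfied))
    return feats

# A k-CNF target over the original variables is a monotone conjunction over
# these features, so a monotone-conjunction predictor (e.g. the WinnowLogLoss
# sketch above) can simply be fed clause_features(a, k) instead of a.
a, k = (1, 0, 1), 2
c = clause_features(a, k)
assert len(c) == (2 * len(a)) ** k
```

For fixed $k$ the expansion has $(2d)^k$ components and can be computed in $O(k(2d)^k)$ time, consistent with the per-iteration cost noted above.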
6 Closing Remarks

This paper has provided three efficient, low-loss online algorithms for probabilistically predicting targets generated by some unknown k-CNF Boolean function of d Boolean variables in time (for fixed k) polynomial in d. The construction of Algorithm 1 is technically interesting in the sense that it is a hybrid Bayesian technique, which performs full Bayesian inference only on the positive examples, with a prior carefully chosen so that the loss suffered on negative examples is kept small. This approach may be potentially useful for more generally applying the ideas behind Bayesian inference or exponential weighted averaging in settings where a direct application would be computationally intractable. The more practical Algorithm 2 is less interpretable, but has $O(d)$ space complexity and a per-instance time complexity of $O(d)$, while enjoying a loss within a multiplicative $\log n$ factor of the intractable Bayesian predictor using a uniform prior. The final method, a derivative of WINNOW, has favorable regret properties when many of the input features are expected to be irrelevant.

In terms of practical utility, we envision our techniques being most useful as a component of a larger predictive ensemble. To give a concrete example, consider the statistical data compression setting, where the cumulative log-loss under some probabilistic model directly corresponds to the size of a file encoded using arithmetic encoding [Witten et al., 1987]. Many strong statistical data compression techniques work by adaptively combining the outputs of many different probabilistic models. For example, the high-performance PAQ compressor uses a technique known as geometric mixing [Mattern, 2013] to combine the outputs of many different contextual models in a principled fashion. Adding one of our techniques to such a predictive ensemble would give it the ability to exploit k-CNF structure in places where it exists.

Acknowledgements

The authors would like to thank the following: Brendan McKay, for providing the proof of Theorem 4; Kee Siong Ng, for the suggestion to investigate the class of k-CNF formulas from an online, probabilistic perspective; Julien Cornebise, for some helpful comments and discussions; and finally the anonymous reviewers, for pointing out the connection to the WINNOW algorithm.

References

[Aji and McEliece, 2000] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[Bratko et al., 2006] Andrej Bratko, Gordon V. Cormack, Bogdan Filipič, Thomas R. Lynam, and Blaž Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research (JMLR), 7:2673–2698, 2006.

[Cilibrasi and Vitányi, 2005] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51:1523–1545, 2005.

[Frank et al., 2000] Eibe Frank, Chang Chui, and Ian H. Witten. Text categorization using compression models. In Proceedings of the Data Compression Conference (DCC), pages 200–209. IEEE Computer Society Press, 2000.

[György et al., 2011] A. György, T. Linder, and G. Lugosi. Efficient Tracking of Large Classes of Experts. IEEE Transactions on Information Theory, 58(11):6709–6725, 2011.

[Koolen et al., 2012] Wouter M. Koolen, Dmitry Adamskiy, and Manfred K. Warmuth. Putting Bayes to sleep. In Neural Information Processing Systems (NIPS), pages 135–143, 2012.

[Littlestone and Warmuth, 1994] Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Algorithm. Information and Computation, 108(2):212–261, February 1994.

[Littlestone, 1988] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[Mattern, 2013] Christopher Mattern. Linear and Geometric Mixtures – Analysis. In Proceedings of the 2013 Data Compression Conference, DCC '13, pages 301–310. IEEE Computer Society, 2013.

[Shamir and Merhav, 1999] Gil I. Shamir and Neri Merhav. Low Complexity Sequential Lossless Coding for Piecewise Stationary Memoryless Sources. IEEE Transactions on Information Theory, 45:1498–1519, 1999.

[Vadhan, 2001] S. P. Vadhan. The complexity of counting in sparse, regular, and planar graphs. SIAM Journal on Computing, 31(2):398–427, 2001.

[Valiant, 1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.

[van Erven et al., 2007] Tim van Erven, Peter Grünwald, and Steven de Rooij. Catching Up Faster in Bayesian Model Selection and Model Averaging. In Neural Information Processing Systems (NIPS), 2007.

[Veness et al., 2011] Joel Veness, Kee Siong Ng, Marcus Hutter, William T. B. Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.

[Veness et al., 2012a] Joel Veness, Kee Siong Ng, Marcus Hutter, and Michael H. Bowling. Context Tree Switching. In Data Compression Conference (DCC), pages 327–336, 2012.

[Veness et al., 2012b] Joel Veness, Peter Sunehag, and Marcus Hutter. On Ensemble Techniques for AIXI Approximation. In AGI, pages 341–351, 2012.

[Veness et al., 2013] J. Veness, M. White, M. Bowling, and A. György. Partition Tree Weighting. In Data Compression Conference (DCC), pages 321–330, 2013.
[Veness et al., 2015] Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins. Compress and control. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25–30, 2015, Austin, Texas, USA, pages 3016–3023, 2015.

[Willems and Krom, 1997] F. Willems and M. Krom. Live-and-die coding for binary piecewise i.i.d. sources. In IEEE International Symposium on Information Theory (ISIT), page 68, 1997.

[Willems et al., 1995] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The Context Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory, 41:653–664, 1995.

[Willems, 1996] Frans M. J. Willems. Coding for a binary independent piecewise-identically-distributed source. IEEE Transactions on Information Theory, 42:2210–2217, 1996.

[Witten et al., 1987] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, June 1987.