# Online Learning of k-CNF Boolean Functions

Joel Veness¹, Marcus Hutter², Laurent Orseau¹, Marc Bellemare¹
¹Google DeepMind, ²Australian National University
{veness,lorseau,bellemare}@google.com, marcus.hutter@anu.edu.au

This paper revisits the problem of learning a k-CNF Boolean function from examples, for fixed k, in the context of online learning under the logarithmic loss. We give a Bayesian interpretation to one of Valiant's classic PAC learning algorithms, which we then build upon to derive three efficient, online, probabilistic, supervised learning algorithms for predicting the output of an unknown k-CNF Boolean function. We analyze the loss of our methods, and show that the cumulative log-loss can be upper bounded by a polynomial function of the size of each example.

1 Introduction

In 1984, Leslie Valiant introduced the notion of Probably Approximately Correct (PAC) learnability, and gave three important examples of non-trivial concept classes that can be PAC learnt given nothing more than a sequence of positive examples drawn from an arbitrary IID distribution [Valiant, 1984]. One of these examples was the class of k-CNF Boolean functions, for fixed k. Valiant's approach relied on a polynomial time reduction of this problem to that of PAC learning the class of monotone conjunctions. In this paper, we revisit the problem of learning monotone conjunctions from the perspective of universal source coding, or equivalently, online learning under the logarithmic loss. In particular, we derive three new online, probabilistic prediction algorithms that: (i) learn from both positive and negative examples; (ii) avoid making IID assumptions; (iii) suffer low logarithmic loss for arbitrary sequences of examples; and (iv) run in polynomial time and space. This work is intended to complement previous work on concept identification [Valiant, 1984] and online learning under the 0/1 loss [Littlestone, 1988; Littlestone and Warmuth, 1994].

The main motivation for investigating online learning under the logarithmic loss is the fundamental role it plays within information theoretic applications. In particular, we are interested in prediction methods that satisfy the following "power" desiderata, i.e. methods which: (p) make probabilistic predictions; (o) are strongly online; (w) work well in practice; (e) are efficient; and (r) have well understood regret/loss properties. Methods satisfying these properties can be used in a number of principled and interesting ways: for example, data compression via arithmetic encoding [Witten et al., 1987], compression-based clustering [Cilibrasi and Vitányi, 2005] or classification [Frank et al., 2000; Bratko et al., 2006], and information theoretic reinforcement learning [Veness et al., 2011; 2015]. Furthermore, it is possible to combine such online, log-loss predictors using various ensemble methods [Veness et al., 2012b; Mattern, 2013]. The ability to rapidly exploit deterministic underlying structure such as k-CNF relations has the potential to improve all the aforementioned application areas, and brings the universal source coding literature in line with developments originating from the machine learning community.

Our contribution in this paper stems from noticing that Valiant's method can be interpreted as a kind of MAP model selection procedure with respect to a particular family of priors.
In particular, we show that given n positive examples and their associated d-dimensional binary input vectors, it is possible to perform exact Bayesian inference over the $2^d$ possible monotone conjunction hypotheses in time $O(nd)$ and space $O(d)$, without making IID assumptions. Unfortunately, these desirable computational properties do not extend to the case where both positive and negative examples are presented; we show that in this case exact inference is #P-complete. This result motivated us to develop a hybrid algorithm, which uses a combination of Bayesian inference and memorization to construct a polynomial time algorithm whose loss is bounded by $O(d^2)$ for the class of monotone conjunctions. Furthermore, we show how to trade constant loss for logarithmic cumulative loss to get a more practical algorithm, whose loss is bounded by $O(d \log n)$. We also give an alternative method, based on the WINNOW algorithm [Littlestone, 1988] for the 0/1 loss, which has better theoretical properties in cases where many of the d Boolean inputs are irrelevant. Finally, similarly to Valiant, we describe how to combine our algorithms with a reduction that (for fixed k) enables the efficient learning of k-CNF Boolean functions from examples.

2 Preliminaries

Notation. A Boolean variable $x$ is an element of $\mathbb{B} := \{\bot, \top\} = \{0, 1\}$. We identify false with 0 and true with 1, since this allows us to use Boolean functions as likelihood functions for deterministically generated data. We keep the Boolean operator notation whenever it is more suggestive. The unary not operator is denoted by $\neg$, and is defined by $\neg : 0 \mapsto 1;\ 1 \mapsto 0$ (so $\neg x = 1 - x$). The binary conjunction and disjunction operators are denoted by $\wedge$ and $\vee$ respectively, and are given by the maps $\wedge : (1,1) \mapsto 1$, and 0 otherwise ($x \wedge y = x \cdot y$), and $\vee : (0,0) \mapsto 0$, and 1 otherwise ($x \vee y = \max\{x, y\}$). A literal is a Boolean variable $x$ or its negation $\neg x$; a positive literal is a non-negated Boolean variable. A clause is a finite disjunction of literals. A monotone conjunction is a conjunction of zero or more positive literals. For example, $x_1 \wedge x_3 \wedge x_6$ is a monotone conjunction, while $x_1 \wedge \neg x_3$ is not. We adopt the usual convention with conjunctions of defining the zero-literal case to be vacuously true. The power set of a set $S$ is the set of all subsets of $S$, and will be denoted by $\mathcal{P}(S)$. For convenience, we further define $\mathcal{P}_d := \mathcal{P}(\{1, 2, \ldots, d\})$. We also use the Iverson bracket notation $\llbracket P \rrbracket$, which, given a predicate $P$, evaluates to 1 if $P$ is true and 0 otherwise. We also use the notation $x_{1:n}$ for the sequence $x_1 x_2 \cdots x_n$, and $x_{<n}$ for $x_{1:n-1}$.

... > 0 for each of the $2^d$ possible target strings, which implies that $L_{2^d}(\rho) \ge d$.

Memorization. As a further motivating example, it is instructive to compare the exact Bayesian predictor to that of a naive method for learning monotone conjunctions that simply memorizes the training instances, without exploiting the logical structure within the class. To this end, consider the sequential predictor that assigns a probability of $m_d(x_n \mid x_{<n}; a_{1:n})$ to each possible next target $x_n$.

Now consider a prior over the index set $S$ of the target conjunction under which each of the $d$ variables is included independently with probability $\alpha$, with the expected formula length being $\alpha d$. If we denote the prior over $S$ by $w_\alpha(S) := \alpha^{|S|}(1 - \alpha)^{d - |S|}$, we get the mixture $\xi^\alpha_d(x_{1:n}; a_{1:n}) = \sum_{S \in \mathcal{P}_d} w_\alpha(S)\, \nu_S(x_{1:n}; a_{1:n})$, where $\nu_S(x_{1:n}; a_{1:n}) := \prod_{t=1}^n \llbracket x_t = \bigwedge_{i \in S} a^i_t \rrbracket$ is the deterministic likelihood of the monotone conjunction over the variables indexed by $S$. From this we can read off the maximum a posteriori (MAP) model $S^*_n := \arg\max_{S \in \mathcal{P}_d} w_\alpha(S \mid x_{1:n}; a_{1:n}) = \arg\max_{S \in \mathcal{P}_d} w_\alpha(S)\, \nu_S(x_{1:n} \mid a_{1:n})$ under various choices of $\alpha$.
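To make the mixture concrete, the following minimal Python sketch (ours, for illustration; the function names and the brute-force check are not from the paper) computes $\xi^\alpha_d$ by direct summation over all $2^d$ subsets and, for the all-positive case, via the closed form $(1-\alpha)^{d - |C_n|}$, where $C_n$ denotes the set of variables that are true in every example seen so far. The closed form follows from the prior factorizing over variables, and is one way to see how prediction from positive examples can be carried out in $O(d)$ time per step.

```python
from itertools import product

def nu(S, xs, As):
    """Deterministic likelihood: 1 iff the monotone conjunction over the
    index set S is consistent with every (target, input) pair."""
    return all(x == all(a[i] for i in S) for x, a in zip(xs, As))

def xi_bruteforce(alpha, d, xs, As):
    """xi^alpha_d by summing over all 2^d subsets (exponential; tiny d only)."""
    total = 0.0
    for bits in product([0, 1], repeat=d):
        S = [i for i in range(d) if bits[i]]
        prior = alpha ** len(S) * (1 - alpha) ** (d - len(S))
        total += prior * nu(S, xs, As)
    return total

def xi_positive_closed_form(alpha, d, As):
    """For all-positive targets the mixture collapses to (1 - alpha)^(d - |C_n|),
    where C_n is the set of variables that are true in every example so far."""
    C = set(range(d))
    for a in As:
        C &= {i for i in range(d) if a[i]}
    return (1 - alpha) ** (d - len(C))

# Tiny consistency check (illustrative only).
d, alpha = 4, 0.5
As = [(1, 1, 0, 1), (1, 1, 1, 0)]
xs = [1, 1]
assert abs(xi_bruteforce(alpha, d, xs, As) - xi_positive_closed_form(alpha, d, As)) < 1e-12
```

The resulting one-step predictive probability that the next example is also positive is $(1-\alpha)^{|C_n| - |C_{n+1}|}$, i.e. $(1-\alpha)$ raised to the number of variables newly eliminated from $C_n$ by the latest input, which only requires maintaining $C_n$ in $O(d)$ space.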
For positive examples (i.e. $x_{1:n} = 1_{1:n}$), the MAP model can be rewritten as $S^*_n = \arg\max_{S \in \mathcal{P}_d} w_\alpha(S)\, \nu_S(1_{1:n}; a_{1:n})$. For $\alpha > 1/2$, the MAP model $S^*_n$ at time $n$ is unique, and is given by

$$S^*_n = \{\, i \in \{1, \ldots, d\} : a^i_t = 1 \text{ for all } 1 \le t \le n \,\}. \quad (4)$$

For $\alpha = 1/2$, a MAP model is any subset of $S^*_n$ as defined by Equation 4. For $\alpha < 1/2$, the MAP model is $\{\}$. Finally, we remark that the above results allow for a Bayesian interpretation of Valiant's algorithm for PAC learning monotone conjunctions. His method, described in Section 5 of [Valiant, 1984], after seeing $n$ positive examples, outputs the concept $\bigwedge_{i \in S^*_n} x_i$; in other words, his method can be interpreted as doing MAP model selection using a prior belonging to the above family with $\alpha > 1/2$.

A Heuristic Predictor. Next we discuss a heuristic prediction method that incorporates Proposition 6 to efficiently perform Bayesian learning on only the positive examples, via a probabilistic predictor $\xi^+_d(x_n \mid x_{<n}; a_{1:n})$.

The original WINNOW1 algorithm [Littlestone, 1988] is parametrized by $\alpha > 1$ and $\theta \ge 1/\alpha$. For example, by setting $\theta = d/2$ and $\alpha = 2$, one can bound the number of mistakes by $2|S^*| \log d + 2$. This particular parametrization is well suited to the case where $|S^*| \ll d$; that is, whenever many features are irrelevant. Here we describe an adaptation of this method to the logarithmic loss, along with a worst-case analysis. Although the resultant algorithm will not enjoy as strong a loss guarantee as Algorithm 2 in general, one would prefer its guarantees in situations where $|S^*| \ll d$. The main idea is to assign probability $t/(t+1)$ at time $t$ to the class predicted by WINNOW1 (as applied to monotone conjunctions). Pseudocode for this procedure is given in Algorithm 3.

```
Algorithm 3  ω_d(x_{1:n}; a_{1:n})
Require: α, θ ∈ ℝ such that α > 1, θ ≥ 1/α
 1: w_i ← 1 for 1 ≤ i ≤ d;  r ← 1
 2: for t = 1 to n do
 3:   Observe a_t
 4:   y_t ← ⟦ Σ_{i=1}^d w_i (1 − a_t^i) ≤ θ ⟧
 5:   p_t(y_t; a_t) := t/(t + 1)
 6:   p_t(1 − y_t; a_t) := 1/(t + 1)
 7:   Observe x_t and suffer a loss of −log p_t(x_t; a_t)
 8:   if y_t ≠ x_t then
 9:     if x_t = 1 then
10:       w_i ← 0 if a_t^i = 0, for 1 ≤ i ≤ d
11:     else
12:       w_i ← α·w_i if a_t^i = 0, for 1 ≤ i ≤ d
13:     end if
14:   end if
15:   r ← p_t(x_t; a_t) · r
16: end for
17: return r
```

The following theorem bounds the cumulative loss. Compared with Theorem 9, here we see that a multiplicative dependence on $O(|S^*| \log d)$ is introduced in place of the previous $O(d)$, which is preferred whenever $|S^*| \ll d$.

Theorem 11. If $x_{1:n}$ is generated by a hypothesis $h_{S^*}$ such that $S^* \in \mathcal{P}_d$, then for all $n \in \mathbb{N}$, for all $d \in \mathbb{N}$, for all $x_{1:n} \in \mathbb{B}^n$, for all $a_{1:n} \in \mathbb{B}^{n \times d}$, we have that

$$L_n(\omega_d) \le \left[ \alpha |S^*| (\log_\alpha \theta + 1) + \tfrac{d}{\theta} + 1 \right] \log(n + 1).$$

In particular, if $\theta = d/2$ and $\alpha = 2$ then $L_n(\omega_d) \le (2|S^*| \log d + 2) \log(n + 1)$.

Proof. Let $M_n$ denote the set of times where the original WINNOW1 algorithm would make a mistaken prediction, i.e. $M_n := \{t \in [1, n] : y_t \ne x_t\}$. Also let $\overline{M}_n := [1, n] \setminus M_n$, and write $k := |S^*|$. We can now bound the cumulative loss by

$$L_n(\omega_d) = -\log \omega_d(x_{1:n}; a_{1:n}) \stackrel{(a)}{=} -\log \prod_{t \in M_n} \frac{1}{t+1} \prod_{t \in \overline{M}_n} \frac{t}{t+1} \stackrel{(b)}{\le} -\log \left[ \left( \frac{1}{n+1} \right)^{\alpha k (\log_\alpha \theta + 1) + d/\theta} \prod_{t=1}^{\,n - \alpha k (\log_\alpha \theta + 1) - d/\theta} \frac{t}{t+1} \right]$$

$$= \left[ \alpha k (\log_\alpha \theta + 1) + d/\theta \right] \log(n + 1) + \log\!\left( n - \alpha k (\log_\alpha \theta + 1) - d/\theta + 1 \right) \le \left[ \alpha k (\log_\alpha \theta + 1) + d/\theta + 1 \right] \log(n + 1).$$

Step (a) follows from the definition of Algorithm 3 and $M_n$. Step (b) applies both Equation 14 and the fact that $1/(n+1) \le 1/(t+1)$ for all $1 \le t \le n$.
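As a running-code companion to the pseudocode above, here is a minimal Python sketch reflecting our reading of Algorithm 3: WINNOW1, applied to monotone conjunctions, supplies a deterministic guess $y_t$, and the wrapper assigns probability $t/(t+1)$ to that guess and $1/(t+1)$ to the alternative. The class name and interface are illustrative rather than taken from the paper.

```python
import math

class WinnowLogLoss:
    """Log-loss wrapper around WINNOW1 for monotone conjunctions (cf. Algorithm 3).

    The conjunction is guessed true iff the weighted count of false inputs is
    at most theta; the guess at round t is assigned probability t/(t+1).
    """

    def __init__(self, d, alpha=2.0, theta=None):
        assert alpha > 1.0
        self.theta = d / 2 if theta is None else theta
        assert self.theta >= 1.0 / alpha
        self.alpha = alpha
        self.w = [1.0] * d
        self.t = 0  # number of rounds seen so far

    def predict(self, a):
        """Return (guess y_t, probability assigned to x_t = 1) for a in {0,1}^d."""
        self.t += 1
        y = int(sum(wi * (1 - ai) for wi, ai in zip(self.w, a)) <= self.theta)
        p_guess = self.t / (self.t + 1)
        return y, (p_guess if y == 1 else 1 - p_guess)

    def update(self, a, y, x):
        """WINNOW1 update: eliminate on missed positives, promote on false alarms."""
        if y != x:
            for i, ai in enumerate(a):
                if ai == 0:
                    self.w[i] = 0.0 if x == 1 else self.alpha * self.w[i]

    @staticmethod
    def loss(p_one, x):
        """Log-loss suffered on outcome x, given the probability assigned to x = 1."""
        return -math.log(p_one if x == 1 else 1 - p_one)

# One illustrative round: predict, suffer log-loss, then update on the label.
model = WinnowLogLoss(d=5)
a, x = (1, 0, 1, 1, 1), 0
y, p_one = model.predict(a)
print(model.loss(p_one, x))
model.update(a, y, x)
```

Roughly speaking, each of WINNOW1's at most $\alpha |S^*| (\log_\alpha \theta + 1) + d/\theta$ mistakes costs at most $\log(n+1)$ under this probability assignment, while correct guesses are cheap, which is where the multiplicative $\log(n+1)$ factor in Theorem 11 comes from.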
5 Handling k-CNF Boolean functions

Finally, we describe how our techniques can be used to probabilistically predict the output of an unknown k-CNF function. Given a set of d variables $\{x_1, \ldots, x_d\}$, a k-CNF Boolean function is a conjunction of clauses $c_1 \wedge c_2 \wedge \cdots \wedge c_m$, where for $1 \le y \le m$, each clause $c_y$ is a disjunction of $k$ literals, with each literal being an element of $\{x_1, \ldots, x_d, \neg x_1, \ldots, \neg x_d\}$. The number of syntactically distinct clauses is therefore $(2d)^k$. We will use the notation $\mathcal{C}^k_d$ to denote the class of k-CNF Boolean formulas that can be formed from $d$ variables.

The task of probabilistically predicting a k-CNF Boolean function of $d$ variables can be reduced to that of probabilistically predicting a monotone conjunction over a larger space of input variables. We can directly use the same reduction used by Valiant [1984] to show that the class of k-CNF Boolean functions is PAC-learnable. The main idea is to first transform the given side information $a \in \mathbb{B}^d$ into a new Boolean vector $c \in \mathbb{B}^{(2d)^k}$, where each component of $c$ corresponds to the truth value of a distinct $k$-literal clause formed from the set of input variables $\{a^i\}_{i=1}^d$, and then run either Algorithm 1 or Algorithm 2 on this transformed input. In the case of Algorithm 1, this results in an online algorithm where each iteration takes $O(d^k)$ time; given $n$ examples, the algorithm runs in $O(nd^k)$ time and uses $O(nd^k)$ space. Furthermore, if we denote the above process using either Algorithm 1 or Algorithm 2 as $\text{ALG1}^k_d$ or $\text{ALG2}^k_d$ respectively, then Theorems 8 and 9 allow us to upper bound the loss of each approach.

Corollary 12. For all $n \in \mathbb{N}$, for all $k \in \mathbb{N}$, for any sequence of side information $a_{1:n} \in \mathbb{B}^{n \times d}$, if $x_{1:n}$ is generated from a hypothesis $h \in \mathcal{C}^k_d$, the loss of $\text{ALG1}^k_d$ and $\text{ALG2}^k_d$ with respect to $h$ satisfies the upper bounds $L_n(\text{ALG1}^k_d) \le 2^{2k+1} d^{2k}$ and $L_n(\text{ALG2}^k_d) \le (2^k d^k + 1) \log(n + 1)$ respectively.
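To illustrate the reduction, the following minimal Python sketch (ours; the clause encoding and names are illustrative) expands a side-information vector $a \in \mathbb{B}^d$ into one Boolean feature per syntactically distinct $k$-literal clause, recording whether that clause is satisfied by $a$. Since a k-CNF target is exactly a monotone conjunction over these clause features, the monotone-conjunction predictors discussed above can be run directly on the expanded inputs.

```python
from itertools import product

def clause_features(a, k):
    """Valiant-style expansion: one Boolean feature per syntactically distinct
    k-literal clause over the variables of a, i.e. (2d)^k features in total.

    A clause is represented as a k-tuple of literals (i, s), where i indexes a
    variable and s = 1 means the literal is negated; the feature is 1 iff the
    clause (a disjunction) is satisfied by the input a.
    """
    d = len(a)
    literals = [(i, s) for i in range(d) for s in (0, 1)]
    feats = []
    for clause in product(literals, repeat=k):
        satisfied = any((a[i] if s == 0 else 1 - a[i]) for i, s in clause)
        feats.append(int(satisfied))
    return feats

# A k-CNF target over the original variables is a monotone conjunction over
# these features, so a monotone-conjunction predictor (e.g. the WinnowLogLoss
# sketch above) can simply be fed clause_features(a, k) instead of a.
a, k = (1, 0, 1), 2
c = clause_features(a, k)
assert len(c) == (2 * len(a)) ** k
```

For fixed $k$ the expansion has $(2d)^k$ components and can be computed in $O(k(2d)^k)$ time, consistent with the per-iteration cost noted above.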
6 Closing Remarks

This paper has provided three efficient, low-loss online algorithms for probabilistically predicting targets generated by some unknown k-CNF Boolean function of d Boolean variables in time (for fixed k) polynomial in d. The construction of Algorithm 1 is technically interesting in the sense that it is a hybrid Bayesian technique, which performs full Bayesian inference only on the positive examples, with a prior carefully chosen so that the loss suffered on negative examples is kept small. This approach may be potentially useful for more generally applying the ideas behind Bayesian inference or exponential weighted averaging in settings where a direct application would be computationally intractable. The more practical Algorithm 2 is less interpretable, but has $O(d)$ space complexity and a per-instance time complexity of $O(d)$, while enjoying a loss within a multiplicative $\log n$ factor of the intractable Bayesian predictor using a uniform prior. The final method, a derivative of WINNOW, has favorable regret properties when many of the input features are expected to be irrelevant.

In terms of practical utility, we envision our techniques being most useful as a component of a larger predictive ensemble. To give a concrete example, consider the statistical data compression setting, where the cumulative log-loss under some probabilistic model directly corresponds to the size of a file encoded using arithmetic encoding [Witten et al., 1987]. Many strong statistical data compression techniques work by adaptively combining the outputs of many different probabilistic models. For example, the high-performance PAQ compressor uses a technique known as geometric mixing [Mattern, 2013] to combine the outputs of many different contextual models in a principled fashion. Adding one of our techniques to such a predictive ensemble would give it the ability to exploit k-CNF structure in places where it exists.

Acknowledgements

The authors would like to thank the following: Brendan McKay, for providing the proof of Theorem 4; Kee Siong Ng, for the suggestion to investigate the class of k-CNF formulas from an online, probabilistic perspective; Julien Cornebise, for some helpful comments and discussions; and finally the anonymous reviewers, for pointing out the connection to the WINNOW algorithm.

References

[Aji and McEliece, 2000] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[Bratko et al., 2006] Andrej Bratko, Gordon V. Cormack, Bogdan Filipič, Thomas R. Lynam, and Blaž Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research (JMLR), 7:2673–2698, 2006.

[Cilibrasi and Vitányi, 2005] Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51:1523–1545, 2005.

[Frank et al., 2000] Eibe Frank, Chang Chui, and Ian H. Witten. Text categorization using compression models. In Proceedings of the Data Compression Conference (DCC), pages 200–209. IEEE Computer Society Press, 2000.

[György et al., 2011] A. György, T. Linder, and G. Lugosi. Efficient Tracking of Large Classes of Experts. IEEE Transactions on Information Theory, 58(11):6709–6725, 2011.

[Koolen et al., 2012] Wouter M. Koolen, Dmitry Adamskiy, and Manfred K. Warmuth. Putting Bayes to sleep. In Neural Information Processing Systems (NIPS), pages 135–143, 2012.

[Littlestone and Warmuth, 1994] Nick Littlestone and Manfred K. Warmuth. The Weighted Majority Algorithm. Information and Computation, 108(2):212–261, February 1994.

[Littlestone, 1988] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[Mattern, 2013] Christopher Mattern. Linear and Geometric Mixtures – Analysis. In Proceedings of the 2013 Data Compression Conference, DCC '13, pages 301–310. IEEE Computer Society, 2013.

[Shamir and Merhav, 1999] Gil I. Shamir and Neri Merhav. Low Complexity Sequential Lossless Coding for Piecewise Stationary Memoryless Sources. IEEE Transactions on Information Theory, 45:1498–1519, 1999.

[Vadhan, 2001] S. P. Vadhan. The complexity of counting in sparse, regular, and planar graphs. SIAM Journal on Computing, 31(2):398–427, 2001.

[Valiant, 1984] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.

[van Erven et al., 2007] Tim van Erven, Peter Grünwald, and Steven de Rooij. Catching Up Faster in Bayesian Model Selection and Model Averaging. In Neural Information Processing Systems (NIPS), 2007.

[Veness et al., 2011] Joel Veness, Kee Siong Ng, Marcus Hutter, William T. B. Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research (JAIR), 40:95–142, 2011.

[Veness et al., 2012a] Joel Veness, Kee Siong Ng, Marcus Hutter, and Michael H. Bowling. Context Tree Switching. In Data Compression Conference (DCC), pages 327–336, 2012.

[Veness et al., 2012b] Joel Veness, Peter Sunehag, and Marcus Hutter. On Ensemble Techniques for AIXI Approximation. In AGI, pages 341–351, 2012.

[Veness et al., 2013] J. Veness, M. White, M. Bowling, and A. György. Partition Tree Weighting. In Data Compression Conference (DCC), pages 321–330, 2013.
[Veness et al., 2015] Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins. Compress and control. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25–30, 2015, Austin, Texas, USA, pages 3016–3023, 2015.

[Willems and Krom, 1997] F. Willems and M. Krom. Live-and-die coding for binary piecewise i.i.d. sources. In IEEE International Symposium on Information Theory (ISIT), page 68, 1997.

[Willems et al., 1995] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The Context Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory, 41:653–664, 1995.

[Willems, 1996] Frans M. J. Willems. Coding for a binary independent piecewise-identically-distributed source. IEEE Transactions on Information Theory, 42:2210–2217, 1996.

[Witten et al., 1987] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–540, June 1987.