# Lifted Inference for Convex Quadratic Programs

Martin Mladenov, TU Dortmund University, Germany, martin.mladenov@cs.tu-dortmund.de
Leonard Kleinhans, TU Dortmund University, Germany, leonard.kleinhans@cs.tu-dortmund.de
Kristian Kersting, TU Dortmund University, Germany, kristian.kersting@cs.tu-dortmund.de

Symmetry is the essential element of lifted inference, which has recently demonstrated the possibility of performing very efficient inference in highly connected but symmetric probabilistic models. This raises the question of whether the same holds for optimization problems in general. Here we show that for a large class of optimization methods this is actually the case. Specifically, we introduce the concept of fractional symmetries of convex quadratic programs (QPs), which lie at the heart of many AI and machine learning approaches, and exploit it to lift, i.e., to compress QPs. These lifted QPs can then be tackled with the usual optimization toolbox (off-the-shelf solvers, cutting plane algorithms, stochastic gradients, etc.). If the original QP exhibits symmetry, then the lifted one will generally be more compact, and hence more efficient to solve.

Introduction

Convex optimization is arguably one of the main motors behind Artificial Intelligence (AI), as it enables inference and learning in a wide variety of AI models, such as SVMs, LASSO, and efficient approximations (e.g., variational approaches, convex NMF) to hard inference tasks. The language in which convex optimization problems are specified includes inequalities as well as matrix and tensor algebra, and software packages for convex optimization such as CVXPY (Diamond, Chu, and Boyd 2014) recreate this language as an interface between the user and the solver. Unfortunately, a purely algebraic language has one shortcoming: it is difficult if not impossible for the non-expert to directly make use of the discrete, combinatorial structure often underlying convex programs: pixels depend only on neighboring pixels; the reward of placing a cup on a table does not depend on whether the window in the next room is open. Having a richer representation such as first-order logic to express the combinatorial structure, together with an automatic way to utilize it in the solver, is however likely to extend the reach and efficiency of AI. This has been demonstrated by statistical relational learning (SRL), which has argued in favor of first-order languages when dealing with complex graphical models, see e.g. (De Raedt et al. 2016) for a recent overview. Due to the high-level nature of relational probabilistic languages, the low-level (ground) model they produce might often contain redundancies in terms of symmetries. Lifted probabilistic inference approaches (Poole 2003) exploit these symmetries to perform very efficient inference in highly connected (and hence otherwise often intractable for traditional inference approaches) but symmetric models. Intuitively, one infers which variables are indistinguishable in the ground model (if possible without actually grounding) and solves the model treating the indistinguishable variables as groups instead of individuals, thereby reducing the dimensionality of the model. Unfortunately, SRL does not support convex quadratic programs. Here, we demonstrate that the core idea of SRL can be transferred to convex quadratic programming.
As our main contribution, we formalize the notion of symmetries of convex quadratic programs (QPs). Specifically, we first show that, unlike for graphical models, where the notion of indistinguishability of variables is that of exact symmetry (automorphisms of the factor graph), QPs admit a weaker notion of indistinguishability (yielding partitions of indistinguishable variables that are at least as coarse), called a fractional automorphism (FA) resp. equitable partition (EP), computable in quasi-linear time. This implies that more general lifted inference rules for QPs can be designed. This is surprising, as it was believed that FAs apply only to linear equations. Second, we investigate geometrically how FAs of quadratic forms arise. The existing theory of symmetry in convex quadratic forms states that an automorphism of $x^T Q x$ corresponds to a rotational symmetry of the semidefinite factors of $Q$. We generalize this in that an FA of $x^T Q x$ can be related not only to rotations, but also to certain scalings. This then results in the first approximate FA approach based on standard clustering techniques and whitening. Finally, we tackle the question to which extent kernels might preserve fractional symmetry. All this is embedded in the first relational QP language as illustrated in Fig. 1(left), which is not discussed in detail due to space limitations.

In doing so, the present work is the first that introduces relational convex QPs and studies their symmetries. Indeed, there are symmetry-breaking branch-and-bound approaches for (mixed-)integer programming (Margot 2010) that are also featured by commercial solvers. Convex QPs, however, do not feature branch-and-bound solvers. For the special fragment of LPs, (Kersting, Mladenov, and Tokmakov 2015) have introduced a relational language and shown how to exploit fractional symmetries. (Relaxed) graph automorphisms and variants have been explored for graph kernels (Shervashidze and Borgwardt 2009) and (I)LP-MAP inference approaches (Bui, Huynh, and Riedel 2013; Mladenov, Globerson, and Kersting 2014; Jernite, Rush, and Sontag 2015). Unfortunately, their proofs do not carry over to (convex) QPs.

```
# logical query for linking papers
linked(I1, I2) = label(I1) & query(I2) & (cite(I1, I2) | cite(I2, I1))
# inline definitions
slacks = sum{I in labeled(I)} slack(I);
coslacks = sum{I1, I2 in linked(I1, I2)} slack(I1,I2)
# QUADRATIC OBJECTIVE, the main novelty compared to [Kersting et al., 2015]
minimize: sum{J in feature(I,J)} weight(J)**2 + c1 * slack + c2 * coslack;
subject to forall {I in labeled(I)}: labeled(I)*predict(I) >= 1 - slack(I); # correct prediction
subject to forall {I in labeled(I)}: slack(I) >= 0; # slacks are positive
# TRANSDUCTIVE PART: cited instances should have the same labels.
subject to forall {I1, I2 in linked(I1, I2)}: labeled(I1) * predict(I2) >= 1 - slack(I1, I2);
subject to forall {I1, I2 in linked(I1, I2)}: coslack(I1, I2) >= 0; # coslacks are positive
```

Figure 1: Illustration of the relational QP language, a novel extension of the relational LP language due to (Kersting, Mladenov, and Tokmakov 2015). (Left) A relational SVM (TC-QP-SVM) for collective classification of objects with relations among them, here scientific papers with citations; the listing above shows the specification. No relational kernel is used, just a linear one with relational constraints taking the citation links into account. (Right) QPs outperform LPs, in particular when fewer labels are observed. (Best viewed in color)
(Güler and Gürtuna 2012) and the references therein have studied automorphisms, but not fractional ones, of convex sets. Finally, several expressive modeling languages for mathematical programming have been proposed, see e.g. (Wallace and Ziemba 2005) for an overview. They are mixtures of declarative and imperative programming styles using sets of objects to index multidimensional parameters and variables. Recently, (Diamond, Chu, and Boyd 2014) enabled an object-oriented approach to constructing optimization problems. None of them provide integrated capabilities with relational logic, not to speak of lifting. The need for relational mathematical programming languages is witnessed e.g. in natural language processing (Yih and Roth 2007; Riedel, Smith, and McCallum 2012) and the recent push to marry statistical analytic frameworks like R and Python with relational databases (Ré et al. 2015).

We proceed as follows. We start off by developing automorphisms of QPs, introducing the required background on the fly. Then, we generalize this to fractional symmetries. Before concluding, we illustrate our theoretical results empirically.

Exact Symmetries of Convex QPs

Let us start off with exact symmetries of convex QPs. Lifting a convex quadratic program essentially amounts to reducing the size of a model by grouping together indistinguishable variables and constraints. In other words, lifting exploits symmetries. To formalize the notion of lifting more concisely, let us consider a convex program, i.e., an optimization problem of the form

$x^* \in \arg\min_{x \in D} J(x)$,   (*)

over $x \in \mathbb{R}^n$, where $J : \mathbb{R}^n \to \mathbb{R}$ is a convex function and $D$ is a subset of $\mathbb{R}^n$, typically specified as the solution set of a system of convex inequalities $f_1(x) \le 0, \ldots, f_m(x) \le 0$. A convex quadratic program (QP) is an instance of (*) where $J(x) = x^T Q x + c^T x$ is a quadratic function, $Q \in \mathbb{R}^{n \times n}$ is symmetric and positive semidefinite, and $D$ is a convex subset of $\mathbb{R}^n$ specified by a system of linear inequalities, $D = \{x : Ax \le b\}$. If $Q$ is the zero matrix, the problem is known as a linear program (LP). If we add convex quadratic constraints to a quadratic program, we obtain a quadratically constrained quadratic program (QCQP). We will not deal explicitly with QCQPs in this paper; however, by the end of our discussion of quadratic functions, it will be evident that our results can easily be extended to such programs. We shall denote a QP by the tuple $\mathrm{QP} = (Q, c, A, b)$.

We are now interested in partitioning the variables of the program by a partition $P = \{P_1, \ldots, P_p\}$, $P_i \cap P_j = \emptyset$, $\bigcup_i P_i = \{x_1, \ldots, x_n\}$, such that there exists at least one solution that respects the partition. More formally, $P$ is a lifting partition of (*) if (*) admits an optimal solution with $x_i = x_j$ whenever $x_i$ and $x_j$ are in the same class of $P$. We call the linear subspace defined by the latter condition $R_P$. Having obtained a lifting partition of the QP a priori, we can restrict the solution space to $D \cap R_P$. That is, we constrain indistinguishable variables to be equal, knowing that at least one solution will be preserved in this space of lower dimension. Since ground variables of the same class are now equal, they can be replaced with a single aggregated (lifted) variable. The resulting lifted problem has one variable per equivalence class; thus, if the lifting partition is coarse enough, significant dimensionality reduction and run-time savings can be achieved. To recover a ground solution, one assigns the value of the lifted variable to every ground variable in its class.
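To make these operations concrete, the following minimal numpy sketch (ours, not part of the paper) builds a partition matrix, class-averages a point, and broadcasts a lifted solution back to the ground variables; the helper name `partition_matrix`, the toy partition, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def partition_matrix(classes, n):
    """Partition matrix X_P: (X_P)_ij = 1/|class(i)| if j is in the class of i, else 0."""
    X = np.zeros((n, n))
    for cls in classes:
        X[np.ix_(cls, cls)] = 1.0 / len(cls)
    return X

# A toy partition of four variables into two classes (0-based indices).
P = [[0, 2], [1, 3]]
X_P = partition_matrix(P, 4)

# Averaging a feasible point over the classes is multiplication by X_P.
x0 = np.array([2.0, 1.0, 1.0, 2.0])
x0_avg = X_P @ x0                                   # -> [1.5, 1.5, 1.5, 1.5]

# The stated properties of partition matrices.
assert np.allclose(X_P @ np.ones(4), np.ones(4))    # doubly stochastic
assert np.allclose(X_P, X_P.T)                      # symmetric
assert np.allclose(X_P @ X_P, X_P)                  # idempotent

# Recovering a ground solution from a lifted one (one value per class).
x_lifted = np.array([1.5, 1.5])
x_ground = np.empty(4)
for value, cls in zip(x_lifted, P):
    x_ground[cls] = value                           # broadcast to all members of the class
```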
One way to demonstrate that a given partition $P$ is a lifting partition for (*) is by showing that averaging any feasible $x$ over the partition classes (i.e., $\tilde{x}_i = \frac{1}{|\mathrm{class}(x_i)|} \sum_{x_j \in \mathrm{class}(x_i)} x_j$) yields a new feasible $\tilde{x}$ with $J(\tilde{x}) \le J(x)$. Theorem 1, which we will prove later on, provides sufficient conditions for this. As a consequence, by averaging any optimal solution we get another optimal solution which respects $P$, implying that $P$ is a lifting partition. Please note that if there is only one solution and it has different values in all coordinates, then this certifies that there are no symmetries. In any case, one bit of notation that is handy in the analysis of averaging operations is the partition matrix. To any partition $P$ we can associate a matrix $X_P \in \mathbb{Q}^{n \times n}$ such that $(X_P)_{ij} = 1/|\mathrm{class}(x_i)|$ if $x_j \in \mathrm{class}(x_i)$ and $0$ otherwise. With $X_P$ defined thusly, averaging $x$ over the classes of $P$ is equivalent to multiplying by $X_P$, i.e., $\tilde{x} = X_P x$. Partition matrices are always doubly stochastic ($X_P \mathbf{1} = \mathbf{1}$), symmetric ($X_P^T = X_P$), and idempotent ($X_P X_P = X_P$), and as a consequence also positive semidefinite.

Example. We seek to minimize the function $x^T Q x$ over $x \in \mathbb{R}^4$, subject to $x \ge \mathbf{1}$, with $Q$ given in Fig. 2. As a lifting partition, we propose $P = \{\{x_1, x_3\}, \{x_2, x_4\}\}$ (in the next paragraph, we will explain how one could compute this lifting partition). The corresponding partition matrix $X_P$ is also shown in Fig. 2. Let us demonstrate that averaging over the classes of $P$ decreases the value of the solution. For example, for $x_0 = [2, 1, 1, 2]^T$, $x_0^T Q x_0 = 3$. On the other hand, the class-averaged $\tilde{x}_0 = X_P x_0 = [1.5, 1.5, 1.5, 1.5]^T$ yields a value of 0. In fact, one could notice that any feasible $x$ respecting the partition yields a value of 0, so any such solution is optimal. Moreover, if all coordinates of $x$ are already greater than or equal to 1, then the same holds for $\tilde{x}$, as averages cannot be lower than the minimum of the averaged numbers. Thus, the compressed problem reduces to finding any two numbers that are $\ge 1$. □

An intuitive way to find lifting partitions is via automorphism groups of convex problems. We define the automorphism group of (*), Aut(*), as the group of all pairs of permutations $(\sigma, \pi)$ with permutation matrices $(\Sigma, \Pi)$, such that for all $x$, $J(x) = J(\Pi x)$ and $(f_1(\Pi x) \le 0, \ldots, f_m(\Pi x) \le 0) = (f_{\sigma(1)}(x) \le 0, \ldots, f_{\sigma(m)}(x) \le 0)$. In other words, renaming the variables yields the same constraints up to reordering. For linear programs (LPs), this is equivalent to $\Sigma A = A \Pi$, $\Sigma b = b$, and $c^T \Pi = c^T$. The partition that groups together $x_i$ with $x_j$ if some $\Pi$ in Aut(*) exchanges them is called an orbit partition. An interesting fact is that if $P$ is an orbit partition, $X_P$ is the symmetrizer matrix of Aut(*), $X_P = \frac{1}{|\mathrm{Aut}(*)|} \sum_{(\Sigma, \Pi) \in \mathrm{Aut}(*)} \Pi$. One way to detect renaming symmetries is by inspection of the parameters of the problem. E.g., for a convex quadratic program $(Q, c, A, b)$, a set of necessary conditions for the pair of permutations $(\Sigma, \Pi)$ to be a renaming symmetry is: (i) $\Pi Q = Q \Pi$ (equivalently $\Pi Q \Pi^T = Q$), (ii) $c^T \Pi = c^T$, (iii) $\Sigma A = A \Pi$, and (iv) $\Sigma b = b$. Such automorphism groups, or rather, the orbit partitions thereof, can be computed via packages such as Saucy (Codenotti et al. 2013). The reason why orbit partitions are lifting partitions of a convex problem is that $J(X_P x) = J\big(\frac{1}{|\mathrm{Aut}|} \sum_{(\Sigma, \Pi) \in \mathrm{Aut}} \Pi x\big) \le \frac{1}{|\mathrm{Aut}|} \sum_{(\Sigma, \Pi) \in \mathrm{Aut}} J(\Pi x) = J(x)$, the inequality being due to the convexity of $J$. Reconsider our example in Fig. 2. Permutations renaming row/column 1 to 3 resp. 2 to 4 are automorphisms, and $P$ is an orbit partition.
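As a small illustration (our own sketch, not from the paper), the necessary conditions (i)-(iv) can be checked mechanically with numpy; the function name, the index-array convention for $\Pi$ and $\Sigma$, and the tolerance are assumptions of the sketch.

```python
import numpy as np

def is_renaming_symmetry(Q, c, A, b, pi, sigma, tol=1e-9):
    """Check conditions (i)-(iv): pi permutes the n variables, sigma the m
    constraints; both are given as index arrays, e.g. pi = [2, 3, 0, 1]."""
    n, m = Q.shape[0], A.shape[0]
    Pi = np.eye(n)[pi]              # permutation matrix built from the index array
    Sigma = np.eye(m)[sigma]
    return (np.allclose(Pi @ Q, Q @ Pi, atol=tol)          # (i)   Pi Q = Q Pi
            and np.allclose(c @ Pi, c, atol=tol)           # (ii)  c^T Pi = c^T
            and np.allclose(Sigma @ A, A @ Pi, atol=tol)   # (iii) Sigma A = A Pi
            and np.allclose(Sigma @ b, b, atol=tol))       # (iv)  Sigma b = b
```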
For the special case of LPs, (Grohe et al. 2014) have proven that equitable partitions act as lifting partitions. An equitable partition (EP) of a square symmetric $n \times n$ matrix $M$ is a partition $P$ of $\{1, \ldots, n\}$ such that $X_P$ satisfies $X_P M = M X_P$. For rectangular matrices, we say that a partition $P$ of the columns is equitable if there exists a partition $\mathcal{Q}$ of the rows such that $X_{\mathcal{Q}} M = M X_P$. For LPs, we say that a partition $P$ of the variables is equitable if there exists a partition $\mathcal{Q}$ of the constraints such that $c^T X_P = c^T$, $X_{\mathcal{Q}} b = b$, and $X_{\mathcal{Q}} A = A X_P$. EPs and their corresponding partition matrices are referred to as fractional automorphisms resp. fractional symmetries; we will use both terms interchangeably. They satisfy the same conditions as the automorphisms from the previous paragraph, except that $X_P$ is a doubly stochastic matrix and not a permutation matrix.

Figure 2: Running example for fractional symmetries of QPs. (Top) A matrix specification of $\min_{x \in \mathbb{R}^4} x^T Q x$ s.t. $Ax \le b$ and the partition matrix $X_P$ of the partition $P = \{\{x_1, x_3\}, \{x_2, x_4\}\}$. (Bottom) The factor $B$ with $B B^T = Q$ as well as a sketch of the rows of $B$. Multiplying $B$ by the matrix $M$ on the right, which equates to rescaling and rotating the vectors by 45°, is a symmetry of $B$; it yields the same configuration modulo renaming.

Moreover, EPs have an equivalent combinatorial characterization. A partition $P$ of an $n \times n$ matrix $M$ is equitable if for all $i, j$ in the same class $P$ and every class $P'$ (including $P' = P$), we have $\sum_{k \in P'} M_{ik} = \sum_{k \in P'} M_{jk}$. In other words, if we reorder the rows and columns of $M$ such that indices of the same class are next to each other, $M$ will take on a block-rectangular form where every row (and column) of a block has the same sum. One special flavor of EPs are what we will call counting partitions, where a narrower condition holds: $|\{k \in P' \mid M_{ik} = c\}| = |\{k \in P' \mid M_{jk} = c\}|$ for all $c \in \mathbb{R}$, and $M_{ii} = M_{jj}$ if $i, j$ are in the same class. They partition $M$ into blocks where each row (and column) has the same count of each number. The EP of our example is such a partition. In fact, any orbit partition of a permutation group is a counting partition as well. EPs have several very attractive properties when used as lifting partitions. First, the coarsest EP (as well as the coarsest counting equitable partition) of a matrix is computable in $O((e + n) \log n)$ time, where $e$ is the number of non-zeroes in the matrix, via an elegant algorithm called color refinement (Grohe et al. 2014). Second, the coarsest EP is at least as coarse as the orbit partition of a matrix, hence it offers more compression.
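For illustration, here is a naive Python sketch (ours) of such a refinement: it computes a counting equitable partition of a symmetric matrix by repeatedly splitting classes on row signatures. It takes roughly quadratic time per round, whereas the color-refinement algorithm of (Grohe et al. 2014) achieves the quasi-linear bound stated above; the function name and the rounding tolerance are assumptions of the sketch.

```python
import numpy as np

def counting_partition(M, decimals=9):
    """Refine a partition of {0,...,n-1} until, within every class, all rows of M
    have the same (rounded) diagonal entry and the same multiset of values
    towards every class, i.e. until the partition is a counting EP of M."""
    n = M.shape[0]
    Mr = np.round(M, decimals)
    colors = [0] * n                                   # start from a single class
    while True:
        classes = {}
        for i, col in enumerate(colors):
            classes.setdefault(col, []).append(i)
        # Signature of row i: its current color, its diagonal entry, and the
        # sorted values M_ik over each class (the "counting" condition).
        sigs = []
        for i in range(n):
            per_class = tuple(tuple(sorted(Mr[i, members]))
                              for _, members in sorted(classes.items()))
            sigs.append((colors[i], Mr[i, i]) + per_class)
        relabel = {}
        new_colors = [relabel.setdefault(s, len(relabel)) for s in sigs]
        if new_colors == colors:                       # fixpoint: partition is equitable
            return [members for _, members in sorted(classes.items())]
        colors = new_colors
```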
Fractional Symmetry of Convex QPs

Having developed automorphisms of convex QPs, we now move on to our main contributions. We develop FAs resp. EPs of a convex QP. We start off by showing that they are lifting partitions. Then, we provide a geometric interpretation and investigate whether kernels preserve fractional symmetries.

Equitable Partitions (EPs) of QPs: We now prove that equitable partitions of a convex QP capture its symmetries and are lifting partitions.

Theorem 1. Let $\mathrm{QP} = (Q, c, A, b)$ be a convex quadratic program. If $P$ is a partition of the variables of QP such that: (a) $X_P Q = Q X_P$ and $c^T X_P = c^T$; (b) there exists a partition $\mathcal{Q}$ of the constraints of QP such that $X_{\mathcal{Q}} b = b$ and $X_{\mathcal{Q}} A = A X_P$; then $P$ is a lifting partition for QP.

Proof. We proceed along the lines drawn out in the previous section and show that for any feasible $x$, the class-averaged $x' = X_P x$ is both feasible and satisfies $J(x') \le J(x)$. Let us start with the latter. Note that both $Q$ and $X_P$ are diagonalizable (i.e., admit an eigendecomposition) since they are symmetric and real matrices. It is known that if two diagonalizable matrices commute (as is our starting hypothesis, $X_P Q = Q X_P$), then they are also simultaneously diagonalizable. That is, there exists an orthonormal basis of vectors $u_1, \ldots, u_n$ such that $Q = \sum_i \lambda_i u_i u_i^T = U \Lambda U^T$ and $X_P = \sum_i \kappa_i u_i u_i^T = U K U^T$, where the $\lambda_i$'s and $\kappa_i$'s are nonnegative scalars. Now, $J(x') = J(X_P x) = x^T X_P^T Q X_P x + c^T X_P x$. From our discussion so far and assumption (a), this is equal to $x^T U K^T U^T U \Lambda U^T U K U^T x + c^T x = x^T U \Lambda K^2 U^T x + c^T x$. The key observation is that because $X_P$ is doubly stochastic, $|\kappa_i| \le 1$. Hence $x^T U K^2 \Lambda U^T x = \sum_i \kappa_i^2 \lambda_i x^T u_i u_i^T x \le \sum_i \lambda_i x^T u_i u_i^T x$, as each $\lambda_i x^T u_i u_i^T x$ is a nonnegative quantity. This entails $J(x) \ge J(x')$. Regarding feasibility, because $X_{\mathcal{Q}}$ is a matrix of nonnegative numbers, $Ax \le b$ implies $X_{\mathcal{Q}} A x \le X_{\mathcal{Q}} b$. Due to (b), this becomes $A X_P x \le b$, that is, $A x' \le b$, demonstrating the feasibility of $x'$. We have thus satisfied the two sufficient conditions stated in the previous section and shown that any $P$ satisfying our assumptions is a lifting partition for QP. □

Example. Recall $Q$ from our running example in Fig. 2. However, this time we propose $P' = \{\{x_1, x_2, x_3, x_4\}\}$ as a lifting partition with $X_{P'} = \frac{1}{4} \mathbf{1}_4$, where $\mathbf{1}_4$ is the $4 \times 4$ matrix of ones. We observe that $Q X_{P'} = X_{P'} Q = 0_4$; moreover, if we introduce the constraint partition $\mathcal{Q}' = \{\{y_1, \ldots, y_4\}\}$ with partition matrix $X_{\mathcal{Q}'} = X_{P'}$, we have that $X_{\mathcal{Q}'} A = A X_{P'}$ and $X_{\mathcal{Q}'} b = b$. According to Thm. 1, $P'$ is a lifting partition of the QP in question. □

There are two interesting observations to be made here. First, we have gained even further compression over our previous attempt, having a compressed problem with 1 variable instead of 2. Second, there is no automorphism of $Q$ that could possibly exchange $x_1$ and $x_2$. As fractional symmetries generalize exact symmetries, see e.g. (Godsil 1997), it is to be expected that equitable partitions coarser than the orbit partition could satisfy the conditions of Thm. 1. Moreover, these observations allow one to gain insight into what fractional symmetry means geometrically for a dataset. This is important as the matrix $Q$ relates to the data we feed into the optimization problem for many QPs; e.g., in the SVM dual QP, the entries of $Q$ are inner products of the training feature vectors.
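To show how Thm. 1 translates into an actual size reduction, the following sketch (ours) builds the quotient QP induced by a lifting partition and solves it with CVXPY (Diamond, Chu, and Boyd 2014), which the paper mentions only as a modeling package, so treat the solver choice as an assumption; `solve_lifted_qp`, the indicator matrix `S`, and the decision not to deduplicate constraints are likewise illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def solve_lifted_qp(Q, c, A, b, classes):
    """Solve min x^T Q x + c^T x s.t. A x <= b over the subspace R_P in which all
    variables of a class share one value; valid when `classes` is a lifting
    partition (e.g. an EP satisfying the conditions of Thm. 1)."""
    n, p = Q.shape[0], len(classes)
    S = np.zeros((n, p))                         # indicator matrix: x = S y
    for j, cls in enumerate(classes):
        S[cls, j] = 1.0
    Q_l, c_l, A_l = S.T @ Q @ S, S.T @ c, A @ S  # quotient data over p lifted variables
    # Constraints that became identical could additionally be deduplicated via the
    # constraint partition; keeping them all is still correct, just less compact.
    y = cp.Variable(p)
    problem = cp.Problem(cp.Minimize(cp.quad_form(y, Q_l) + c_l @ y), [A_l @ y <= b])
    problem.solve()
    return S @ y.value, problem.value            # ground solution and optimal value
```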
Geometry of Fractionally-Symmetric QPs: Our investigation is inspired by the characterization of automorphisms of semidefinite matrices and quadratic forms. One way to think about a semidefinite matrix $Q$ is as the Gram matrix of a set of vectors, i.e., $Q = B B^T$ where $B$ is an $n \times k$ matrix and $k \ge \mathrm{rank}(Q)$. In this light, the quadratic form $x^T Q x$ can be seen as the squared Euclidean norm of a matrix-vector product. That is, $x^T Q x = x^T B B^T x = (B^T x)^T (B^T x) = \|B^T x\|^2$. It is a basic fact that the Euclidean norm is invariant under orthonormal transformations, that is, for any orthonormal matrix $O$ and any vector $y$, $\|O^T y\| = \|y\|$, as $y^T O O^T y = y^T y$. Thus, suppose we have a rotational automorphism of $B$, i.e., a pair of an orthonormal matrix $O$ and a permutation matrix $\Pi$, such that $\Pi B = B O$, or equivalently $\Pi B O^T = B$. That is, rotating the tuple of vectors that are the rows of $B$ yields the same tuple back, but in a different order. Observe then that $\Pi$ would be a renaming automorphism for $Q$, since $\Pi Q \Pi^T = \Pi B B^T \Pi^T = B O O^T B^T = B B^T = Q$, implying $\Pi Q = Q \Pi$. Moreover, if the right dimension (the number of columns of $B$) is held fixed, the converse is true as well (Bremner, Dutour Sikirić, and Schürmann 2009). That is, not only do rotational symmetries of $B$ correspond to renaming symmetries of $Q$, but also vice versa, as for fixed $k$, the semidefinite factors of $Q$ are unique up to rotations.

Example. Our $Q$ from Fig. 2 can be factored into $B B^T$ as shown in Fig. 2. The figure also shows the plot of these vectors. If we were to rotate them by 180° counter-clockwise, we would get back the same set of vectors, but in the order $\{x_3, x_4, x_1, x_2\}$. The permutation matrix according to this reordering is a renaming automorphism of $Q$. □

Using the case of automorphisms as a motivation, we now turn to fractional automorphisms. More precisely, given a doubly stochastic and idempotent matrix $X$ such that $X Q = Q X$, we would like to derive a similar characterization of $X$ in terms of $B$. As we prove now, this is indeed possible.

Theorem 2. Let $X$ be a symmetric and idempotent matrix (as our usual color-refinement automorphisms are), and let $Q = B B^T$ be a positive semidefinite matrix with $B$ having full column rank. Then $X Q = Q X$ if and only if there exists a symmetric matrix $R$ such that $X B = B R$.

Proof. (only if direction): Let $R$ be such that $R = R^T$ and $X B = B R$. Then, $X Q = X B B^T = B R B^T$. Making use of $R = R^T$, this rewrites as $B R^T B^T = B (B R)^T = B (X B)^T = B B^T X^T = Q X$, as $X$ is symmetric. (if direction): Let $X Q = Q X$ with $X$ being idempotent and symmetric. Then, let $R = B^T X B (B^T B)^{-1}$. Observe that $B (B^T B)^{-1}$ exists and is the right pseudoinverse of $B^T$, i.e., $B^T B (B^T B)^{-1} = I_k$, as $B$ has full column rank. Therefore, inserting $I_k$ yields $X B = X B B^T B (B^T B)^{-1} = B B^T X B (B^T B)^{-1} = B R$. It remains to demonstrate that $R$ is symmetric. Recall that $R^T R$ and $(B^T B)^{-1}$ are symmetric matrices. Then, $R^T R = \big(B^T X B (B^T B)^{-1}\big)^T \, B^T X B (B^T B)^{-1} = (B^T B)^{-1} B^T X B B^T X B (B^T B)^{-1}$. Since $X B B^T = B B^T X$ and $X X = X$, this simplifies to $(B^T B)^{-1} (B^T B) B^T X B (B^T B)^{-1} = B^T X B (B^T B)^{-1} = R$. Hence, as $R^T R = R$ and $R^T R$ is symmetric, $R$ is symmetric. □

This theorem holds the key to explaining why all 4 dimensions in our example are compressed together. To see why, consider the situation in Fig. 2.

Example. Fig. 2 shows the factor $B$ of $Q$ (as well as a sketch of its rows) and an invertible matrix $M$, which consists of a clockwise rotation by 45°, which aligns the vectors with the axes, a rescaling of the vectors along the axes, and then a further rotation by 45°. Multiplying $B$ by $M$ yields back the same row vectors modulo a cyclic permutation, exchanging $x_4$ with $x_1$, $x_1$ with $x_2$, and so on, i.e., $\Sigma B = B M$. Moreover, $B M M^T B^T = Q$. The group $\{M, M^2, M^3, M^4\}$ is thus a group that does not correspond to any group of automorphisms of $Q$; yet, the symmetrizer matrix $\frac{1}{4} \sum_{i=1}^{4} M^i$ is symmetric (and equal to $0_2$), so it qualifies under the conditions of Thm. 2. □

From this we can conclude that certain scaling symmetries of $B$ do not result in symmetries of $Q$, but do result in fractional symmetries of $Q$ (Thm. 2). On the other hand, by Thm. 1, we can also infer that these symmetries can safely be compressed out when minimizing the quadratic form $x^T Q x$. Note finally that even these symmetries do not exhaust the possible matrices of Thm. 2, unless the graph isomorphism problem is P-TIME solvable. Thm. 2 allows for partitions and matrices that do not correspond to any group. Characterizing them is an exciting avenue for future work.
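The construction in the proof of Thm. 2 is easy to check numerically; the following numpy sketch (ours) recovers $R = B^T X B (B^T B)^{-1}$ from a fractional symmetry $X$ of $Q = BB^T$ and verifies the claimed properties. The function name and the tolerance are assumptions.

```python
import numpy as np

def factor_of_fractional_symmetry(B, X, tol=1e-8):
    """Given B with full column rank and a symmetric, idempotent X that commutes
    with Q = B B^T, return the matrix R of Thm. 2 and check X B = B R, R = R^T."""
    Q = B @ B.T
    assert np.allclose(X, X.T, atol=tol) and np.allclose(X @ X, X, atol=tol)
    assert np.allclose(X @ Q, Q @ X, atol=tol)      # X is a fractional symmetry of Q
    R = B.T @ X @ B @ np.linalg.inv(B.T @ B)        # construction from the proof
    assert np.allclose(X @ B, B @ R, atol=tol)      # X acts on the rows of B through R
    assert np.allclose(R, R.T, atol=tol)            # and R is symmetric
    return R

# Hypothetical usage: any partition matrix X_P of an EP of Q (e.g. obtained with the
# counting_partition sketch above) qualifies as such an X.
```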
Approximately Lifted SVMs: Indeed, one may argue that the (rotational) automorphism group of most Euclidean datasets consists of the identity transformation alone. This follows from the same result for convex bodies, see e.g. (Güler and Gürtuna 2012), and is to be expected, since the symmetry properties of a given dataset $B$ can easily be destroyed by slightly perturbing the body. To bypass this, we propose the first approximate lifting approach for Euclidean datasets.

Proposition 3. Let $B$ be a Euclidean dataset and $D$ its corresponding pairwise distance matrix. Then $B_i$ and $B_j$ are in the same (rotational) orbit if and only if $B_i$ and $B_j$ have the same sorted distances to all other data points.

Proof. The EP of $D$ encodes the symmetries of $B$. To compute it, we represent it as a colored graph $C$ of $D$. We note that $C$ is a clique with edge colors encoding distances. We turn this into a node-colored graph by assigning the same color to all nodes that have identical edge-color signatures. Running color refinement on this graph does not add any new color since $C$ is a clique. □

This suggests a simple way to compute proper approximations of (rotational) EPs of $B$, as illustrated in Fig. 3(left): (1, optional) whiten the data to capture some scalings, (2) compute the pairwise distance matrix $D$ of $B$ (potentially using anchor points), (3) sort each row of $D$, and (4) run any clustering algorithm on the sorted distance matrix; see the sketch below. More importantly, it connects lifted inference to SVM approaches that reduce the size of the optimization problem by extracting a small number of representatives from the original training set and using them to train an approximate SVM, see e.g. (Boley and Cao 2004). In particular, one can easily prove (proof left out due to space restrictions) that the same PAC-style generalization bound applies (Cao and Boley 2007): the approximately lifted SVM will very likely have a small expected error rate if it has a small empirical loss over the original dataset.
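Steps (1)-(4) map directly onto standard tooling; the sketch below (ours, not the authors' implementation) uses numpy, scipy, and scikit-learn, and the function name, the PCA-style whitening, the random anchor sampling, and k-means as the clustering step are assumptions of the sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def approximate_ep(B, n_classes, n_anchors=None, whiten=True, seed=0):
    """Approximate (rotational) EP of a Euclidean dataset B (rows are points):
    (1) optionally whiten, (2) compute distances (optionally only to anchors),
    (3) sort each row, (4) cluster the sorted rows. Returns a class label per point."""
    rng = np.random.default_rng(seed)
    if whiten:                                        # (1) decorrelate and rescale
        Bc = B - B.mean(axis=0)
        eigval, eigvec = np.linalg.eigh(np.cov(Bc, rowvar=False))
        B = Bc @ eigvec / np.sqrt(np.maximum(eigval, 1e-12))
    anchors = B if n_anchors is None else B[rng.choice(len(B), n_anchors, replace=False)]
    D = cdist(B, anchors)                             # (2) (partial) distance matrix
    D.sort(axis=1)                                    # (3) sort each row in place
    return KMeans(n_clusters=n_classes, n_init=10,    # (4) cluster sorted distance rows
                  random_state=seed).fit_predict(D)
```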
Kernels and Equitable Partitions: Finally, we touch upon the relationship between the fractional symmetry of data vectors and kernels. Kernel functions often appear in conjunction with quadratic optimization in machine learning problems as a means of enriching the hypothesis space of a learner. From an algebraic perspective, the essence of the approach is to replace the entries of the semidefinite matrix $Q$ with the values of a kernel function, which represents the inner product of data vectors under some non-linear transformation into a high-dimensional space. That is, in place of $Q_{ij} = \langle B_i, B_j \rangle$, we use $K_{ij} = k(B_i, B_j) = \langle \phi(B_i), \phi(B_j) \rangle$, where $\phi : \mathbb{R}^n \to \mathbb{R}^m$ is some non-linear function with $m$ much greater than $n$, or even infinite. Due to the prevalence of kernels, it is important to understand whether kernels preserve or destroy symmetries. Here, we will examine two popular kernels, the polynomial kernel $k_{\mathrm{POLY}}(x, y) = (\langle x, y \rangle + 1)^g$ and the RBF kernel $k_{\mathrm{RBF}}(x, y) = \exp(-2\gamma^2 \|x - y\|_2^2)$, where $g$ is a positive integer and $\gamma$ is a nonzero real number. We find that in both cases, if $Q = B B^T$ admits a counting equitable partition, then $K$ will admit the same partition as well, i.e., these two kernels preserve fractional symmetry of $Q$ up to counting (recall that this includes rotational symmetry of $B$):

Proposition 4. Let $B$ be a matrix whose rows are data instances. Then, if $Q = B B^T$ admits a counting equitable partition $P$ with partition matrix $X_P$, both kernel matrices (a) $K^{\mathrm{POLY}}$ and (b) $K^{\mathrm{RBF}}$ of this set of vectors admit the same counting partition.

Proof. Recall that an EP $P$ is a counting partition for $Q$ if for all $x_i, x_j$ in the same class $P$, and for every class $P'$ (including $P' = P$), $|\{k \in P' \mid Q_{ik} = c\}| = |\{k \in P' \mid Q_{jk} = c\}|$ for all $c \in \mathbb{R}$, and $Q_{ii} = Q_{jj}$.

(a) A direct consequence of this definition is that if $P$ is a counting partition for $Q$, it will be a counting partition for every other matrix whose equality pattern respects that of $Q$; in other words, $Q_{ij} = Q_{pq} \Rightarrow K_{ij} = K_{pq}$. $K^{\mathrm{POLY}}$ has exactly this property: $K^{\mathrm{POLY}}_{ij} = (\langle B_i, B_j \rangle + 1)^g = (Q_{ij} + 1)^g$. It is clear that if $Q_{ij}$ and $Q_{pq}$ are equal, the values of the last expression will be equal as well.

(b) First, we note that $K^{\mathrm{RBF}}_{ij} = \exp(-2\gamma^2 \|B_i\|^2) \exp(-2\gamma^2 \|B_j\|^2) \exp(4\gamma^2 \langle B_i, B_j \rangle)$. This allows one to rewrite $K^{\mathrm{RBF}}$ in terms of $Q$: $K^{\mathrm{RBF}}_{ij} = \exp(-2\gamma^2 Q_{ii}) \exp(-2\gamma^2 Q_{jj}) \exp(4\gamma^2 Q_{ij})$. Now, let $x_i, x_j \in P$ and $x_p, x_q \in P'$ such that $Q_{ip} = Q_{jq}$. Since $Q_{ii} = Q_{jj}$ (by virtue of being in $P$) and $Q_{pp} = Q_{qq}$ (by virtue of $P'$), we have that $K^{\mathrm{RBF}}_{ip} = K^{\mathrm{RBF}}_{jq}$; hence, counts across classes are preserved. □

To summarize, a convex QP can be lifted without loss of quality: compute, in quasi-linear time, its quotient model w.r.t. its EP, as illustrated in Fig. 3(right). For the polynomial and RBF kernels, this also leads to valid liftings.

Figure 3: (Left) Approximate EPs without scalings on 3D datasets. The colors encode the rotational symmetries under a budget of 4 resp. 5 orbits. E.g., the hand consists of 327,323 points (a clique with roughly $5 \cdot 10^{10}$ edges), running in < 5 secs using 2500 anchor points. (Right) Convex QPs can be exactly lifted by running color refinement (Ahmadi et al. 2013) on the colored graph encoding them. This takes quasi-linear time, and one can directly read off $X_P$. After color refinement, nodes with the same color form the quotient QP of the original QP. Since the corresponding constraints of variables with the same color (they are in the same orbit) are identical, one can drop all but one of the identical constraints; this forms $X_{\mathcal{Q}}$. This compression is exact, i.e., the lifted QP computes an optimal solution of the original QP. (Best viewed in color)

Empirical Illustration

Our intention here is to investigate the question (Q): can AI potentially benefit from relational and lifted QPs? As our main experiment, we compared our lifted QP approach to relational linear programs (Kersting, Mladenov, and Tokmakov 2015), following their experimental setup for the Cora dataset (Sen et al. 2008), which consists of 2708 scientific papers classified into seven classes. Each paper is described by a binary word vector indicating the absence/presence of a word from a dictionary of 1433 words. The citation network of the papers consists of 5429 links. The goal is to predict the class of the paper. For simplicity, we converted this problem to a binary classification problem by taking the largest of the 7 classes as the positive class. We compared four different learners on Cora. The base classifiers are an $\infty$-norm regularized SVM (LP-SVM) (Zhou, Zhang, and Jiao 2002) and a conventional SVM (QP-SVM) (Vapnik 1998) formulated as a convex QP. Both use the word feature vectors and do standard linear prediction (no kernel used). Additionally, we considered transductive, collective versions of both of them, again following (Kersting, Mladenov, and Tokmakov 2015), denoted as TC-LP-SVM resp. TC-QP-SVM. Both transductive approaches have access to the citation network and implement the following simple rule: whenever we have access to an unlabeled paper i, if there is a cited or citing labeled paper j, then assume the label of j as the label of i. To account for contradicting constraints (a paper citing both papers of its class and papers not of its class), we introduced separate slack variables for the transductive constraints and added them to the objective with a different penalty parameter.
This can easily be implemented by adding a few relational constraints to an existing standard QP-SVM formulation, see Fig. 1(left). In order to investigate the performance, we varied the amount of labeled examples available. That is, we have four cases, where we restricted the number of labeled examples to t = 20%, 40%, 60%, and 80% of the size of the dataset. We first randomly split the dataset into a labeled set L and an unlabeled test set B, according to t. Then, we split L randomly in half, leaving one half A for training, the other half becoming a validation set C. The validation set was used to select the parameters of the TC-QP-SVM in a 5-fold cross-validation fashion. That is, we split the validation set into 5 subsets $C_i$ of equal size. On these sets we selected the parameters using a grid search, training for each $C_i$ on $A \cup (C \setminus C_i)$ as labeled and $B \cup C_i$ as unlabeled examples, computing the prediction error on $C_i$ and averaging it over all $C_i$'s. We then evaluated the selected parameters on the test set B, whose labels were never revealed in training. We repeated this experiment 5 times (once for each $C_i$) for the TC-SVMs. For consistency, we followed the same protocol with QP-SVM and LP-SVM, except that the set $B \cup C_i$ did not appear during training, as the non-transductive learners have no use for unlabeled examples. That is, we selected parameters by training on $A \cup (C \setminus C_i)$ and evaluating on $C_i$. The selected parameters were then evaluated on the test set B. For all SVM models, we also ran a ground and a lifted version. The results are summarized in Fig. 1(right): QP-SVM outperforms LP-SVM for each setting, both are outperformed by TC-L/QP-SVM, and TC-QP-SVM outperforms TC-LP-SVM. While there was no appreciable symmetry in either QP-SVM or LP-SVM, TC-QP-SVM exhibited significant variable and constraint reduction: the lifted problem was reduced to up to 78% of the variables, resp. 70% of the constraints, of the ground problem, while achieving the same accuracy.[1]

To illustrate our approach on propositional data, we considered SVM classifiers for varying amounts of overlap between two classes represented by spherical Gaussians. This dataset was chosen in order to depict the potential of approximate symmetries. We trained a lifted SVM (LSVM) with 200 approximate color classes and a conventional SVM, both with RBF kernels, on 2500 training examples per class. We used a grid search together with CV for selecting $\gamma \in \{0.25, 0.50, 1.00, 2.00, 4.00\}$ and $C \in \{0.5, 1.0, 2.0\}$. The performance was measured on an independently drawn test set of 5000 data points per class. For approximate lifting we used k-means with the Euclidean metric and 500 anchor points. For class centers 4 units apart, the SVM achieved an error of 0.02 in 20 secs (all numbers in this experiment are averaged over 10 reruns and rounded to the second digit), while the LSVM achieved 0.02 in 1.7 secs. For class centers 2 units apart, the SVM took 98 secs, achieving an error of 0.16, while the LSVM achieved 0.17 in 2.1 secs. Thus, the generalization bound guarantees that the LSVM will very likely have an expected error rate comparable to the SVM, at a fraction of the time. Overall, the empirical results are clear evidence for an affirmative answer to question (Q).

Conclusions

We have deepened the understanding of symmetries in statistical AI and extended the scope of lifted inference. Specifically, we have introduced and studied a precise mathematical definition of fractional symmetry of convex QPs.
Using the tool of fractional automorphisms, orbits of optimization variables are obtained, and lifted solvers materialize as performing the corresponding optimization problem in the space of per-orbit optimization variables. This enables the lifting of a large class of AI tools such as spectral relaxations for MRF inference (Cour and Shi 2007) and quadratic assignment problems as implied by (Lu and Boutilier 2015). We here instantiated the framework for QP-SVMs by developing the first lifted and relational solvers for them and illustrating their benefits empirically. In the future, other AI settings and more datasets should be explored, one should investigate the link to other data reduction methods, move beyond convex QPs, and explore our framework for symmetry-based learning (Gens and Domingos 2014).[2]

[1] Qualitatively similar results were obtained on the two-moons dataset with 150 additional features, each drawn randomly from a Gaussian per example, using the 4-NN graph as "citation network".

[2] Initial results are promising. We considered MNIST images for two randomly selected classes and augmented the dataset by applying rotations (1°, 2°, ..., 360°) to each image. An SVM learned on all (original + augmented) examples achieved 95.9% accuracy on the independent test set, taking 10.5 hours. In contrast, an LSVM with assumed symmetries putting rotations of (0°, 1°, ..., 9°), (10°, ..., 19°), ... degrees into the same color class achieved 96.0% accuracy in less than 5 minutes. For both classifiers, cost parameters were selected using a grid search on a subset of the training set.

Acknowledgments: The authors would like to thank the anonymous reviewers for their feedback. The work was partly supported by the DFG Collaborative Research Center SFB 876, project A6 "Resource-efficient Graph Mining".

References

Ahmadi, B.; Kersting, K.; Mladenov, M.; and Natarajan, S. 2013. Exploiting symmetries for scaling loopy belief propagation and relational training. Machine Learning 92:91-132.

Boley, D., and Cao, D. 2004. Training support vector machines using adaptive clustering. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM), 126-137.

Bremner, D.; Dutour Sikirić, M.; and Schürmann, A. 2009. Polyhedral representation conversion up to symmetries. In Proceedings of the 2006 CRM Workshop on Polyhedral Computations. AMS, Providence.

Bui, H.; Huynh, T.; and Riedel, S. 2013. Automorphism groups of graphical models and lifted variational inference. In Proc. of the 29th Conference on Uncertainty in Artificial Intelligence (UAI).

Cao, D., and Boley, D. 2007. A PAC bound for approximate support vector machines. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM), 455-460.

Codenotti, P.; Katebi, H.; Sakallah, K.; and Markov, I. 2013. Conflict analysis and branching heuristics in the search for graph automorphisms. In Proc. of ICTAI, 907-914.

Cour, T., and Shi, J. 2007. Solving Markov random fields with spectral relaxation. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), 75-82.

De Raedt, L.; Kersting, K.; Natarajan, S.; and Poole, D. 2016. Statistical Relational Artificial Intelligence: Logic, Probability, and Computation. Morgan & Claypool Publishers.

Diamond, S.; Chu, E.; and Boyd, S. 2014. CVXPY: A Python-embedded modeling language for convex optimization, version 0.2. http://cvxpy.org/.

Gens, R., and Domingos, P. 2014. Deep symmetry networks. In Proc. of the Annual Conference on Neural Information Processing Systems (NIPS), 2537-2545.

Godsil, C. 1997. Compact graphs and equitable partitions. Linear Algebra and its Applications 255:259-266.
Grohe, M.; Kersting, K.; Mladenov, M.; and Selman, E. 2014. Dimension reduction via colour refinement. In Proceedings of the 22nd Annual European Symposium on Algorithms (ESA), 505-516.

Güler, O., and Gürtuna, F. 2012. Symmetry of convex sets and its applications to the extremal ellipsoids of convex bodies. Optimization Methods and Software 27(4-5):735-759.

Jernite, Y.; Rush, S.; and Sontag, D. 2015. A fast variational approach for learning Markov random field language models. In Proc. of the 32nd International Conference on Machine Learning (ICML).

Kersting, K.; Mladenov, M.; and Tokmakov, P. 2015. Relational linear programming. Artificial Intelligence Journal (AIJ), Online First.

Lu, T., and Boutilier, C. 2015. Value-directed compression of large-scale assignment problems. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI).

Margot, F. 2010. Symmetry in integer linear programming. In Jünger, M.; Liebling, T.; Naddef, D.; Nemhauser, G.; Pulleyblank, W.; Reinelt, G.; Rinaldi, G.; and Wolsey, L., eds., 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. Springer. 1-40.

Mladenov, M.; Globerson, A.; and Kersting, K. 2014. Lifted message passing as reparametrization of graphical models. In Proc. of the 30th Int. Conf. on Uncertainty in Artificial Intelligence (UAI).

Poole, D. 2003. First-order probabilistic inference. In Proc. of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), 985-991.

Ré, C.; Agrawal, D.; Balazinska, M.; Cafarella, M.; Jordan, M.; Kraska, T.; and Ramakrishnan, R. 2015. Machine learning and databases: The sound of things to come or a cacophony of hype? In Proc. of the ACM SIGMOD International Conference on Management of Data, 283-284.

Riedel, S.; Smith, D.; and McCallum, A. 2012. Parse, Price and Cut: Delayed Column and Row Generation for Graph-Based Parsers. In Proc. of EMNLP-CoNLL, 732-743.

Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Gallagher, B.; and Eliassi-Rad, T. 2008. Collective classification in network data. AI Magazine 29(3):93-106.

Shervashidze, N., and Borgwardt, K. 2009. Fast subtree kernels on graphs. In Proc. of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), 1660-1668.

Vapnik, V. 1998. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications and Control Series. John Wiley & Sons, New York. A Wiley-Interscience Publication.

Wallace, S., and Ziemba, W., eds. 2005. Applications of Stochastic Programming. SIAM, Philadelphia.

Yih, W., and Roth, D. 2007. Global inference for entity and relation identification via a linear programming formulation. In Getoor, L., and Taskar, B., eds., An Introduction to Statistical Relational Learning. MIT Press.

Zhou, W.; Zhang, L.; and Jiao, L. 2002. Linear programming support vector machines. Pattern Recognition 35(12):2927-2936.