# Higher-Order Factorization Machines

Mathieu Blondel, Akinori Fujino, Naonori Ueda (NTT Communication Science Laboratories, Japan)
Masakazu Ishihata (Hokkaido University, Japan)

*30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.*

**Abstract.** Factorization machines (FMs) are a supervised learning approach that can use second-order feature combinations even when the data is very high-dimensional. Unfortunately, despite increasing interest in FMs, there exists to date no efficient training algorithm for higher-order FMs (HOFMs). In this paper, we present the first generic yet efficient algorithms for training arbitrary-order HOFMs. We also present new variants of HOFMs with shared parameters, which greatly reduce model size and prediction times while maintaining similar accuracy. We demonstrate the proposed approaches on four different link prediction tasks.

## 1 Introduction

Factorization machines (FMs) [13, 14] are a supervised learning approach that can use second-order feature combinations efficiently even when the data is very high-dimensional. The key idea of FMs is to model the weights of feature combinations using a low-rank matrix. This has two main benefits. First, FMs can achieve empirical accuracy on a par with polynomial regression or kernel methods, but with models that are smaller and faster to evaluate [4]. Second, FMs can infer the weights of feature combinations that were not observed in the training set. This second property is crucial, for instance, in recommender systems, a domain where FMs have become increasingly popular [14, 16]. Without the low-rank property, FMs would fail to generalize to unseen user-item interactions.

Unfortunately, although higher-order FMs (HOFMs) were briefly mentioned in the original work of [13, 14], there exists to date no efficient algorithm for training arbitrary-order HOFMs. In fact, even just computing predictions given the model parameters naively takes polynomial time in the number of features. For this reason, HOFMs have, to our knowledge, never been applied to any problem. In addition, HOFMs, as originally defined in [13, 14], model each degree in the polynomial expansion with a different matrix and therefore require the estimation of a large number of parameters.

In this paper, we propose the first efficient algorithms for training arbitrary-order HOFMs. To do so, we rely on a link between FMs and the so-called ANOVA kernel [4]. We propose linear-time dynamic programming algorithms for evaluating the ANOVA kernel and computing its gradient. Based on these, we propose stochastic gradient and coordinate descent algorithms for arbitrary-order HOFMs. To reduce the number of parameters, as well as prediction times, we also introduce two new kernels derived from the ANOVA kernel, allowing us to define new variants of HOFMs with shared parameters. We demonstrate the proposed approaches on four different link prediction tasks.

## 2 Factorization machines (FMs)

**Second-order FMs.** Factorization machines (FMs) [13, 14] are an increasingly popular method for efficiently using second-order feature combinations in classification or regression tasks, even when the data is very high-dimensional. Let $w \in \mathbb{R}^d$ and $P \in \mathbb{R}^{d \times k}$, where $k \in \mathbb{N}$ is a rank hyper-parameter. We denote the rows of $P$ by $p_j \in \mathbb{R}^k$, for $j \in [d]$, and its columns by $\bar{p}_s \in \mathbb{R}^d$, for $s \in [k]$, where $[d] := \{1, \dots, d\}$. Then, FMs predict an output $y \in \mathbb{R}$ from a vector $x = [x_1, \dots, x_d]^{\mathsf{T}}$ by

$$\hat{y}_{\mathrm{FM}}(x) := \langle w, x \rangle + \sum_{j' > j} \langle p_j, p_{j'} \rangle\, x_j x_{j'}. \tag{1}$$
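To make the definition concrete, here is a minimal Python/NumPy sketch (names of our own choosing, not code from the paper) that evaluates prediction rule (1) naively over all pairs of distinct features; it is quadratic in $d$ and only serves as a reference point for the faster evaluations discussed next.

```python
import numpy as np

def fm_predict_naive(w, P, x):
    """Second-order FM prediction, eq. (1), computed naively in O(d^2 k).

    w: (d,) linear weights; P: (d, k) factor matrix with rows p_j; x: (d,) input.
    """
    d = x.shape[0]
    y = w @ x
    for j2 in range(d):
        for j1 in range(j2):          # only pairs of distinct features, j' > j
            y += (P[j1] @ P[j2]) * x[j1] * x[j2]
    return y
```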
An important characteristic of (1) is that it considers only combinations of distinct features (i.e., the squared features $x_1^2, \dots, x_d^2$ are ignored). The main advantage of FMs compared to naive polynomial regression is that the number of parameters to estimate is $O(dk)$ instead of $O(d^2)$. In addition, we can compute predictions in $O(2dk)$ time[^1] using

$$\hat{y}_{\mathrm{FM}}(x) = w^{\mathsf{T}} x + \frac{1}{2} \sum_{s=1}^k \left[ \langle \bar{p}_s, x \rangle^2 - \langle \bar{p}_s \circ \bar{p}_s,\, x \circ x \rangle \right],$$

where $\circ$ indicates element-wise product [3]. Given a training set $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ and $y = [y_1, \dots, y_n]^{\mathsf{T}} \in \mathbb{R}^n$, $w$ and $P$ can be learned by minimizing the following non-convex objective:

$$\sum_{i=1}^n \ell\big(y_i, \hat{y}_{\mathrm{FM}}(x_i)\big) + \frac{\beta_1}{2} \|w\|^2 + \frac{\beta_2}{2} \|P\|^2, \tag{2}$$

where $\ell$ is a convex loss function and $\beta_1 > 0$, $\beta_2 > 0$ are hyper-parameters. The popular libfm library [14] implements efficient stochastic gradient and coordinate descent algorithms for obtaining a stationary point of (2). Both algorithms have a runtime complexity of $O(2dkn)$ per epoch.

**Higher-order FMs (HOFMs).** Although no training algorithm was provided, FMs were extended to higher-order feature combinations in the original work of [13, 14]. Let $P^{(t)} \in \mathbb{R}^{d \times k_t}$, where $t \in \{2, \dots, m\}$ is the order or degree of feature combinations considered, and $k_t \in \mathbb{N}$ is a rank hyper-parameter. Let $p^{(t)}_j$ be the $j$-th row of $P^{(t)}$. Then $m$-order HOFMs can be defined as

$$\hat{y}_{\mathrm{HOFM}}(x) := \langle w, x \rangle + \sum_{j' > j} \langle p^{(2)}_j, p^{(2)}_{j'} \rangle\, x_j x_{j'} + \dots + \sum_{j_m > \dots > j_1} \langle p^{(m)}_{j_1}, \dots, p^{(m)}_{j_m} \rangle\, x_{j_1} x_{j_2} \cdots x_{j_m}, \tag{3}$$

where we defined $\langle p^{(t)}_{j_1}, \dots, p^{(t)}_{j_t} \rangle := \mathrm{sum}\big(p^{(t)}_{j_1} \circ \dots \circ p^{(t)}_{j_t}\big)$ (sum of element-wise products). The objective function of HOFMs can be expressed in a similar way as (2):

$$\sum_{i=1}^n \ell\big(y_i, \hat{y}_{\mathrm{HOFM}}(x_i)\big) + \frac{\beta_1}{2} \|w\|^2 + \sum_{t=2}^m \frac{\beta_t}{2} \|P^{(t)}\|^2, \tag{4}$$

where $\beta_1, \dots, \beta_m > 0$ are hyper-parameters. To avoid the combinatorial explosion of hyper-parameter combinations to search, in our experiments we simply set $\beta_1 = \dots = \beta_m$ and $k_2 = \dots = k_m$.

While (3) looks quite daunting, [4] recently showed that FMs can be expressed from a simpler kernel perspective. Let us define the ANOVA[^2] kernel [19] of degree $2 \le m \le d$ by

$$\mathcal{A}^m(p, x) := \sum_{j_m > \dots > j_1} \prod_{t=1}^m p_{j_t} x_{j_t}. \tag{5}$$

For later convenience, we also define $\mathcal{A}^0(p, x) := 1$ and $\mathcal{A}^1(p, x) := \langle p, x \rangle$. Then it can be shown that

$$\hat{y}_{\mathrm{HOFM}}(x) = \langle w, x \rangle + \sum_{s=1}^{k_2} \mathcal{A}^2\big(\bar{p}^{(2)}_s, x\big) + \dots + \sum_{s=1}^{k_m} \mathcal{A}^m\big(\bar{p}^{(m)}_s, x\big), \tag{6}$$

where $\bar{p}^{(t)}_s$ is the $s$-th column of $P^{(t)}$. This perspective shows that we can view FMs and HOFMs as a type of kernel machine whose support vectors are learned directly from the data. Intuitively, the ANOVA kernel can be thought of as a kind of polynomial kernel that uses feature combinations without replacement (i.e., of distinct features). A key property of the ANOVA kernel is multi-linearity [4]:

$$\mathcal{A}^m(p, x) = \mathcal{A}^m(p_{\neg j}, x_{\neg j}) + p_j x_j\, \mathcal{A}^{m-1}(p_{\neg j}, x_{\neg j}), \tag{7}$$

where $p_{\neg j}$ denotes the $(d-1)$-dimensional vector with $p_j$ removed, and similarly for $x_{\neg j}$. That is, everything else kept fixed, $\mathcal{A}^m(p, x)$ is an affine function of $p_j$, $\forall j \in [d]$. Although no training algorithm was provided, [4] showed, based on (7), that the objective function of arbitrary-order HOFMs, although non-convex, is convex in $w$ and in each row of $P^{(2)}, \dots, P^{(m)}$, separately.

**Interpretability of HOFMs.** An advantage of FMs and HOFMs is their interpretability. To see why this is the case, notice that we can rewrite (3) as

$$\hat{y}_{\mathrm{HOFM}}(x) = \langle w, x \rangle + \sum_{j' > j} \mathcal{W}^{(2)}_{j,j'}\, x_j x_{j'} + \dots + \sum_{j_m > \dots > j_1} \mathcal{W}^{(m)}_{j_1, \dots, j_m}\, x_{j_1} x_{j_2} \cdots x_{j_m},$$

where we defined $\mathcal{W}^{(t)} := \sum_{s=1}^{k_t} \underbrace{\bar{p}^{(t)}_s \otimes \dots \otimes \bar{p}^{(t)}_s}_{t\ \text{times}}$. Intuitively, $\mathcal{W}^{(t)}$ is a low-rank $t$-way tensor which contains the weights of feature combinations of degree $t$. For instance, when $t = 3$, $\mathcal{W}^{(3)}_{i,j,k}$ is the weight of $x_i x_j x_k$. Similarly to the ANOVA decomposition of functions, HOFMs consider only combinations of distinct features (i.e., $x_{j_1} x_{j_2} \cdots x_{j_m}$ for $j_m > \dots > j_2 > j_1$).

[^1]: We include the constant factor for a fair comparison with arbitrary-order HOFMs later on.
[^2]: The name comes from the ANOVA decomposition of functions [20, 19].
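As an illustration of definition (5) and the multi-linearity property (7), here is a brute-force Python/NumPy sketch (our own naming) that sums over all $m$-subsets of features; it costs $O(d^m)$ and is only meant as a reference against which the linear-time algorithms of the next section can be checked.

```python
from itertools import combinations
import numpy as np

def anova_naive(p, x, m):
    """ANOVA kernel A^m(p, x) of eq. (5): sum over all m-subsets of distinct
    features of prod_t p_{j_t} x_{j_t}.  O(d^m) time; reference only."""
    d = p.shape[0]
    return sum(np.prod([p[j] * x[j] for j in subset])
               for subset in combinations(range(d), m))

# Numerical check of multi-linearity (7):
# A^m(p, x) = A^m(p_{-j}, x_{-j}) + p_j x_j A^{m-1}(p_{-j}, x_{-j})
rng = np.random.RandomState(0)
p, x, m, j = rng.randn(6), rng.randn(6), 3, 2
p_nj, x_nj = np.delete(p, j), np.delete(x, j)
lhs = anova_naive(p, x, m)
rhs = anova_naive(p_nj, x_nj, m) + p[j] * x[j] * anova_naive(p_nj, x_nj, m - 1)
assert np.isclose(lhs, rhs)
```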
**This paper.** Unfortunately, there exists to date no efficient algorithm for training arbitrary-order HOFMs. Indeed, computing (5) naively takes $O(d^m)$, i.e., polynomial time. In the following, we present linear-time algorithms. Moreover, HOFMs, as originally defined in [13, 14], require the estimation of $m - 1$ matrices $P^{(2)}, \dots, P^{(m)}$. Thus, HOFMs can produce large models when $m$ is large. To address this issue, we propose new variants of HOFMs with shared parameters.

## 3 Linear-time stochastic gradient algorithms for HOFMs

The kernel view presented in Section 2 allows us to focus on the ANOVA kernel as the main computational unit for training HOFMs. In this section, we develop dynamic programming (DP) algorithms for evaluating the ANOVA kernel and computing its gradient in only $O(dm)$ time.

**Evaluation.** The main observation (see also [18, Section 9.2]) is that we can use (7) to recursively remove features until computing the kernel becomes trivial. Let us denote a subvector of $p$ by $p_{1:j} \in \mathbb{R}^j$, and similarly for $x$. Let us introduce the shorthand $a_{j,t} := \mathcal{A}^t(p_{1:j}, x_{1:j})$. Then, from (7),

$$a_{j,t} = a_{j-1,t} + p_j x_j\, a_{j-1,t-1} \quad \forall\, d \ge j \ge t \ge 1. \tag{8}$$

For convenience, we also define $a_{j,0} = 1\ \forall j \ge 0$, since $\mathcal{A}^0(p, x) = 1$, and $a_{j,t} = 0\ \forall j < t$, since there does not exist any $t$-combination of features in a $j < t$ dimensional vector. The quantity we want to compute is $\mathcal{A}^m(p, x) = a_{d,m}$. Instead of naively using recursion (8), which would lead to many redundant computations, we use a bottom-up approach and organize computations in a DP table (Table 1). We start from the top-left corner to initialize the recursion and go through the table to arrive at the solution in the bottom-right corner. The procedure, summarized in Algorithm 1, takes $O(dm)$ time and memory.

Table 1: Example of DP table.

|          | $j = 0$  | $j = 1$   | $j = 2$   | $\dots$  | $j = d$   |
|----------|----------|-----------|-----------|----------|-----------|
| $t = 0$  | 1        | 1         | 1         | $\dots$  | 1         |
| $t = 1$  | 0        | $a_{1,1}$ | $a_{2,1}$ | $\dots$  | $a_{d,1}$ |
| $t = 2$  | 0        | 0         | $a_{2,2}$ | $\dots$  | $a_{d,2}$ |
| $\vdots$ | $\vdots$ | $\vdots$  | $\vdots$  | $\ddots$ | $\vdots$  |
| $t = m$  | 0        | 0         | 0         | $\dots$  | $a_{d,m}$ |

**Gradients.** For computing the gradient of $\mathcal{A}^m(p, x)$ w.r.t. $p$, we use reverse-mode differentiation [2] (a.k.a. backpropagation in a neural network context), since it allows us to compute the entire gradient in a single pass. We supplement each variable $a_{j,t}$ in the DP table with a so-called adjoint $\bar{a}_{j,t} := \frac{\partial a_{d,m}}{\partial a_{j,t}}$, which represents the sensitivity of $a_{d,m} = \mathcal{A}^m(p, x)$ w.r.t. $a_{j,t}$. From recursion (8), except for edge cases, $a_{j,t}$ influences $a_{j+1,t+1}$ and $a_{j+1,t}$. Using the chain rule, we then obtain

$$\bar{a}_{j,t} = \frac{\partial a_{d,m}}{\partial a_{j+1,t}} \frac{\partial a_{j+1,t}}{\partial a_{j,t}} + \frac{\partial a_{d,m}}{\partial a_{j+1,t+1}} \frac{\partial a_{j+1,t+1}}{\partial a_{j,t}} = \bar{a}_{j+1,t} + p_{j+1} x_{j+1}\, \bar{a}_{j+1,t+1} \quad \forall\, d-1 \ge j \ge t \ge 1. \tag{9}$$

Similarly, we introduce the adjoint $\tilde{p}_j := \frac{\partial a_{d,m}}{\partial p_j}$, $\forall j \in [d]$. Since $p_j$ influences $a_{j,t}$ $\forall t \in [m]$, we have

$$\tilde{p}_j = \sum_{t=1}^m \bar{a}_{j,t}\, a_{j-1,t-1}\, x_j.$$

We can run recursion (9) in reverse order of the DP table, starting from $\bar{a}_{d,m} = \frac{\partial a_{d,m}}{\partial a_{d,m}} = 1$. Using this approach, we can compute the entire gradient $\nabla \mathcal{A}^m(p, x) = [\tilde{p}_1, \dots, \tilde{p}_d]^{\mathsf{T}}$ w.r.t. $p$ in $O(dm)$ time and memory. The procedure is summarized in Algorithm 2.

**Algorithm 1** Evaluating $\mathcal{A}^m(p, x)$ in $O(dm)$
- Input: $p \in \mathbb{R}^d$, $x \in \mathbb{R}^d$
- $a_{j,t} \leftarrow 0 \quad \forall t \in [m],\ j \in [d] \cup \{0\}$; $\quad a_{j,0} \leftarrow 1 \quad \forall j \in [d] \cup \{0\}$
- for $t := 1, \dots, m$: for $j := t, \dots, d$: $\ a_{j,t} \leftarrow a_{j-1,t} + p_j x_j\, a_{j-1,t-1}$
- Output: $\mathcal{A}^m(p, x) = a_{d,m}$

**Algorithm 2** Computing $\nabla \mathcal{A}^m(p, x)$ in $O(dm)$
- Input: $p \in \mathbb{R}^d$, $x \in \mathbb{R}^d$, $\{a_{j,t}\}_{j,t=0}^{d,m}$
- $\bar{a}_{j,t} \leftarrow 0 \quad \forall t \in [m+1],\ j \in [d]$; $\quad \bar{a}_{d,m} \leftarrow 1$
- for $t := m, \dots, 1$: for $j := d-1, \dots, t$: $\ \bar{a}_{j,t} \leftarrow \bar{a}_{j+1,t} + \bar{a}_{j+1,t+1}\, p_{j+1} x_{j+1}$
- $\tilde{p}_j := \sum_{t=1}^m \bar{a}_{j,t}\, a_{j-1,t-1}\, x_j \quad \forall j \in [d]$
- Output: $\nabla \mathcal{A}^m(p, x) = [\tilde{p}_1, \dots, \tilde{p}_d]^{\mathsf{T}}$
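Below is a sketch of how Algorithms 1 and 2 might look in NumPy (our reading of the pseudocode above, with 0-based arrays so that `p[j]` stores the paper's $p_{j+1}$); it can be checked against the brute-force `anova_naive` sketched earlier.

```python
import numpy as np

def anova_dp(p, x, m):
    """Algorithm 1: evaluate A^m(p, x) in O(dm) time via the DP table of
    Table 1.  Returns the full table a, with a[t, j] = A^t(p_{1:j}, x_{1:j})."""
    d = p.shape[0]
    a = np.zeros((m + 1, d + 1))
    a[0, :] = 1.0                          # A^0 = 1 for every prefix
    for t in range(1, m + 1):
        for j in range(t, d + 1):          # a[t, j] = 0 whenever j < t
            a[t, j] = a[t, j - 1] + p[j - 1] * x[j - 1] * a[t - 1, j - 1]
    return a                               # A^m(p, x) = a[m, d]

def anova_grad(p, x, m, a):
    """Algorithm 2: gradient of A^m(p, x) w.r.t. p in O(dm), by reverse-mode
    differentiation of (8).  `a` is the table returned by anova_dp."""
    d = p.shape[0]
    abar = np.zeros((m + 2, d + 1))        # adjoints; the extra row stays zero
    abar[m, d] = 1.0
    for t in range(m, 0, -1):
        for j in range(d - 1, t - 1, -1):  # recursion (9)
            abar[t, j] = abar[t, j + 1] + abar[t + 1, j + 1] * p[j] * x[j]
    # gradient entries: \tilde p_j = sum_t \bar a_{j,t} a_{j-1,t-1} x_j
    return np.array([sum(abar[t, j + 1] * a[t - 1, j] * x[j]
                         for t in range(1, m + 1)) for j in range(d)])
```

For example, `a = anova_dp(p, x, m)` gives the kernel value as `a[m, -1]`, and `anova_grad(p, x, m, a)` returns the full gradient in a second $O(dm)$ pass.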
**Stochastic gradient (SG) algorithms.** Based on Algorithms 1 and 2, we can easily learn arbitrary-order HOFMs using any gradient-based optimization algorithm. Here we focus our discussion on SG algorithms. If we alternatingly minimize (4) w.r.t. $P^{(2)}, \dots, P^{(m)}$, then the sub-problem associated with degree $m$ is of the form

$$\min_{P \in \mathbb{R}^{d \times k}}\ \sum_{i=1}^n \ell\Big(y_i,\ \sum_{s=1}^k \mathcal{A}^m(\bar{p}_s, x_i) + o_i\Big) + \frac{\beta}{2} \|P\|^2, \tag{10}$$

where $o_1, \dots, o_n \in \mathbb{R}$ are fixed offsets which account for the contribution to the predictions of degrees other than $m$. The sub-problem is convex in each row of $P$ [4]. An SG update of (10) w.r.t. $\bar{p}_s$ for some instance $x_i$ can be computed by

$$\bar{p}_s \leftarrow \bar{p}_s - \eta\, \ell'(y_i, \hat{y}_i)\, \nabla \mathcal{A}^m(\bar{p}_s, x_i) - \eta \beta\, \bar{p}_s,$$

where $\eta$ is a learning rate and we defined $\hat{y}_i := \sum_{s=1}^k \mathcal{A}^m(\bar{p}_s, x_i) + o_i$. Because evaluating $\mathcal{A}^m(p, x)$ and computing its gradient both take $O(dm)$, the cost per epoch, i.e., of visiting all instances, is $O(mdkn)$. When $m = 2$, this is the same cost as the SG algorithm implemented in libfm.

**Sparse data.** We conclude this section with a few useful remarks on sparse data. Let us denote the support of a vector $x = [x_1, \dots, x_d]^{\mathsf{T}}$ by $\mathrm{supp}(x) := \{j \in [d] \colon x_j \neq 0\}$ and let us define $x_S := [x_j \colon j \in S]^{\mathsf{T}}$. It is easy to see from (7) that the gradient and $x$ have the same support, i.e., $\mathrm{supp}(\nabla \mathcal{A}^m(p, x)) = \mathrm{supp}(x)$. Another useful remark is that $\mathcal{A}^m(p, x) = \mathcal{A}^m(p_{\mathrm{supp}(x)}, x_{\mathrm{supp}(x)})$, provided that $m \le \mathrm{nz}(x)$, where $\mathrm{nz}(x)$ is the number of non-zero elements in $x$. Hence, when the data is sparse, we only need to iterate over non-zero features in Algorithms 1 and 2. Consequently, their time and memory cost is only $O(\mathrm{nz}(x)\, m)$, and thus the cost per epoch of SG algorithms is $O(mk\, \mathrm{nz}(X))$.

## 4 Coordinate descent algorithm for arbitrary-order HOFMs

We now describe a coordinate descent (CD) solver for arbitrary-order HOFMs. CD is a good choice for learning HOFMs because their objective function is coordinate-wise convex, thanks to the multi-linearity of the ANOVA kernel [4]. Our algorithm can be seen as a generalization to higher orders of the CD algorithms proposed in [14, 4].

**An alternative recursion.** Efficient CD implementations typically require maintaining statistics for each training instance, such as the predictions at the current iteration. When a coordinate is updated, the statistics then need to be synchronized. Unfortunately, the recursion we used in the previous section is not suitable for a CD algorithm, because it would require storing and synchronizing the DP table for each training instance upon coordinate-wise updates. We therefore turn to an alternative recursion:

$$\mathcal{A}^m(p, x) = \frac{1}{m} \sum_{t=1}^m (-1)^{t+1}\, \mathcal{A}^{m-t}(p, x)\, D_t(p, x), \tag{11}$$

where we defined $D_t(p, x) := \sum_{j=1}^d (p_j x_j)^t$. Note that this recursion was already known in the context of traditional kernel methods (c.f. [19, Section 11.8]), but its application to HOFMs is novel. Since we know that $\mathcal{A}^0(p, x) = 1$ and $\mathcal{A}^1(p, x) = \langle p, x \rangle$, we can use (11) to compute $\mathcal{A}^2(p, x)$, then $\mathcal{A}^3(p, x)$, and so on. The overall evaluation cost for arbitrary $m \in \mathbb{N}$ is $O(md + m^2)$.

**Coordinate-wise derivatives.** We can apply reverse-mode differentiation to recursion (11) in order to compute the entire gradient (c.f. Appendix C). However, in CD, since we only need the derivative of one variable at a time, we can simply use forward-mode differentiation:

$$\frac{\partial \mathcal{A}^m(p, x)}{\partial p_j} = \frac{1}{m} \sum_{t=1}^m (-1)^{t+1} \left[ \frac{\partial \mathcal{A}^{m-t}(p, x)}{\partial p_j}\, D_t(p, x) + \mathcal{A}^{m-t}(p, x)\, \frac{\partial D_t(p, x)}{\partial p_j} \right], \tag{12}$$

where $\frac{\partial D_t(p, x)}{\partial p_j} = t\, p_j^{t-1} x_j^t$. The advantage of (12) is that we only need to cache $D_t(p, x)$ for $t \in [m]$. Hence, the memory complexity per sample is only $O(m)$, instead of $O(dm)$ for (8).
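A possible NumPy rendering of recursion (11) and of the forward-mode derivative (12) is sketched below (again with names of our own choosing; in an actual CD implementation the $D_t$ would be cached per training instance rather than recomputed).

```python
import numpy as np

def anova_all_degrees(p, x, m):
    """Recursion (11): compute A^0, ..., A^m from the power sums
    D_t(p, x) = sum_j (p_j x_j)^t, in O(md + m^2) time and O(m) extra memory."""
    px = p * x
    D = [None] + [np.sum(px ** t) for t in range(1, m + 1)]    # D[t], t >= 1
    A = [1.0] + [0.0] * m                                      # A[0] = 1
    for t in range(1, m + 1):
        A[t] = sum((-1) ** (s + 1) * A[t - s] * D[s]
                   for s in range(1, t + 1)) / t
    return A, D

def anova_deriv_pj(p, x, m, j, A, D):
    """Forward-mode derivative (12) of A^m w.r.t. the single coordinate p_j,
    given the cached kernel values A and power sums D from anova_all_degrees."""
    dA = [0.0] * (m + 1)                                       # dA[t] = dA^t/dp_j
    for t in range(1, m + 1):
        dA[t] = sum((-1) ** (s + 1) * (dA[t - s] * D[s]
                    + A[t - s] * s * p[j] ** (s - 1) * x[j] ** s)
                    for s in range(1, t + 1)) / t
    return dA[m]
```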
**Use in a CD algorithm.** Similarly to [4], we assume that the loss function $\ell$ is $\mu$-smooth and update the elements $p_{j,s}$ of $P$ in cyclic order by

$$p_{j,s} \leftarrow p_{j,s} - \eta_{j,s}^{-1}\, \frac{\partial F(P)}{\partial p_{j,s}},$$

where $F(P)$ denotes the objective in (10) and where we defined

$$\eta_{j,s} := \mu \sum_{i=1}^n \left( \frac{\partial \mathcal{A}^m(\bar{p}_s, x_i)}{\partial p_{j,s}} \right)^2 + \beta \quad \text{and} \quad \frac{\partial F(P)}{\partial p_{j,s}} = \sum_{i=1}^n \ell'(y_i, \hat{y}_i)\, \frac{\partial \mathcal{A}^m(\bar{p}_s, x_i)}{\partial p_{j,s}} + \beta\, p_{j,s}.$$

The update guarantees that the objective value is monotonically non-increasing, and it is the exact coordinate-wise minimizer when $\ell$ is the squared loss. Overall, the total cost per epoch, i.e., of updating all coordinates once, is $O(\tau(m)\, k\, \mathrm{nz}(X))$, where $\tau(m)$ is the time it takes to compute (12). Assuming the $D_t(\bar{p}_s, x_i)$ have been previously cached for $t \in [m]$, computing (12) takes on the order of $\tau(m) = m(m+1)/2$ operations. For fixed $m$, if we unroll the two loops needed to compute (12), modern compilers can often further reduce the number of operations needed. Nevertheless, this quadratic dependency on $m$ means that our CD algorithm is best suited for small $m$, typically $m \le 4$.

## 5 HOFMs with shared parameters

HOFMs, as originally defined in [13, 14], model each degree with separate matrices $P^{(2)}, \dots, P^{(m)}$. Assuming that we use the same rank $k$ for all matrices, the total model size of $m$-order HOFMs is therefore $O(kdm)$. Moreover, even when using our $O(dm)$ DP algorithm, the cost of computing predictions is $O(k(2d + \dots + md)) = O(kdm^2)$. Hence, HOFMs tend to produce large, expensive-to-evaluate models. To reduce model size and prediction times, we introduce two new kernels which allow us to share parameters between each degree: the inhomogeneous ANOVA kernel and the all-subsets kernel. Because both kernels are derived from the ANOVA kernel, they share the same appealing properties: multi-linearity, sparse gradients, and sparse-data friendliness.

### 5.1 Inhomogeneous ANOVA kernel

It is well known that a sum of kernels is equivalent to concatenating their associated feature maps [18, Section 3.4]. Let $\theta = [\theta_1, \dots, \theta_m]^{\mathsf{T}}$. To combine different degrees, a natural kernel is therefore

$$\mathcal{A}^{1 \sim m}(p, x; \theta) := \sum_{t=1}^m \theta_t\, \mathcal{A}^t(p, x). \tag{13}$$

The kernel uses all feature combinations of degrees 1 up to $m$. We call it the inhomogeneous ANOVA kernel, since it is an inhomogeneous polynomial of $x$. In contrast, $\mathcal{A}^m(p, x)$ is homogeneous. The main difference between (13) and (6) is that all ANOVA kernels in the sum share the same parameters. However, to increase modeling power, we allow each kernel to have a different weight $\theta_1, \dots, \theta_m$.

**Evaluation.** Due to the recursive nature of Algorithm 1, when computing $\mathcal{A}^m(p, x)$, we also get $\mathcal{A}^1(p, x), \dots, \mathcal{A}^{m-1}(p, x)$ for free. Indeed, the lower-degree kernels are available in the last column of the DP table, i.e., $\mathcal{A}^t(p, x) = a_{d,t}$ $\forall t \in [m]$. Hence, the cost of evaluating (13) is $O(dm)$ time. The total cost of computing $\hat{y} = \sum_{s=1}^k \mathcal{A}^{1 \sim m}(\bar{p}_s, x; \theta)$ is $O(kdm)$, instead of $O(kdm^2)$ for $\hat{y}_{\mathrm{HOFM}}(x)$.
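For instance, reusing the `anova_dp` helper sketched in Section 3 (a convention of our sketches, not code from the paper), kernel (13) can be evaluated in a single DP pass:

```python
def inhomogeneous_anova(p, x, theta):
    """Kernel (13): sum_t theta_t A^t(p, x), with theta = [theta_1, ..., theta_m].
    One O(dm) DP pass yields every A^t(p, x) = a[t, d] in the last column."""
    m = len(theta)
    a = anova_dp(p, x, m)                  # DP table from the earlier sketch
    return sum(theta[t - 1] * a[t, -1] for t in range(1, m + 1))
```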
**Learning.** While it is certainly possible to learn $P$ and $\theta$ by directly minimizing some objective function, here we propose an easier solution, which works well in practice. Our key observation is that we can easily turn $\mathcal{A}^m$ into $\mathcal{A}^{1 \sim m}$ by adding dummy values to feature vectors. Let us denote the concatenation of $p$ with a scalar $\gamma$ by $[\gamma, p]$, and similarly for $x$. From (7), we easily obtain

$$\mathcal{A}^m([\gamma_1, p], [1, x]) = \mathcal{A}^m(p, x) + \gamma_1\, \mathcal{A}^{m-1}(p, x).$$

Similarly, if we apply (7) twice, we obtain

$$\mathcal{A}^m([\gamma_1, \gamma_2, p], [1, 1, x]) = \mathcal{A}^m(p, x) + (\gamma_1 + \gamma_2)\, \mathcal{A}^{m-1}(p, x) + \gamma_1 \gamma_2\, \mathcal{A}^{m-2}(p, x).$$

Applying the above to $m = 2$ and $m = 3$, we obtain $\mathcal{A}^2([\gamma_1, p], [1, x]) = \mathcal{A}^{1 \sim 2}(p, x; [\gamma_1, 1])$ and $\mathcal{A}^3([\gamma_1, \gamma_2, p], [1, 1, x]) = \mathcal{A}^{1 \sim 3}(p, x; [\gamma_1 \gamma_2, \gamma_1 + \gamma_2, 1])$. More generally, by adding $m - 1$ dummy features to $p$ and $x$, we can convert $\mathcal{A}^m$ to $\mathcal{A}^{1 \sim m}$. Because $p$ is learned, this means that we can automatically learn $\gamma_1, \dots, \gamma_{m-1}$. These weights can then be converted to $\theta_1, \dots, \theta_m$ by unrolling recursion (7). Although simple, we show in our experiments that this approach compares favorably with directly learning $P$ and $\theta$. The main advantage of this approach is that we can use the same software unmodified (we simply need to minimize (10) with the augmented data). Moreover, the cost of computing the entire gradient by Algorithm 2 using the augmented data is just $O(dm + m^2)$, compared to $O(dm^2)$ for HOFMs with separate parameters.

### 5.2 All-subsets kernel

We now consider a closely related kernel, called the all-subsets kernel [18, Definition 9.5]:

$$\mathcal{S}(p, x) := \prod_{j=1}^d (1 + p_j x_j).$$

The main difference with the traditional use of this kernel is that we learn $p$. Interestingly, it can be shown that $\mathcal{S}(p, x) = 1 + \mathcal{A}^{1 \sim d}(p, x; \mathbf{1}) = 1 + \mathcal{A}^{1 \sim \mathrm{nz}(x)}(p, x; \mathbf{1})$, where $\mathrm{nz}(x)$ is the number of non-zero features in $x$. Hence, the kernel uses all combinations of distinct features up to order $\mathrm{nz}(x)$, with uniform weights. Even if $d$ is very large, the kernel can be a good choice if each training instance contains only a few non-zero elements. To learn the parameters, we simply substitute $\mathcal{A}^m$ with $\mathcal{S}$ in (10). In SG or CD algorithms, all this entails is substituting $\nabla \mathcal{A}^m(p, x)$ with $\nabla \mathcal{S}(p, x)$. For computing $\nabla \mathcal{S}(p, x)$, it is easy to verify that $\mathcal{S}(p, x) = \mathcal{S}(p_{\neg j}, x_{\neg j})(1 + p_j x_j)$ $\forall j \in [d]$, and therefore we have

$$\nabla \mathcal{S}(p, x) = \big[ x_1\, \mathcal{S}(p_{\neg 1}, x_{\neg 1}), \dots, x_d\, \mathcal{S}(p_{\neg d}, x_{\neg d}) \big]^{\mathsf{T}} = \left[ \frac{x_1\, \mathcal{S}(p, x)}{1 + p_1 x_1}, \dots, \frac{x_d\, \mathcal{S}(p, x)}{1 + p_d x_d} \right]^{\mathsf{T}}.$$

Therefore, the main advantage of the all-subsets kernel is that we can evaluate it and compute its gradient in just $O(d)$ time. The total cost of computing $\hat{y} = \sum_{s=1}^k \mathcal{S}(\bar{p}_s, x)$ is only $O(kd)$.
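A minimal sketch of the all-subsets kernel and its gradient, following the closed form above (it uses the right-hand expression and therefore assumes $1 + p_j x_j \neq 0$ for every $j$; otherwise one would fall back on the leave-one-out products):

```python
import numpy as np

def all_subsets_kernel(p, x):
    """All-subsets kernel S(p, x) = prod_j (1 + p_j x_j) and its gradient
    w.r.t. p, both in O(d) time."""
    factors = 1.0 + p * x
    S = np.prod(factors)
    grad = x * S / factors    # dS/dp_j = x_j * prod_{j' != j} (1 + p_{j'} x_{j'})
    return S, grad
```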
## 6 Experimental results

### 6.1 Application to link prediction

**Problem setting.** We now demonstrate a novel application of HOFMs to predicting the presence or absence of links between nodes in a graph. Formally, we assume two sets of possibly disjoint nodes of size $n_A$ and $n_B$, respectively. We assume features for the two sets of nodes, represented by matrices $A \in \mathbb{R}^{d_A \times n_A}$ and $B \in \mathbb{R}^{d_B \times n_B}$. For instance, $A$ can represent user features and $B$ movie features. We denote the columns of $A$ and $B$ by $a_i$ and $b_j$, respectively. We are given a matrix $Y \in \{0, 1\}^{n_A \times n_B}$, whose elements indicate presence (positive sample) or absence (negative sample) of a link between two nodes $a_i$ and $b_j$. We denote the number of positive samples by $n_+$. Using this data, our goal is to predict new associations. Datasets used in our experiments are summarized in Table 2. Note that for the NIPS and Enzyme datasets, $A = B$.

Table 2: Datasets used in our experiments. For a detailed description, c.f. Appendix A.

| Dataset | $n_+$ | Columns of $A$ | $n_A$ | $d_A$ | Columns of $B$ | $n_B$ | $d_B$ |
|---|---|---|---|---|---|---|---|
| NIPS [17] | 4,140 | Authors | 2,037 | 13,649 | | | |
| Enzyme [21] | 2,994 | Enzymes | 668 | 325 | | | |
| GD [10] | 3,954 | Diseases | 3,209 | 3,209 | Genes | 12,331 | 25,275 |
| Movielens 100K [6] | 21,201 | Users | 943 | 49 | Movies | 1,682 | 29 |

**Conversion to a supervised problem.** We need to convert the above information to a format FMs and HOFMs can handle. To predict an element $y_{i,j}$ of $Y$, we simply form $x_{i,j}$ as the concatenation of $a_i$ and $b_j$ and feed it to a HOFM in order to compute a prediction $\hat{y}_{i,j}$. Because HOFMs use feature combinations in $x_{i,j}$, they can learn the weights of feature combinations between $a_i$ and $b_j$. At training time, we need both positive and negative samples. Let us denote the set of positive and negative samples by $\Omega$. Then our training set is composed of $(x_{i,j}, y_{i,j})$ pairs, where $(i, j) \in \Omega$.
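A sketch of this conversion, with hypothetical names, assuming $A$ and $B$ store node features column-wise as above:

```python
import numpy as np

def make_pair_example(A, B, Y, i, j):
    """Build one training example for the pair (i, j): the feature vector is
    the concatenation of columns a_i and b_j, the label is Y[i, j]."""
    x_ij = np.concatenate([A[:, i], B[:, j]])
    return x_ij, Y[i, j]

# The training set is {make_pair_example(A, B, Y, i, j) : (i, j) in Omega},
# where Omega contains the sampled positive and negative pairs.
```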
**Models compared.**

- **HOFM**: $\hat{y}_{i,j} = \hat{y}_{\mathrm{HOFM}}(x_{i,j})$ as defined in (3) and as originally proposed in [13, 14]. We minimize (4) by alternating minimization of (10) for each degree.
- **HOFM-shared**: $\hat{y}_{i,j} = \sum_{s=1}^k \mathcal{A}^{1 \sim m}(\bar{p}_s, x_{i,j}; \theta)$. We learn $P$ and $\theta$ using the simple augmented-data approach described in Section 5.1 (HOFM-shared-augmented). Inspired by SimpleMKL [12], we also report results when learning $P$ and $\theta$ directly by minimizing $\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \ell(y_{i,j}, \hat{y}_{i,j}) + \frac{\beta}{2} \|P\|^2$ subject to $\theta \ge 0$ and $\langle \theta, \mathbf{1} \rangle = 1$ (HOFM-shared-simplex).
- **All-subsets**: $\hat{y}_{i,j} = \sum_{s=1}^k \mathcal{S}(\bar{p}_s, x_{i,j})$. As explained in Section 5.2, this model is equivalent to the HOFM-shared model with $m = \mathrm{nz}(x_{i,j})$ and $\theta = \mathbf{1}$.
- **Polynomial network**: $\hat{y}_{i,j} = \sum_{s=1}^k (\gamma_s + \langle \bar{p}_s, x_{i,j} \rangle)^m$. This model can be thought of as a factorization machine variant that uses a polynomial kernel instead of the ANOVA kernel (c.f. [8, 4, 22]).
- **Low-rank bilinear regression**: $\hat{y}_{i,j} = a_i^{\mathsf{T}} U V^{\mathsf{T}} b_j$, where $U \in \mathbb{R}^{d_A \times k}$ and $V \in \mathbb{R}^{d_B \times k}$. Such a model was shown to work well for link prediction in [9] and [10]. We learn $U$ and $V$ by minimizing $\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \ell(y_{i,j}, \hat{y}_{i,j}) + \frac{\beta}{2} (\|U\|^2 + \|V\|^2)$.

**Experimental setup and evaluation.** In this experiment, for all models above, we use CD rather than SG to avoid tuning a learning-rate hyper-parameter. We set $\ell$ to be the squared loss. Although we omit it from our notation for clarity, we also fit a bias term for all models. We evaluated the compared models using the area under the ROC curve (AUC), which is the probability that the model correctly ranks a positive sample higher than a negative sample. We split the $n_+$ positive samples into 50% for training and 50% for testing. We sample the same number of negative samples as positive samples for training and use the rest for testing. We chose $\beta$ from $\{10^{-6}, 10^{-5}, \dots, 10^6\}$ by cross-validation and, following [9], we empirically set $k = 30$. Throughout our experiments, we initialized the elements of $P$ randomly from $\mathcal{N}(0, 0.01)$.

Table 3: Comparison of area under the ROC curve (AUC) as measured on the test sets.

| Model | NIPS | Enzyme | GD | Movielens 100K |
|---|---|---|---|---|
| HOFM (m = 2) | 0.856 | 0.880 | 0.717 | 0.778 |
| HOFM (m = 3) | 0.875 | 0.888 | 0.717 | 0.786 |
| HOFM (m = 4) | 0.874 | 0.887 | 0.717 | 0.786 |
| HOFM (m = 5) | 0.874 | 0.887 | 0.717 | 0.786 |
| HOFM-shared-augmented (m = 2) | 0.858 | 0.876 | 0.704 | 0.778 |
| HOFM-shared-augmented (m = 3) | 0.874 | 0.887 | 0.704 | 0.787 |
| HOFM-shared-augmented (m = 4) | 0.836 | 0.824 | 0.663 | 0.779 |
| HOFM-shared-augmented (m = 5) | 0.824 | 0.795 | 0.600 | 0.621 |
| HOFM-shared-simplex (m = 2) | 0.716 | 0.865 | 0.721 | 0.701 |
| HOFM-shared-simplex (m = 3) | 0.777 | 0.870 | 0.721 | 0.709 |
| HOFM-shared-simplex (m = 4) | 0.758 | 0.870 | 0.721 | 0.709 |
| HOFM-shared-simplex (m = 5) | 0.722 | 0.869 | 0.721 | 0.709 |
| All-subsets | 0.730 | 0.840 | 0.721 | 0.714 |
| Polynomial network (m = 2) | 0.725 | 0.879 | 0.721 | 0.761 |
| Polynomial network (m = 3) | 0.789 | 0.853 | 0.719 | 0.696 |
| Polynomial network (m = 4) | 0.782 | 0.873 | 0.717 | 0.708 |
| Polynomial network (m = 5) | 0.543 | 0.524 | 0.648 | 0.501 |
| Low-rank bilinear regression | 0.855 | 0.694 | 0.611 | 0.718 |

Results are indicated in Table 3. Overall, the two best models were HOFM and HOFM-shared-augmented, which achieved the best scores on 3 out of 4 datasets. The two models outperformed low-rank bilinear regression on 3 out of 4 datasets, showing the benefit of using higher-order feature combinations. HOFM-shared-augmented achieved similar accuracy to HOFM, despite using a smaller model. Surprisingly, HOFM-shared-simplex did not improve over HOFM-shared-augmented except on the GD dataset. We conclude that our augmented-data approach is convenient yet works well in practice. All-subsets and polynomial networks performed worse than HOFM and HOFM-shared-augmented, except on the GD dataset, where they were the best. Finally, we observe that HOFMs were quite robust to increasing $m$, which is likely a benefit of modeling each degree with a separate matrix.

### 6.2 Solver comparison

We compared AdaGrad [5], L-BFGS, and coordinate descent (CD) for minimizing (10) when varying the degree $m$ on the NIPS dataset with $\beta = 0.1$ and $k = 30$. We constructed the data in the same way as explained in the previous section and added $m - 1$ dummy features, resulting in n = 8,280 sparse samples of dimension d = 27,298 + m - 1. For AdaGrad and L-BFGS, we computed the (stochastic) gradients using Algorithm 2. All solvers used the same initialization. Results are indicated in Figure 1.

*Figure 1 (not reproduced here): solver comparison for minimizing (10) when varying the degree $m$ on the NIPS dataset with $\beta = 0.1$ and $k = 30$; panels (a)-(c) show convergence for $m = 2, 3, 4$ and panel (d) shows scalability w.r.t. the degree $m$. Results on other datasets are in Appendix B.*

We see that our CD algorithm performs very well when $m \le 3$ but starts to deteriorate when $m \ge 4$, in which case L-BFGS becomes advantageous. As shown in Figure 1 (d), the cost per epoch of AdaGrad and L-BFGS scales linearly with $m$, a benefit of our DP algorithm for computing the gradient. However, to our surprise, we found that AdaGrad is quite sensitive to the learning rate $\eta$. AdaGrad diverged for $\eta \in \{1, 0.1, 0.01\}$, and the largest value that worked well was $\eta = 0.001$. This explains why AdaGrad did not outperform CD despite its lower cost per epoch. In the future, it would be useful to develop a CD algorithm with a better dependency on $m$.

## 7 Conclusion and future directions

In this paper, we presented the first training algorithms for HOFMs and introduced new HOFM variants with shared parameters. A popular way to deal with a large number of negative samples is to use an objective function that directly maximizes AUC [9, 15]. This is especially easy to do with SG algorithms, because we can sample pairs of positive and negative samples from the dataset upon each SG update. We therefore expect the algorithms developed in Section 3 to be especially useful in this setting. Recently, [7] proposed a distributed SG algorithm for training second-order FMs. It should be straightforward to extend this algorithm to HOFMs based on our contributions in Section 3. Finally, it should be possible to integrate Algorithms 1 and 2 into a deep learning framework such as TensorFlow [1], in order to easily compose ANOVA kernels with other layers (e.g., convolutional).
## References

[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[2] A. G. Baydin, B. A. Pearlmutter, and A. A. Radul. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.
[3] M. Blondel, A. Fujino, and N. Ueda. Convex factorization machines. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2015.
[4] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[6] GroupLens. http://grouplens.org/datasets/movielens/, 1998.
[7] M. Li, Z. Liu, A. Smola, and Y.-X. Wang. DiFacto: Distributed factorization machines. In Proceedings of the International Conference on Web Search and Data Mining (WSDM), 2016.
[8] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855-863, 2014.
[9] A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437-452, 2011.
[10] N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene-disease associations. Bioinformatics, 30(12):i60-i68, 2014.
[11] V. Y. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer-Verlag New York, 2001.
[12] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491-2521, 2008.
[13] S. Rendle. Factorization machines. In Proceedings of the International Conference on Data Mining, pages 995-1000. IEEE, 2010.
[14] S. Rendle. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57-78, 2012.
[15] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452-461, 2009.
[16] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In SIGIR, pages 635-644, 2011.
[17] S. Roweis. http://www.cs.nyu.edu/~roweis/data.html, 2002.
[18] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[19] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[20] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.
[21] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468-i477, 2005.
[22] J. Yang and A. Gittens. Tensor machines for learning target-specific polynomial features. arXiv preprint arXiv:1504.01697, 2015.