RaFM: Rank-Aware Factorization Machines

Xiaoshuang Chen 1, Yin Zheng 2, Jiaxing Wang 3, Wenye Ma 2, Junzhou Huang 2

1 Department of Electrical Engineering, Tsinghua University, Beijing, China. 2 Tencent AI Lab, Shenzhen, China. 3 Institute of Automation, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, Beijing, China. Correspondence to: Yin Zheng.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Factorization machines (FM) are a popular model class that learns pairwise feature interactions through a low-rank approximation. Unlike existing FM-based approaches, which use a fixed rank for all features, this paper proposes a Rank-Aware FM (RaFM) model that adopts pairwise interactions from embeddings with different ranks. The proposed model achieves better performance on real-world datasets in which different features have significantly varying frequencies of occurrence. Moreover, we prove that RaFM can be stored, evaluated, and trained as efficiently as a single FM, and under reasonable conditions it can even be significantly more efficient than FM. RaFM improves the performance of FMs on both regression and classification tasks while incurring less computational burden, and therefore also has attractive potential in industrial applications.

1. Introduction

Factorization machines (FM) (Rendle, 2010; 2012) are one of the most popular models for leveraging interactions between features. FM models nested feature interactions via a factorized parametrization and has been successful in many sparse prediction tasks, such as recommender systems and click-through rate (CTR) prediction (Juan et al., 2016). Recently there have been many follow-up works, such as higher-order FMs (Blondel et al., 2016), convex FMs (Blondel et al., 2015), neural-network-based FMs (He & Chua, 2017; Guo et al., 2017), and locally linear FMs (Liu et al., 2017).

Most existing approaches allocate embedding vectors of a fixed rank to every feature (Liu et al., 2017; Guo et al., 2017; Zheng et al., 2016b;a; Du et al., 2018; Lauly et al., 2017; Jiang et al., 2017). However, the frequencies with which different features occur in real-world datasets vary a lot. Fig. 1 shows the numbers of occurrences of different features in two public datasets, MovieLens Tag (https://grouplens.org/datasets/movielens) and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction/data). A large number of features occur only rarely in the dataset, while only a few features occur frequently. Fixing the rank of an FM may therefore lead to overfitting for features with few occurrences and underfitting for features with many occurrences.

[Figure 1. Occurrences of different features in (a) ML Tag and (b) Avazu, bucketed by occurrence count (1-2, 3-4, 5-8, ..., >128). Example: in ML Tag, more than 40,000 features occur only once or twice, while fewer than 5,000 features occur more than 128 times.]

There is also research that allocates different ranks to different features (Li et al., 2016; 2017; Juan et al., 2016). However, training and storing multiple embeddings leads to an unaffordable computational burden, and the large number of parameters may even aggravate the overfitting problem (Li et al., 2017).
To this end, this paper proposes a Rank-Aware FM (RaFM) model, which maintains embedding vectors with different ranks for each feature, making it possible to compute each pairwise interaction with embedding vectors of an appropriate rank. A key contribution of this paper is to prove that although RaFM maintains multiple embeddings with different ranks per feature, it can be stored, evaluated, and trained as efficiently as, or even more efficiently than, a single FM with a fixed rank. Specifically, we analyze the time and space complexity of RaFM and show that there are many inactive factors which need not be stored or evaluated. Therefore, both the computational time and the storage can be significantly reduced, and the overall computational burden can even be smaller than that of a single FM under reasonable conditions. Furthermore, we provide an algorithm that trains all embedding vectors of RaFM in one concise model, in which the inactive factors are neither stored nor trained, and we prove the convergence rate and the performance bound of the training algorithm. Thus, the computational burden of the training process can also be effectively reduced.

Experiments show the effectiveness of RaFM on both public datasets and datasets from industrial applications. The proposed RaFM model improves the performance of FMs while incurring a comparable or even smaller computational burden, and therefore also has attractive potential in industrial applications.

2. Problem Formulation

2.1. General Interaction Form of FMs

Here we provide a general interaction form of the FM model:

    ŷ(x) = w_0 + Σ_{i ∈ F} w_i x_i + Σ_{i,j ∈ F, i<j} ⟨V_i, V_j⟩ x_i x_j        (1)

where F is the feature set, V_i denotes the embedding(s) of the i-th feature, and ⟨V_i, V_j⟩ denotes the pairwise interaction between the i-th and j-th features. The standard FM allocates a single embedding v_i ∈ R^D to each feature and sets

    ⟨V_i, V_j⟩_FM = v_i^T v_j.        (2)

RaFM instead maintains embeddings at m rank levels D_1 < D_2 < ⋯ < D_m, allocates to the i-th feature the embeddings

    V_i = {v_i^{(1)}, ..., v_i^{(m)}},  v_i^{(k)} ∈ R^{D_k},        (3)

and assigns to it a rank level k_i ∈ {1, ..., m}. If k_i is chosen too large, the rank of v_i^{(k_i)} will be too large compared to the occurrences of the i-th feature, which will lead to overfitting. k_i can therefore be chosen according to the number of occurrences of the i-th feature, and in this paper we regard the k_i as hyperparameters. Although boosting methods (Li et al., 2018) might be used to select appropriate k_i, we leave this for future work.

Here we emphasize that the multiple embeddings in V_i are not independent in the RaFM model. Intuitively, v_i^{(1)} can be regarded as a projection of v_i^{(2)} from the D_2-dimensional space to the D_1-dimensional space. We will return to this point in Section 4, which provides an efficient training problem that maintains this property.

The pairwise interaction is computed as

    ⟨V_i, V_j⟩_RaFM = (v_i^{(k_ij)})^T v_j^{(k_ij)},        (4)
    where k_ij = min(k_i, k_j),        (5)

which means that the dimensionality used to compute the pairwise interaction between the i-th and j-th features is the largest dimensionality shared by the embeddings of the two features. Therefore, the computation of ⟨V_i, V_j⟩ achieves a good trade-off between overfitting and underfitting, which is the intuition for why RaFM outperforms FM. Fig. 2 also illustrates the idea of RaFM.

A key challenge of RaFM is the computational burden of storing, evaluating, and training the model:

1. It is well known that FM can be evaluated efficiently in O(D |F|). However, the straightforward computation of RaFM, i.e., of (1) with (4), is O(D |F|^2), which is not satisfactory.

2. To train RaFM, we need to train multiple embedding vectors per feature, which incurs a large computational burden and storage consumption.
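Before addressing these issues, the following minimal Python/NumPy sketch makes the model definition concrete by evaluating the RaFM prediction of Eqs. (1), (4), and (5) pair by pair. The containers `embeds` (a list of per-level embedding tables), `level` (the assigned k_i of each feature), and the sparse-sample encoding `idx`/`val` are illustrative assumptions rather than the authors' implementation; the double loop is exactly the O(D |F|^2) evaluation that Section 3 improves.

```python
import numpy as np

def rafm_predict_naive(idx, val, w0, w, embeds, level):
    """
    Naive RaFM prediction of Eqs. (1), (4), (5): every feature pair (i, j) interacts
    through the embeddings at level k_ij = min(k_i, k_j).  `embeds[k-1][i]` (the
    level-k embedding of feature i) and `level[i]` (= k_i) are illustrative
    containers, not the authors' data layout.  The double loop makes the cost
    quadratic in the number of active features of the sample.
    """
    # Linear part of Eq. (1)
    y = w0 + sum(w[i] * x for i, x in zip(idx, val))
    # Pairwise part, one pair at a time
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            k = min(level[i], level[j])                              # k_ij of Eq. (5)
            y += float(embeds[k - 1][i] @ embeds[k - 1][j]) * val[a] * val[b]  # Eq. (4)
    return y
```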
We separate the discussion into two parts: Section 3 shows that RaFM can be stored and evaluated as efficiently as, or even more efficiently than, a single FM, whereas Section 4 provides an efficient training algorithm.

3. Model Complexity of RaFM

This section analyzes the space and time complexity of RaFM. Trivial bounds on the space and time complexity are O(Σ_{k=1}^{m} D_k |F|) and O(D |F|^2) respectively. In this section, they will be improved significantly, to be comparable to or even smaller than those of a single FM under reasonable conditions.

3.1. Space Complexity

Before discussing the complexity of RaFM, we introduce a class of feature sets:

    F_k = {i ∈ F : k_i ≥ k},  k = 1, 2, ..., m,        (6)

so that F_1 ⊇ F_2 ⊇ ⋯ ⊇ F_m. We have F = F_1 because k_i ≥ 1 by definition. Although the RaFM in Eq. (3) contains m groups of embeddings with different ranks, not all the parameters in each group will be used. Specifically, if we use F∖F_k to denote the set difference of F and F_k, then for i ∈ F∖F_k and l ≥ k, the factor v_i^{(l)} will never be used according to Eq. (4). These factors are called inactive factors. In other words, only the factors v_i^{(k)} with i ∈ F_k need to be maintained; these are called active factors. Therefore we have:

Proposition 1. The space complexity of the parameters in RaFM (3) is O(Σ_{k=1}^{m} D_k |F_k|).

3.2. Time Complexity

To derive the time complexity of RaFM, we introduce two notations, A_{l,k} and B_{l,k}. Specifically,

    A_{l,k} = Σ_{i<j: i,j ∈ F_k} (v_i^{(l)})^T v_j^{(l)} x_i x_j,        (7)

i.e., the pairwise interactions among the features of F_k computed with the level-l embeddings, and B_{l,k} denotes the RaFM interaction term with every pairwise level k_ij clipped to the interval [l, k]; in particular, B_{l,l} is a standard FM interaction over all of F at rank D_l, and B_{1,m} recovers the pairwise term of the RaFM model (3). These quantities satisfy the properties collected in Theorem 3; in particular, B_{l,k} obeys the recursion

    B_{l,k+1} = B_{l,k} − Σ_{i<j: i,j ∈ F_{k+1}} (v_i^{(k)})^T v_j^{(k)} x_i x_j + Σ_{i<j: i,j ∈ F_{k+1}} (v_i^{(k+1)})^T v_j^{(k+1)} x_i x_j = B_{l,k} − A_{k,k+1} + A_{k+1,k+1}.        (9)

Moreover, according to (4), the set of feature pairs interacting at a level above k can be rewritten as

    {i < j : k_ij > k} = {i < j : i, j ∈ F_{k+1}},        (10)

which, together with the definitions of A_{k,k+1} and A_{k+1,k+1}, justifies the correction terms in (9).

Corollary 4. Regarding the computational complexity of RaFM:
1. The time complexity of B_{l,k} is O(D_l |F| + Σ_{p=l+1}^{k} D_p |F_p|);
2. The time complexity of the RaFM model (3), i.e., of B_{1,m}, is O(Σ_{k=1}^{m} D_k |F_k|).

Proof. We only prove the first statement; the second is a direct corollary of the first with l = 1 and k = m. We prove by induction. When k = l, the time complexity of B_{l,l} is O(D_l |F|). If the time complexity of B_{l,k} is O(D_l |F| + Σ_{p=l+1}^{k} D_p |F_p|), then by the recursion (9) the time complexity of B_{l,k+1} is

    O(D_l |F| + Σ_{p=l+1}^{k} D_p |F_p|) + O(D_k |F_{k+1}|) + O(D_{k+1} |F_{k+1}|) = O(D_l |F| + Σ_{p=l+1}^{k+1} D_p |F_p|),

where the fact D_k ≤ D_{k+1} is used.

Remark 5. Similar to FM, when the data is sparse, |F_k| should be replaced by n(F_k), the expected number of occurrences of features of F_k in a data sample. In such cases, the time complexity is in the sense of expectation.
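As a concrete illustration of Corollary 4, the sketch below evaluates the pairwise term B_{1,m} through the recursion (9), applying the standard FM identity Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j = ½(‖Σ_i v_i x_i‖² − Σ_i ‖v_i x_i‖²) at every level. The containers `embeds` (active factors only) and `level` are the same illustrative assumptions used in the earlier sketch; per sample, the cost is O(D_1 n(F) + Σ_{k≥2} D_k n(F_k)), in line with Corollary 4 and Remark 5.

```python
import numpy as np

def fm_pairwise(vecs_and_vals):
    """sum_{i<j} <v_i, v_j> x_i x_j via the standard FM identity, in linear time."""
    if not vecs_and_vals:
        return 0.0
    s = np.zeros_like(vecs_and_vals[0][0], dtype=float)  # running sum of v_i * x_i
    sq = 0.0                                             # running sum of ||v_i * x_i||^2
    for v, x in vecs_and_vals:
        vx = v * x
        s += vx
        sq += float(vx @ vx)
    return 0.5 * (float(s @ s) - sq)

def rafm_interaction(idx, val, embeds, level):
    """
    Pairwise term B_{1,m} of the RaFM model via the recursion
    B_{1,k+1} = B_{1,k} - A_{k,k+1} + A_{k+1,k+1} of Eq. (9).
    `embeds[k-1][i]` (level-k embedding, stored only for active factors, i.e.
    level[i] >= k) and `level[i]` (= k_i) are illustrative containers.
    """
    m = len(embeds)
    # B_{1,1}: all features of the sample interact at the lowest rank D_1 (F_1 = F).
    b = fm_pairwise([(embeds[0][i], x) for i, x in zip(idx, val)])
    for k in range(1, m):
        # F_{k+1} restricted to this sample: features whose assigned level exceeds k.
        sel = [(i, x) for i, x in zip(idx, val) if level[i] >= k + 1]
        a_low = fm_pairwise([(embeds[k - 1][i], x) for i, x in sel])   # A_{k,k+1}
        a_high = fm_pairwise([(embeds[k][i], x) for i, x in sel])      # A_{k+1,k+1}
        b = b - a_low + a_high
    return b
```

Because every B_{1,p} appears as an intermediate value of this computation, the training algorithm of Section 4 can reuse them without extra forward passes.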
3.3. Comparison with FM

This section compares the complexity of RaFM with that of FM. When using a single FM as the predictor, we usually use a large rank to ensure performance and rely on regularization to avoid overfitting. We therefore compare against an FM of rank D_m, whose space and time complexities are O(D |F|) and O(D n(F)) respectively. The space and time complexities of RaFM have the same form except that |F_k| is replaced by n(F_k) in the time complexity, so we discuss them together.

Generally, when k increases, D_k increases and |F_k| decreases. Table 1 lists the complexity under different speeds of D_k increasing and |F_k| decreasing.

Table 1. Complexity of RaFM under different conditions.

                                      D_k = Θ((k/m) D)    D_k = Θ(2^{k−m} D)
    |F_k| = Θ((1 − (k−1)/m) |F|)      O(m D |F|)          O(D |F|)
    |F_k| = Θ(2^{1−k} |F|)            O(D |F|)            O((m / 2^{m−1}) D |F|)

If D_k increases and |F_k| decreases only moderately (e.g., linearly), the complexity of RaFM is large. However, if one or both of them vary rapidly (e.g., exponentially), the complexity of RaFM becomes comparable to or smaller than that of the FM model. In practice, it is widely accepted that D_k should grow rapidly with k (He & Chua, 2017; Li et al., 2017), and, as indicated by Fig. 1, |F_k| decreases rapidly as k increases. The decrease of n(F_k) may not be as rapid as that of |F_k|, but is still likely to be superlinear. Under such conditions, RaFM significantly reduces the space complexity of FM while incurring a comparable or smaller time complexity. This statement is also validated by the experiments in Section 6.1.3.

4. Efficient Learning of RaFM

This section provides a computationally efficient learning algorithm for RaFM, i.e., Algorithm 1, in which the inactive factors need not be stored or trained. We first provide the objective function and the learning algorithm, then prove that the proposed algorithm trains upper bounds of all m FMs simultaneously. We emphasize that although the analysis in this section is somewhat technical, the final algorithm, Algorithm 1, can be regarded as an extension of SGD methods and is therefore easy to implement.

Algorithm 1 Training the RaFM
1: Initialize all the parameters
2: while not convergent do
3:   Sample a data point (x, y) randomly
4:   for 1 ≤ p < m do
5:     v^{(p)}_{F_{p+1}} ← v^{(p)}_{F_{p+1}} − ρ_d ∂L(B_{1,p}, B_{1,p+1}) / ∂v^{(p)} |_{F_{p+1}}
6:     v^{(p)}_{F_p∖F_{p+1}} ← v^{(p)}_{F_p∖F_{p+1}} − ρ_f ∂L(B_{1,m}, y) / ∂v^{(p)} |_{F_p∖F_{p+1}}
7:   end for
8:   v^{(m)}_{F_m} ← v^{(m)}_{F_m} − ρ_f ∂L(B_{1,m}, y) / ∂v^{(m)} |_{F_m}
9: end while

4.1. Constrained Optimization of RaFM

The goal of the training algorithm is to obtain the parameters in B_{1,m}. According to Eq. (7) and Theorem 3, these parameters are v^{(p)}_{F_p}, 1 ≤ p ≤ m. In other words, v^{(p)}_{F∖F_p}, 1 ≤ p ≤ m, are the inactive factors that need not be trained. In order to avoid training the inactive factors, we provide the following bi-level optimization model:

    min  Σ_x L(B_{1,m}, y)        (12a)
    s.t. v^{(p)}_{F_{p+1}} = argmin_{v^{(p)}_{F_{p+1}}} Σ_x L(B_{1,p}, B_{1,p+1}),  1 ≤ p < m        (12b)

where L is the loss function and v^{(p)}_{F_{p+1}} denotes the collection of v^{(p)}_i with i ∈ F_{p+1}. The variables in Eq. (12) can be classified into two groups, free variables and dependent variables: v^{(p)}_{F_p∖F_{p+1}}, 1 ≤ p < m, and v^{(m)}_{F_m} are free variables, while v^{(p)}_{F_{p+1}}, 1 ≤ p < m, are dependent variables, since they are determined by v^{(p+1)}_{F_{p+1}} via the constraint (12b). The basic idea of Eq. (12) is to regard each v_i at its highest dimensionality as a free variable to be optimized, and to regard its lower-dimensional counterparts as projections into lower-dimensional spaces, which are approximated through the constraint (12b). Note that the inactive factors v^{(p)}_{F∖F_p}, 1 ≤ p ≤ m, do not appear in Eq. (12). Therefore, Eq. (12) makes it possible to maintain the model size given by Proposition 1 during training.

4.2. Learning

This subsection shows that Eq. (12) can be trained efficiently. Due to (12b), the dependent variables v^{(p)}_{F_{p+1}} can be obtained by applying stochastic gradient descent (SGD) to the inner objective L(B_{1,p}, B_{1,p+1}). The major challenge is the estimation of the gradient direction of the free variables, because the optimization problem (12) is a multi-stage optimization problem.
Taking v^{(p)}_{F_p∖F_{p+1}} for a certain p as an example, the gradient with respect to v^{(p)}_{F_p∖F_{p+1}} should be

    ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}} + (∂L(B_{1,m}, y) / ∂v^{(p−1)}_{F_p}) · (∂v^{(p−1)}_{F_p} / ∂v^{(p)}_{F_p∖F_{p+1}})        (13)

where ∂v^{(p−1)}_{F_p} / ∂v^{(p)}_{F_p∖F_{p+1}} is a |F_p| D_{p−1} × (|F_p| − |F_{p+1}|) D_p matrix representing the relationship between the dependent variable v^{(p−1)}_{F_p} and the free variable v^{(p)}_{F_p∖F_{p+1}} given by the constraint (12b). According to the implicit function theorem,

    ∂v^{(p−1)}_{F_p} / ∂v^{(p)}_{F_p∖F_{p+1}} = − [∂²L(B_{1,p−1}, B_{1,p}) / ∂(v^{(p−1)}_{F_p})²]^{−1} · ∂²L(B_{1,p−1}, B_{1,p}) / (∂v^{(p−1)}_{F_p} ∂v^{(p)}_{F_p∖F_{p+1}}).        (14)

Applying Eq. (14) to Eq. (13) gives

    ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}} − (∂L(B_{1,m}, y) / ∂v^{(p−1)}_{F_p}) G^{−1} ∂²L(B'_{1,p−1}, B'_{1,p}) / (∂v^{(p−1)}_{F_p} ∂v^{(p)}_{F_p∖F_{p+1}})        (15)

where G = ∂²L / ∂(v^{(p−1)}_{F_p})², whose inverse is the first factor on the right of (14), is a |F_p| D_{p−1} × |F_p| D_{p−1} matrix, and B'_{l,k} denotes the B_{l,k} of an input vector x'. By exchanging x and x' in the second term on the right of (15), and using one sample x' to estimate the gradient, we obtain the estimate

    ĝ = ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}} − (∂L(B'_{1,m}, y) / ∂v^{(p−1)}_{F_p}) G^{−1} ∂²L(B_{1,p−1}, B_{1,p}) / (∂v^{(p−1)}_{F_p} ∂v^{(p)}_{F_p∖F_{p+1}}).        (16)

On the right of Eq. (16), the first term can be obtained by the chain rule, while the second term contains second-order derivatives which are challenging to compute. However, we have the following theorem:

Theorem 6. The direction of ĝ is parallel to its first term ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}}.

Proof. See Section 1 in the supplementary material.

In SGD, the direction of the gradient vector matters more than its length. Theorem 6 shows that we can use the first term in Eq. (16) to estimate the descent direction for the free variables. The same argument applies to v^{(m)}_{F_m}; therefore we can use Algorithm 1 to learn Eq. (12), where ρ_d and ρ_f are the learning rates of the dependent and free variables, respectively.

Now we discuss the complexity of Algorithm 1. Note that in each step only active factors are used and updated, which means the space complexity follows Proposition 1. Moreover, we do not need to compute B_{1,p} separately, since it is a by-product of computing B_{1,m}. Therefore, the time complexity also follows Corollary 4. Thus, as discussed in Section 3.3, training RaFM via Algorithm 1 is as efficient as, or more efficient than, training a single FM. Moreover, it can be proven that the proposed learning algorithm minimizes an upper bound of the loss of each FM with a single rank D_k, 1 ≤ k ≤ m. We put the proof in Section 2 of the supplementary material due to space constraints.
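To make Algorithm 1 concrete, here is a minimal single-sample update in the same Python/NumPy style. It assumes a squared-error loss for both L(B_{1,m}, y) and the projection term L(B_{1,p}, B_{1,p+1}), treats B_{1,p+1} as a fixed target in the dependent update, writes the gradients out by hand, and leaves the linear terms untouched; all of these are simplifying choices of the sketch, not details of the released implementation. Following Theorem 6, the free variables (each feature's top-level embedding) move along the first term of (16) only, while the dependent blocks v^{(p)}_{F_{p+1}} are pulled toward the next-level prediction as in line 5 of Algorithm 1.

```python
import numpy as np

def capped_prediction(idx, val, w0, w, embeds, level, cap):
    """B_{1,cap}: the RaFM prediction with every pairwise level clipped at `cap`.
    Written in the naive pairwise form for clarity; an efficient implementation
    would reuse the recursion of the previous sketch."""
    y = w0 + sum(w[i] * x for i, x in zip(idx, val))
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            k = min(level[i], level[j], cap)
            y += float(embeds[k - 1][i] @ embeds[k - 1][j]) * val[a] * val[b]
    return y

def rafm_sgd_step(idx, val, y, w0, w, embeds, level, rho_f=0.01, rho_d=0.01):
    """
    One single-sample pass of Algorithm 1 (lines 5, 6, and 8), under the simplifying
    assumptions stated above.  `embeds[k-1][i]` and `level[i]` are the same
    illustrative containers as in the earlier sketches.
    """
    m = len(embeds)
    # All capped predictions; every B_{1,p} is a by-product of evaluating B_{1,m}.
    b = [capped_prediction(idx, val, w0, w, embeds, level, p) for p in range(1, m + 1)]

    # Free variables (Theorem 6): each feature's top-level embedding v_i^{(k_i)}
    # follows the plain stochastic gradient of L(B_{1,m}, y).
    err = b[m - 1] - y
    for a, (i, xi) in enumerate(zip(idx, val)):
        k = level[i]
        grad = np.zeros_like(embeds[k - 1][i])
        for c, (j, xj) in enumerate(zip(idx, val)):
            if c != a and level[j] >= k:          # pairs (i, j) interacting at level k_i
                grad += embeds[k - 1][j] * xj
        embeds[k - 1][i] -= rho_f * err * xi * grad

    # Dependent variables: for p < k_i, v_i^{(p)} is pulled toward the next-level
    # prediction B_{1,p+1} through the constraint (12b).
    for p in range(1, m):
        res = b[p - 1] - b[p]                     # B_{1,p} - B_{1,p+1}
        for a, (i, xi) in enumerate(zip(idx, val)):
            if level[i] < p + 1:                  # only i in F_{p+1} has a dependent v^{(p)}
                continue
            grad = np.zeros_like(embeds[p - 1][i])
            for c, (j, xj) in enumerate(zip(idx, val)):
                if c != a and level[j] >= p:      # in B_{1,p}, feature i interacts at level p
                    grad += embeds[p - 1][j] * xj
            embeds[p - 1][i] -= rho_d * res * xi * grad
    return b[m - 1]
```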
5. Related Work

The works most closely related to the combination of FMs with different ranks are DiFacto (Li et al., 2016), MRMA (Li et al., 2017), and FFM (Juan et al., 2016); we compare them with RaFM thoroughly in this section. There are many other works that improve the performance of FMs with deep models, such as NFM (He & Chua, 2017), AFM (Xiao et al., 2017), and DeepFM (Guo et al., 2017). The idea of RaFM is orthogonal to these works; combining RaFM with such models is an attractive direction that we leave for future work.

5.1. DiFacto

DiFacto also uses multiple ranks in one FM. However, for each feature, DiFacto allocates only an embedding vector with a single rank, while RaFM allocates multiple embeddings with different ranks. In DiFacto, the pairwise interaction between features with different ranks is obtained by simply truncating the higher-rank embedding to the lower rank. In cases such as the SVD of a complete matrix, such truncations are reasonable, since the best rank-k approximation is equivalent to the k-prefix of any rank-(k + n) approximation. However, in recommender systems, where the training samples are highly sparse, such truncations usually lead to worse performance, as will be shown in Section 6. Therefore, DiFacto reduces the computational burden of FM by sacrificing performance, while RaFM improves the performance of FM at a lower computational burden.

5.2. MRMA

MRMA is a matrix approximation model whose key idea is also to combine models with different ranks. The major difference is that it stores entire models with different ranks, leading to a large computational and storage burden. Moreover, the large number of parameters may cause severe overfitting, which is why (Li et al., 2017) uses Iterated Conditional Modes (ICM) to train MRMA. In contrast, RaFM significantly reduces the number of parameters by eliminating inactive factors, and is therefore unlikely to suffer from overfitting.

5.3. FFM

RaFM is fundamentally different from FFM, although both use multiple embeddings per feature. On the one hand, FFM uses different embeddings for interactions between different field pairs, whereas the concept of a field does not exist in RaFM. For problems without fields, or with only two fields, FFM fails or degenerates to the original FM, while RaFM still works. On the other hand, the different embeddings in FFM are independent, whereas in RaFM an embedding at a lower rank can be regarded as a projection of the embedding at a higher rank, which is guaranteed by the learning algorithm. This property largely reduces the computational burden and avoids overfitting, while FFM suffers from a large computational burden.

6. Experiments

This section presents experiments with the proposed approach and other benchmarks on several datasets. In Experiment A, we test our approach on 7 public datasets; in Experiment B, we apply RaFM to the news CTR data provided by Tencent to show the effectiveness of the proposed approach in industrial applications. The code of RaFM is available at https://github.com/cxsmarkchan/RaFM.

6.1. Experiment A: Public Datasets

6.1.1. EXPERIMENT SETUP

We consider 3 datasets for regression tasks and 4 for classification tasks. All datasets are randomly split into train (80%), validation (10%), and test (10%) sets. The regression datasets are MovieLens 10M (ML 10M), MovieLens 20M (ML 20M), and the Amazon movie review dataset (AMovie, http://jmcauley.ucsd.edu/data/amazon/), for which the square loss is used as the performance criterion. The classification datasets are Frappe (http://baltrunas.info/research-menu/frappe), MovieLens Tag (ML Tag), Avazu, and Criteo (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/), for which the log loss and the area under the curve (AUC) are used as performance criteria. We use the standard FM (Rendle, 2010) and DiFacto (Li et al., 2016) as baselines for all datasets, and MRMA as a baseline for the datasets containing exactly two fields, i.e., ML 10M, ML 20M, and AMovie. We do not compare with deep models, since the motivation of this work is to show that FMs with different ranks can be efficiently combined and trained while incurring less computational burden; we note, however, that RaFM is a flexible framework into which deep FMs can be embedded.

We adopt L2 regularization for each model and search the L2 coefficient over {1e-6, 5e-6, 1e-5, ..., 1e-1} on the validation set. We search the ranks over {32, 64, 128, 256, 512} for each model, except for some discussions in Section 6.1.3, where we use more values. We compare the performance, the model size, the training time, and the test time. Since RaFM has more hyperparameters, to be fair we tune k_i by the following rule rather than by grid search:

    k_i = argmin_k |log n_i − log D_k|        (17)

where n_i is the number of occurrences of the i-th feature. This rule chooses k_i such that D_k is closest to n_i on a logarithmic scale. Moreover, we use the same L2 coefficient for all FM components of the RaFM, and search it over the same candidate set as the baselines.
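For reference, Eq. (17) is simple enough to state as a short helper; `dims` and the occurrence count `n_i` are hypothetical inputs for illustration, and the released code may organize this differently.

```python
import math

def choose_level(n_i, dims):
    """Eq. (17): assign a feature the rank level whose dimensionality D_k is closest
    to its occurrence count n_i on a log scale.  `dims` = (D_1, ..., D_m); returns
    k_i as a 1-based level index.  Illustrative helper, not taken from the repo."""
    k0 = min(range(len(dims)), key=lambda k: abs(math.log(n_i) - math.log(dims[k])))
    return k0 + 1

# For example, with dims = (32, 512) as in Section 6.1.2, a feature seen 40 times
# gets level 1 (D = 32) and a feature seen 5,000 times gets level 2 (D = 512).
```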
6.1.2. ADVANTAGES OF RAFM OVER FM WITH FIXED RANKS

Fig. 3 shows the comparison between FMs with different ranks and RaFM on the ML Tag dataset. The ranks of the FMs range from 32 to 512, and the hyperparameters of RaFM are m = 2, D_1 = 32, and D_2 = 512. RaFM-low in Fig. 3 denotes B_{1,1}, i.e., the FM model with rank 32, but trained simultaneously with B_{1,2} through Eq. (12b). RaFM significantly outperforms all the FMs. This is because RaFM maintains 32 factors for most of the features, which have limited occurrences, to avoid overfitting, and allocates 512 factors to features with enough occurrences to guarantee expressiveness. Moreover, RaFM-low achieves performance similar to the FM with 32 factors, which means that the embeddings in RaFM have expressiveness similar to FM embeddings of the same rank.

[Figure 3. FM vs. RaFM on the ML Tag dataset: FMs with ranks 32, 64, 128, 256, and 512 compared with RaFM-low and RaFM.]

6.1.3. PERFORMANCE AND COMPLEXITY

Table 2 shows the performance and complexity of RaFMs under different rank settings.

Table 2. Performance and complexity under different settings.

                   log loss    #param    train time
    FM (D=512)     0.2538      46.40M    1
    RaFM (S1)      0.2405      17.89M    0.43
    RaFM (S2)      0.2416       9.63M    0.17
    RaFM (S3)      0.2387       9.23M    0.24
    RaFM (S4)      0.2391      17.75M    1.35

We compare 4 sets of ranks, namely S1 = {32, 512}, S2 = {32, 128, 512}, S3 = {32, 64, 128, 256, 512}, and S4 = {32, 64, 96, 128, 160, ..., 480, 512}.
Therefore, ranks in S2 and S3 vary exponentially, while ranks in S4 vary arithmetically. RaFMs with these 4 settings achieve similar performance, all significantly better than the FM with 512 factors, which is the best of all FMs as shown in Fig. 3. RaFMs with S2 and S3 have fewer parameters due to their exponentially increasing rank settings. RaFM with S4 has a number of parameters comparable to S1, but its training time is much longer. This comparison shows the impact of the speed of rank increase on the computational burden, which is consistent with the discussion in Section 3.3.

6.1.4. RESULTS AND DISCUSSION

Table 3. Results on regression tasks (square loss; #param; train/test time relative to FM).

    ML 10M       square loss        #param   train / test time
      FM         0.8016 ± 0.0010    2.66M    1
      DiFacto    0.7950 ± 0.0011    1.79M    0.82 / 0.95
      MRMA       0.7952 ± 0.0006    4.11M    1.27 / 1.43
      RaFM       0.7870 ± 0.0008    1.57M    0.95 / 1.12

    ML 20M       square loss        #param   train / test time
      FM         0.8002 ± 0.0008    5.45M    1
      DiFacto    0.7948 ± 0.0005    3.22M    0.70 / 0.80
      MRMA       0.7855 ± 0.0011    8.43M    1.19 / 1.38
      RaFM       0.7807 ± 0.0009    3.63M    0.74 / 0.85

    AMovie       square loss        #param   train / test time
      FM         1.0203 ± 0.0046    3.25M    1
      DiFacto    1.0268 ± 0.0051    1.76M    0.75 / 0.75
      MRMA       1.0071 ± 0.0039    5.02M    1.27 / 1.27
      RaFM       0.9986 ± 0.0035    1.76M    0.75 / 0.75

Table 4. Results on classification tasks (log loss; AUC; #param; train/test time relative to FM).

    Frappe       log loss           AUC                #param    train / test time
      FM         0.1702 ± 0.0023    0.9771 ± 0.0008    1.38M     1
      DiFacto    0.1711 ± 0.0023    0.9771 ± 0.0004    0.61M     0.63 / 0.85
      RaFM       0.1447 ± 0.0015    0.9811 ± 0.0002    0.71M     0.73 / 0.85

    ML Tag       log loss           AUC                #param    train / test time
      FM         0.2538 ± 0.0009    0.9503 ± 0.0006    46.40M    1
      DiFacto    0.2529 ± 0.0007    0.9450 ± 0.004     16.97M    0.42 / 0.83
      RaFM       0.2387 ± 0.0005    0.9526 ± 0.0006    9.23M     0.24 / 0.71

    Avazu        log loss           AUC                #param    train / test time
      FM         0.3817 ± 0.0001    0.7761 ± 0.0003    18.81M    1
      DiFacto    0.3823 ± 0.0003    0.7778 ± 0.0003    10.83M    0.82 / 1.79
      RaFM       0.3801 ± 0.0002    0.7826 ± 0.0003    10.17M    0.85 / 1.20

    Criteo       log loss           AUC                #param    train / test time
      FM         0.4471 ± 0.0002    0.8030 ± 0.0002    35.87M    1
      DiFacto    0.4470 ± 0.0002    0.8030 ± 0.0004    19.70M    0.63 / 0.80
      RaFM       0.4451 ± 0.0001    0.8060 ± 0.0002    20.88M    0.67 / 0.84

As evidenced by Tables 3 and 4, our algorithm significantly outperforms the baselines. For example, the relative improvements on the square loss (log loss) criteria are 1%-2% on ML 10M, ML 20M, and AMovie, 15% on Frappe, 6% on ML Tag, and 0.5% on Avazu and Criteo. Empirically, all these improvements are regarded as significant in the research on the corresponding datasets. Taking Criteo as an example, RaFM reduces the log loss of FM by 0.002, while an improvement of 0.001 in log loss is already considered practically significant (Wang et al., 2017).

Moreover, the complexity of RaFM is reduced compared to FM. In fact, RaFM significantly reduces both the model size and the computational time: the model size of RaFM is only 20%-66% of that of FM, and the training time is 24%-95% of that of FM. Although DiFacto can also reduce the computational burden, it cannot guarantee the performance. The major reason is that DiFacto assumes the low-dimensional counterpart of a high-dimensional feature can be obtained by parameter sharing, an assumption that does not usually hold. In contrast, MRMA performs better than FM and DiFacto because it combines models with different ranks; however, its model size and training time are significantly larger than RaFM's, while its performance is a bit worse than that of RaFM due to the overfitting caused by its large number of parameters.

In summary, RaFM not only achieves better performance, but also reduces the model size and computational time. Therefore, the proposed RaFM model also has attractive potential in industrial applications.

6.2. Experiment B: Industrial-Level Click-Through-Rate Dataset

6.2.1. EXPERIMENT SETUP

Here we perform an experiment on the news CTR data provided by Tencent in order to show the potential of RaFM in industrial applications. We use the records of 8 consecutive days, of which the first 7 days are used as the training set and the last day as the test set. The dataset contains 1.7 billion records and 120 million features. We compare the proposed RaFM approach with LR, FM, and DiFacto; the performance criterion is AUC, which is consistent with the requirements of online prediction. For FM/DiFacto/RaFM, the dimensionality of the factors is chosen from {1, 2, 4, 8}. We use the FTRL algorithm to guarantee the sparsity of the learned weights, and use distributed learning to accelerate training.

6.2.2. RESULTS

We show the comparison of AUC and model size in Fig. 4. We use the number of parameters of LR, which is 13.19M in our experiments, as the unit of model size.
The training times of these approaches are not compared here because they are very similar (about 6 hours for training and 30 minutes for testing in our experiments); the training time is dominated by the communication time of distributed learning rather than by the computation time.

[Figure 4. Comparison of AUC and model size (in multiples of the LR parameter count) on the industrial-level CTR dataset, for LR, FM, DiFacto, and RaFM.]

Since the number of records in the training dataset is sufficiently large, it is natural to expect the AUC to increase continuously as the model size increases, and the AUCs in Fig. 4 could be improved further if we allowed a larger rank of factors. However, the model size needed to achieve such an improvement is the main issue, since a larger model brings a heavier burden on the parameter server. According to Fig. 4, RaFM increases the AUC by about 1% compared to LR, while its model size is only 1.55 times that of LR. In comparison, FM needs 7 times the model size of LR to achieve a similar performance. In other words, RaFM needs only 22% of the model size of FM to achieve a similar AUC, which means the proposed RaFM approach achieves a good trade-off between model size and performance, and has attractive potential in industrial applications.

7. Conclusion and Future Work

This paper proposes the RaFM model, which adopts pairwise interactions from embeddings with different ranks. RaFM can be stored and evaluated as efficiently as, or even more efficiently than, FMs with fixed ranks. Moreover, we provide a learning algorithm that efficiently trains all embeddings in one concise model, and prove that the training error of each constituent FM is bounded from above. Experiments demonstrate that RaFM not only achieves better performance on regression and classification datasets whose features have significantly varying frequencies of occurrence, but also reduces the computational burden of FMs. RaFM is a flexible framework, so an interesting direction for future work is to combine it with deep models to achieve better performance. A more effective hyperparameter tuning approach is also an attractive research direction.

References

Blondel, M., Fujino, A., and Ueda, N. Convex factorization machines. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 19-35. Springer, 2015.

Blondel, M., Fujino, A., Ueda, N., and Ishihata, M. Higher-order factorization machines. In Advances in Neural Information Processing Systems, pp. 3351-3359, 2016.

Du, C., Li, C., Zheng, Y., Zhu, J., and Zhang, B. Collaborative filtering with user-item co-autoregressive models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

He, X. and Chua, T.-S. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 355-364. ACM, 2017.

Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1965-1972. AAAI Press, 2017. ISBN 978-0-9992411-0-3. URL http://dl.acm.org/citation.cfm?id=3172077.3172161.
Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 43-50. ACM, 2016.

Lauly, S., Zheng, Y., Allauzen, A., and Larochelle, H. Document neural autoregressive distribution estimation. The Journal of Machine Learning Research, 18(1):4046-4069, 2017.

Li, D., Chen, C., Liu, W., Lu, T., Gu, N., and Chu, S. Mixture-rank matrix approximation for collaborative filtering. In Advances in Neural Information Processing Systems, pp. 477-485, 2017.

Li, L., Zhao, P., Zhou, J., and Li, X. A boosting framework of factorization machine. arXiv preprint arXiv:1804.06027, 2018.

Li, M., Liu, Z., Smola, A. J., and Wang, Y.-X. DiFacto: Distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 377-386. ACM, 2016.

Liu, C., Zhang, T., Zhao, P., Zhou, J., and Sun, J. Locally linear factorization machines. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2294-2300. AAAI Press, 2017.

Rendle, S. Factorization machines. In IEEE 10th International Conference on Data Mining (ICDM), pp. 995-1000. IEEE, 2010.

Rendle, S. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57, 2012.

Wang, R., Fu, B., Fu, G., and Wang, M. Deep & cross network for ad click predictions. CoRR, abs/1708.05123, 2017. URL http://arxiv.org/abs/1708.05123.

Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., and Chua, T.-S. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617, 2017.

Zheng, Y., Liu, C., Tang, B., and Zhou, H. Neural autoregressive collaborative filtering for implicit feedback. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 2-6. ACM, 2016a.

Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 764-773. JMLR.org, 2016b. URL http://dl.acm.org/citation.cfm?id=3045390.3045472.