RaFM: Rank-Aware Factorization Machines

Xiaoshuang Chen 1, Yin Zheng 2, Jiaxing Wang 3, Wenye Ma 2, Junzhou Huang 2

1 Department of Electrical Engineering, Tsinghua University, Beijing, China. 2 Tencent AI Lab, Shenzhen, China. 3 Institute of Automation, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, Beijing, China. Correspondence to: Yin Zheng.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Factorization machines (FM) are a popular model class that learns pairwise feature interactions through a low-rank approximation. Unlike existing FM-based approaches, which use a fixed rank for all features, this paper proposes a Rank-Aware FM (RaFM) model that adopts pairwise interactions from embeddings with different ranks. The proposed model achieves better performance on real-world datasets in which different features have significantly varying frequencies of occurrence. Moreover, we prove that RaFM can be stored, evaluated, and trained as efficiently as a single FM, and under reasonable conditions it can even be significantly more efficient than FM. RaFM improves the performance of FMs on both regression and classification tasks while incurring less computational burden, and therefore also has attractive potential in industrial applications.

1. Introduction

Factorization machines (FM) (Rendle, 2010; 2012) are one of the most popular models for leveraging interactions between features. FM models nested feature interactions via a factorized parametrization and has been successful in many sparse prediction tasks, such as recommender systems and click-through rate (CTR) prediction (Juan et al., 2016). Recently there have been many follow-up works, such as higher-order FMs (Blondel et al., 2016), convex FMs (Blondel et al., 2015), neural-network-based FMs (He & Chua, 2017; Guo et al., 2017), and locally linear FMs (Liu et al., 2017).

Most existing approaches allocate embedding vectors of a fixed rank to every feature (Liu et al., 2017; Guo et al., 2017; Zheng et al., 2016b;a; Du et al., 2018; Lauly et al., 2017; Jiang et al., 2017). However, the frequencies with which different features occur in real-world datasets vary a lot. Fig. 1 shows the numbers of occurrences of different features in two public datasets, MovieLens Tag (https://grouplens.org/datasets/movielens) and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction/data). A large number of features occur only rarely in the dataset, while only a few features occur frequently. Fixing the rank of an FM may therefore lead to overfitting for features with few occurrences and underfitting for features with many occurrences.

[Figure 1. Occurrences of different features in (a) ML Tag and (b) Avazu, bucketed by occurrence count (1-2, 3-4, 5-8, ..., >128). Example: in ML Tag, more than 40,000 features occur only once or twice, while fewer than 5,000 features occur more than 128 times.]

There is also research that allocates different ranks to different features (Li et al., 2016; 2017; Juan et al., 2016). However, training and storing multiple embeddings leads to an unaffordable computational burden, and the large number of parameters may even aggravate the overfitting problem (Li et al., 2017).
To this end, this paper proposes a Rank-Aware FM (RaFM) model, which maintains embedding vectors with different ranks for each feature, making it possible to compute each pairwise interaction with embedding vectors of an appropriate rank. A key contribution of this paper is to prove that although RaFM maintains multiple embeddings with different ranks per feature, it can be stored, evaluated, and trained as efficiently as, or even more efficiently than, a single FM with a fixed rank. Specifically, we analyze the time and space complexity of RaFM and show that there are many inactive factors which need not be stored or evaluated. Therefore, both the computational time and the storage can be significantly reduced, and the overall computational burden can even be smaller than that of a single FM under reasonable conditions. Furthermore, we provide an algorithm that trains all embedding vectors of RaFM in one concise model, in which the inactive factors are neither stored nor trained, and we prove the convergence rate and the performance bound of the training algorithm. Thus, the computational burden of the training process can also be effectively reduced.

Experiments show the effectiveness of RaFM on both public datasets and datasets from industrial applications. The proposed RaFM model improves the performance of FMs while incurring a comparable or even smaller computational burden, and therefore also has attractive potential in industrial applications.

2. Problem Formulation

2.1. General Interaction Form of FMs

Here we provide a general interaction form of the FM model:

    ŷ(x) = w_0 + Σ_{i ∈ F} w_i x_i + Σ_{i,j ∈ F, i<j} ⟨V_i, V_j⟩ x_i x_j        (1)

where F is the feature set, V_i denotes the embedding(s) of the i-th feature, and ⟨V_i, V_j⟩ denotes the pairwise interaction between the i-th and j-th features. The standard FM allocates a single embedding v_i ∈ R^D to each feature and sets

    ⟨V_i, V_j⟩_FM = v_i^T v_j.        (2)

RaFM instead maintains embeddings at m rank levels D_1 < D_2 < ⋯ < D_m, allocates to the i-th feature the embeddings

    V_i = {v_i^{(1)}, ..., v_i^{(m)}},  v_i^{(k)} ∈ R^{D_k},        (3)

and assigns to it a rank level k_i ∈ {1, ..., m}. If k_i is chosen too large, the rank of v_i^{(k_i)} will be too large compared to the occurrences of the i-th feature, which will lead to overfitting. k_i can therefore be chosen according to the number of occurrences of the i-th feature, and in this paper we regard the k_i as hyperparameters. Although boosting methods (Li et al., 2018) might be used to select appropriate k_i, we leave this for future work.

Here we emphasize that the multiple embeddings in V_i are not independent in the RaFM model. Intuitively, v_i^{(1)} can be regarded as a projection of v_i^{(2)} from the D_2-dimensional space to the D_1-dimensional space. We will return to this point in Section 4, which provides an efficient training problem that maintains this property.

The pairwise interaction is computed as

    ⟨V_i, V_j⟩_RaFM = (v_i^{(k_ij)})^T v_j^{(k_ij)},        (4)
    where k_ij = min(k_i, k_j),        (5)

which means that the dimensionality used to compute the pairwise interaction between the i-th and j-th features is the largest dimensionality shared by the embeddings of the two features. Therefore, the computation of ⟨V_i, V_j⟩ achieves a good trade-off between overfitting and underfitting, which is the intuition for why RaFM outperforms FM. Fig. 2 also illustrates the idea of RaFM.

A key challenge of RaFM is the computational burden of storing, evaluating, and training the model:

1. It is well known that FM can be evaluated efficiently in O(D |F|). However, the straightforward computation of RaFM, i.e., of (1) with (4), is O(D |F|^2), which is not satisfactory.

2. To train RaFM, we need to train multiple embedding vectors per feature, which incurs a large computational burden and storage consumption.
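Before addressing these issues, the following minimal Python/NumPy sketch makes the model definition concrete by evaluating the RaFM prediction of Eqs. (1), (4), and (5) pair by pair. The containers `embeds` (a list of per-level embedding tables), `level` (the assigned k_i of each feature), and the sparse-sample encoding `idx`/`val` are illustrative assumptions rather than the authors' implementation; the double loop is exactly the O(D |F|^2) evaluation that Section 3 improves.

```python
import numpy as np

def rafm_predict_naive(idx, val, w0, w, embeds, level):
    """
    Naive RaFM prediction of Eqs. (1), (4), (5): every feature pair (i, j) interacts
    through the embeddings at level k_ij = min(k_i, k_j).  `embeds[k-1][i]` (the
    level-k embedding of feature i) and `level[i]` (= k_i) are illustrative
    containers, not the authors' data layout.  The double loop makes the cost
    quadratic in the number of active features of the sample.
    """
    # Linear part of Eq. (1)
    y = w0 + sum(w[i] * x for i, x in zip(idx, val))
    # Pairwise part, one pair at a time
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            k = min(level[i], level[j])                              # k_ij of Eq. (5)
            y += float(embeds[k - 1][i] @ embeds[k - 1][j]) * val[a] * val[b]  # Eq. (4)
    return y
```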
We separate the discussion into two parts: Section 3 shows that RaFM can be stored and evaluated as efficiently as, or even more efficiently than, a single FM, whereas Section 4 provides an efficient training algorithm.

3. Model Complexity of RaFM

This section analyzes the space and time complexity of RaFM. Trivial bounds on the space and time complexity are O(Σ_{k=1}^{m} D_k |F|) and O(D |F|^2) respectively. In this section, they will be improved significantly, to be comparable to or even smaller than those of a single FM under reasonable conditions.

3.1. Space Complexity

Before discussing the complexity of RaFM, we introduce a class of feature sets:

    F_k = {i ∈ F : k_i ≥ k},  k = 1, 2, ..., m,        (6)

so that F_1 ⊇ F_2 ⊇ ⋯ ⊇ F_m. We have F = F_1 because k_i ≥ 1 by definition. Although the RaFM in Eq. (3) contains m groups of embeddings with different ranks, not all the parameters in each group will be used. Specifically, if we use F∖F_k to denote the set difference of F and F_k, then for i ∈ F∖F_k and l ≥ k, the factor v_i^{(l)} will never be used according to Eq. (4). These factors are called inactive factors. In other words, only the factors v_i^{(k)} with i ∈ F_k need to be maintained; these are called active factors. Therefore we have:

Proposition 1. The space complexity of the parameters in RaFM (3) is O(Σ_{k=1}^{m} D_k |F_k|).

3.2. Time Complexity

To derive the time complexity of RaFM, we introduce two notations, A_{l,k} and B_{l,k}. Specifically,

    A_{l,k} = Σ_{i<j: i,j ∈ F_k} (v_i^{(l)})^T v_j^{(l)} x_i x_j,        (7)

i.e., the pairwise interactions among the features of F_k computed with the level-l embeddings, and B_{l,k} denotes the RaFM interaction term with every pairwise level k_ij clipped to the interval [l, k]; in particular, B_{l,l} is a standard FM interaction over all of F at rank D_l, and B_{1,m} recovers the pairwise term of the RaFM model (3). These quantities satisfy the properties collected in Theorem 3; in particular, B_{l,k} obeys the recursion

    B_{l,k+1} = B_{l,k} − Σ_{i<j: i,j ∈ F_{k+1}} (v_i^{(k)})^T v_j^{(k)} x_i x_j + Σ_{i<j: i,j ∈ F_{k+1}} (v_i^{(k+1)})^T v_j^{(k+1)} x_i x_j = B_{l,k} − A_{k,k+1} + A_{k+1,k+1}.        (9)

Moreover, according to (4), the set of feature pairs interacting at a level above k can be rewritten as

    {i < j : k_ij > k} = {i < j : i, j ∈ F_{k+1}},        (10)

which, together with the definitions of A_{k,k+1} and A_{k+1,k+1}, justifies the correction terms in (9).

Corollary 4. Regarding the computational complexity of RaFM:
1. The time complexity of B_{l,k} is O(D_l |F| + Σ_{p=l+1}^{k} D_p |F_p|);
2. The time complexity of the RaFM model (3), i.e., of B_{1,m}, is O(Σ_{k=1}^{m} D_k |F_k|).

Proof. We only prove the first statement; the second is a direct corollary of the first with l = 1 and k = m. We prove by induction. When k = l, the time complexity of B_{l,l} is O(D_l |F|). If the time complexity of B_{l,k} is O(D_l |F| + Σ_{p=l+1}^{k} D_p |F_p|), then by the recursion (9) the time complexity of B_{l,k+1} is

    O(D_l |F| + Σ_{p=l+1}^{k} D_p |F_p|) + O(D_k |F_{k+1}|) + O(D_{k+1} |F_{k+1}|) = O(D_l |F| + Σ_{p=l+1}^{k+1} D_p |F_p|),

where the fact D_k ≤ D_{k+1} is used.

Remark 5. Similar to FM, when the data is sparse, |F_k| should be replaced by n(F_k), the expected number of occurrences of features of F_k in a data sample. In such cases, the time complexity is in the sense of expectation.
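As a concrete illustration of Corollary 4, the sketch below evaluates the pairwise term B_{1,m} through the recursion (9), applying the standard FM identity Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j = ½(‖Σ_i v_i x_i‖² − Σ_i ‖v_i x_i‖²) at every level. The containers `embeds` (active factors only) and `level` are the same illustrative assumptions used in the earlier sketch; per sample, the cost is O(D_1 n(F) + Σ_{k≥2} D_k n(F_k)), in line with Corollary 4 and Remark 5.

```python
import numpy as np

def fm_pairwise(vecs_and_vals):
    """sum_{i<j} <v_i, v_j> x_i x_j via the standard FM identity, in linear time."""
    if not vecs_and_vals:
        return 0.0
    s = np.zeros_like(vecs_and_vals[0][0], dtype=float)  # running sum of v_i * x_i
    sq = 0.0                                             # running sum of ||v_i * x_i||^2
    for v, x in vecs_and_vals:
        vx = v * x
        s += vx
        sq += float(vx @ vx)
    return 0.5 * (float(s @ s) - sq)

def rafm_interaction(idx, val, embeds, level):
    """
    Pairwise term B_{1,m} of the RaFM model via the recursion
    B_{1,k+1} = B_{1,k} - A_{k,k+1} + A_{k+1,k+1} of Eq. (9).
    `embeds[k-1][i]` (level-k embedding, stored only for active factors, i.e.
    level[i] >= k) and `level[i]` (= k_i) are illustrative containers.
    """
    m = len(embeds)
    # B_{1,1}: all features of the sample interact at the lowest rank D_1 (F_1 = F).
    b = fm_pairwise([(embeds[0][i], x) for i, x in zip(idx, val)])
    for k in range(1, m):
        # F_{k+1} restricted to this sample: features whose assigned level exceeds k.
        sel = [(i, x) for i, x in zip(idx, val) if level[i] >= k + 1]
        a_low = fm_pairwise([(embeds[k - 1][i], x) for i, x in sel])   # A_{k,k+1}
        a_high = fm_pairwise([(embeds[k][i], x) for i, x in sel])      # A_{k+1,k+1}
        b = b - a_low + a_high
    return b
```

Because every B_{1,p} appears as an intermediate value of this computation, the training algorithm of Section 4 can reuse them without extra forward passes.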
3.3. Comparison with FM

This section compares the complexity of RaFM with that of FM. When using a single FM as the predictor, we usually use a large rank to ensure performance and rely on regularization to avoid overfitting. We therefore compare against an FM of rank D_m, whose space and time complexities are O(D |F|) and O(D n(F)) respectively. The space and time complexities of RaFM have the same form except that |F_k| is replaced by n(F_k) in the time complexity, so we discuss them together.

Generally, when k increases, D_k increases and |F_k| decreases. Table 1 lists the complexity under different speeds of D_k increasing and |F_k| decreasing.

Table 1. Complexity of RaFM under different conditions.

                                      D_k = Θ((k/m) D)    D_k = Θ(2^{k−m} D)
    |F_k| = Θ((1 − (k−1)/m) |F|)      O(m D |F|)          O(D |F|)
    |F_k| = Θ(2^{1−k} |F|)            O(D |F|)            O((m / 2^{m−1}) D |F|)

If D_k increases and |F_k| decreases only moderately (e.g., linearly), the complexity of RaFM is large. However, if one or both of them vary rapidly (e.g., exponentially), the complexity of RaFM becomes comparable to or smaller than that of the FM model. In practice, it is widely accepted that D_k should grow rapidly with k (He & Chua, 2017; Li et al., 2017), and, as indicated by Fig. 1, |F_k| decreases rapidly as k increases. The decrease of n(F_k) may not be as rapid as that of |F_k|, but is still likely to be superlinear. Under such conditions, RaFM significantly reduces the space complexity of FM while incurring a comparable or smaller time complexity. This statement is also validated by the experiments in Section 6.1.3.

4. Efficient Learning of RaFM

This section provides a computationally efficient learning algorithm for RaFM, i.e., Algorithm 1, in which the inactive factors need not be stored or trained. We first provide the objective function and the learning algorithm, then prove that the proposed algorithm trains upper bounds of all m FMs simultaneously. We emphasize that although the analysis in this section is somewhat technical, the final algorithm, Algorithm 1, can be regarded as an extension of SGD methods and is therefore easy to implement.

Algorithm 1 Training the RaFM
1: Initialize all the parameters
2: while not convergent do
3:   Sample a data point (x, y) randomly
4:   for 1 ≤ p < m do
5:     v^{(p)}_{F_{p+1}} ← v^{(p)}_{F_{p+1}} − ρ_d ∂L(B_{1,p}, B_{1,p+1}) / ∂v^{(p)} |_{F_{p+1}}
6:     v^{(p)}_{F_p∖F_{p+1}} ← v^{(p)}_{F_p∖F_{p+1}} − ρ_f ∂L(B_{1,m}, y) / ∂v^{(p)} |_{F_p∖F_{p+1}}
7:   end for
8:   v^{(m)}_{F_m} ← v^{(m)}_{F_m} − ρ_f ∂L(B_{1,m}, y) / ∂v^{(m)} |_{F_m}
9: end while

4.1. Constrained Optimization of RaFM

The goal of the training algorithm is to obtain the parameters in B_{1,m}. According to Eq. (7) and Theorem 3, these parameters are v^{(p)}_{F_p}, 1 ≤ p ≤ m. In other words, v^{(p)}_{F∖F_p}, 1 ≤ p ≤ m, are the inactive factors that need not be trained. In order to avoid training the inactive factors, we provide the following bi-level optimization model:

    min  Σ_x L(B_{1,m}, y)        (12a)
    s.t. v^{(p)}_{F_{p+1}} = argmin_{v^{(p)}_{F_{p+1}}} Σ_x L(B_{1,p}, B_{1,p+1}),  1 ≤ p < m        (12b)

where L is the loss function and v^{(p)}_{F_{p+1}} denotes the collection of v^{(p)}_i with i ∈ F_{p+1}. The variables in Eq. (12) can be classified into two groups, free variables and dependent variables: v^{(p)}_{F_p∖F_{p+1}}, 1 ≤ p < m, and v^{(m)}_{F_m} are free variables, while v^{(p)}_{F_{p+1}}, 1 ≤ p < m, are dependent variables, since they are determined by v^{(p+1)}_{F_{p+1}} via the constraint (12b). The basic idea of Eq. (12) is to regard each v_i at its highest dimensionality as a free variable to be optimized, and to regard its lower-dimensional counterparts as projections into lower-dimensional spaces, which are approximated through the constraint (12b). Note that the inactive factors v^{(p)}_{F∖F_p}, 1 ≤ p ≤ m, do not appear in Eq. (12). Therefore, Eq. (12) makes it possible to maintain the model size given by Proposition 1 during training.

4.2. Learning

This subsection shows that Eq. (12) can be trained efficiently. Due to (12b), the dependent variables v^{(p)}_{F_{p+1}} can be obtained by applying stochastic gradient descent (SGD) to the inner objective L(B_{1,p}, B_{1,p+1}). The major challenge is the estimation of the gradient direction of the free variables, because the optimization problem (12) is a multi-stage optimization problem.
Taking v^{(p)}_{F_p∖F_{p+1}} for a certain p as an example, the gradient with respect to v^{(p)}_{F_p∖F_{p+1}} should be

    ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}} + (∂L(B_{1,m}, y) / ∂v^{(p−1)}_{F_p}) · (∂v^{(p−1)}_{F_p} / ∂v^{(p)}_{F_p∖F_{p+1}})        (13)

where ∂v^{(p−1)}_{F_p} / ∂v^{(p)}_{F_p∖F_{p+1}} is a |F_p| D_{p−1} × (|F_p| − |F_{p+1}|) D_p matrix representing the relationship between the dependent variable v^{(p−1)}_{F_p} and the free variable v^{(p)}_{F_p∖F_{p+1}} given by the constraint (12b). According to the implicit function theorem,

    ∂v^{(p−1)}_{F_p} / ∂v^{(p)}_{F_p∖F_{p+1}} = − [∂²L(B_{1,p−1}, B_{1,p}) / ∂(v^{(p−1)}_{F_p})²]^{−1} · ∂²L(B_{1,p−1}, B_{1,p}) / (∂v^{(p−1)}_{F_p} ∂v^{(p)}_{F_p∖F_{p+1}}).        (14)

Applying Eq. (14) to Eq. (13) gives

    ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}} − (∂L(B_{1,m}, y) / ∂v^{(p−1)}_{F_p}) G^{−1} ∂²L(B'_{1,p−1}, B'_{1,p}) / (∂v^{(p−1)}_{F_p} ∂v^{(p)}_{F_p∖F_{p+1}})        (15)

where G = ∂²L / ∂(v^{(p−1)}_{F_p})², whose inverse is the first factor on the right of (14), is a |F_p| D_{p−1} × |F_p| D_{p−1} matrix, and B'_{l,k} denotes the B_{l,k} of an input vector x'. By exchanging x and x' in the second term on the right of (15), and using one sample x' to estimate the gradient, we obtain the estimate

    ĝ = ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}} − (∂L(B'_{1,m}, y) / ∂v^{(p−1)}_{F_p}) G^{−1} ∂²L(B_{1,p−1}, B_{1,p}) / (∂v^{(p−1)}_{F_p} ∂v^{(p)}_{F_p∖F_{p+1}}).        (16)

On the right of Eq. (16), the first term can be obtained by the chain rule, while the second term contains second-order derivatives which are challenging to compute. However, we have the following theorem:

Theorem 6. The direction of ĝ is parallel to its first term ∂L(B_{1,m}, y) / ∂v^{(p)}_{F_p∖F_{p+1}}.

Proof. See Section 1 in the supplementary material.

In SGD, the direction of the gradient vector matters more than its length. Theorem 6 shows that we can use the first term in Eq. (16) to estimate the descent direction for the free variables. The same argument applies to v^{(m)}_{F_m}; therefore we can use Algorithm 1 to learn Eq. (12), where ρ_d and ρ_f are the learning rates of the dependent and free variables, respectively.

Now we discuss the complexity of Algorithm 1. Note that in each step only active factors are used and updated, which means the space complexity follows Proposition 1. Moreover, we do not need to compute B_{1,p} separately, since it is a by-product of computing B_{1,m}. Therefore, the time complexity also follows Corollary 4. Thus, as discussed in Section 3.3, training RaFM via Algorithm 1 is as efficient as, or more efficient than, training a single FM. Moreover, it can be proven that the proposed learning algorithm minimizes an upper bound of the loss of each FM with a single rank D_k, 1 ≤ k ≤ m. We put the proof in Section 2 of the supplementary material due to space constraints.
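To make Algorithm 1 concrete, here is a minimal single-sample update in the same Python/NumPy style. It assumes a squared-error loss for both L(B_{1,m}, y) and the projection term L(B_{1,p}, B_{1,p+1}), treats B_{1,p+1} as a fixed target in the dependent update, writes the gradients out by hand, and leaves the linear terms untouched; all of these are simplifying choices of the sketch, not details of the released implementation. Following Theorem 6, the free variables (each feature's top-level embedding) move along the first term of (16) only, while the dependent blocks v^{(p)}_{F_{p+1}} are pulled toward the next-level prediction as in line 5 of Algorithm 1.

```python
import numpy as np

def capped_prediction(idx, val, w0, w, embeds, level, cap):
    """B_{1,cap}: the RaFM prediction with every pairwise level clipped at `cap`.
    Written in the naive pairwise form for clarity; an efficient implementation
    would reuse the recursion of the previous sketch."""
    y = w0 + sum(w[i] * x for i, x in zip(idx, val))
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            k = min(level[i], level[j], cap)
            y += float(embeds[k - 1][i] @ embeds[k - 1][j]) * val[a] * val[b]
    return y

def rafm_sgd_step(idx, val, y, w0, w, embeds, level, rho_f=0.01, rho_d=0.01):
    """
    One single-sample pass of Algorithm 1 (lines 5, 6, and 8), under the simplifying
    assumptions stated above.  `embeds[k-1][i]` and `level[i]` are the same
    illustrative containers as in the earlier sketches.
    """
    m = len(embeds)
    # All capped predictions; every B_{1,p} is a by-product of evaluating B_{1,m}.
    b = [capped_prediction(idx, val, w0, w, embeds, level, p) for p in range(1, m + 1)]

    # Free variables (Theorem 6): each feature's top-level embedding v_i^{(k_i)}
    # follows the plain stochastic gradient of L(B_{1,m}, y).
    err = b[m - 1] - y
    for a, (i, xi) in enumerate(zip(idx, val)):
        k = level[i]
        grad = np.zeros_like(embeds[k - 1][i])
        for c, (j, xj) in enumerate(zip(idx, val)):
            if c != a and level[j] >= k:          # pairs (i, j) interacting at level k_i
                grad += embeds[k - 1][j] * xj
        embeds[k - 1][i] -= rho_f * err * xi * grad

    # Dependent variables: for p < k_i, v_i^{(p)} is pulled toward the next-level
    # prediction B_{1,p+1} through the constraint (12b).
    for p in range(1, m):
        res = b[p - 1] - b[p]                     # B_{1,p} - B_{1,p+1}
        for a, (i, xi) in enumerate(zip(idx, val)):
            if level[i] < p + 1:                  # only i in F_{p+1} has a dependent v^{(p)}
                continue
            grad = np.zeros_like(embeds[p - 1][i])
            for c, (j, xj) in enumerate(zip(idx, val)):
                if c != a and level[j] >= p:      # in B_{1,p}, feature i interacts at level p
                    grad += embeds[p - 1][j] * xj
            embeds[p - 1][i] -= rho_d * res * xi * grad
    return b[m - 1]
```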
5. Related Work

The works most closely related to the combination of FMs with different ranks are DiFacto (Li et al., 2016), MRMA (Li et al., 2017), and FFM (Juan et al., 2016); we compare them with RaFM thoroughly in this section. There are many other works that improve the performance of FMs with deep models, such as NFM (He & Chua, 2017), AFM (Xiao et al., 2017), and DeepFM (Guo et al., 2017). The idea of RaFM is orthogonal to these works; combining RaFM with such models is an attractive direction that we leave for future work.

5.1. DiFacto

DiFacto also uses multiple ranks in one FM. However, for each feature, DiFacto allocates only an embedding vector with a single rank, while RaFM allocates multiple embeddings with different ranks. In DiFacto, the pairwise interaction between features with different ranks is obtained by simply truncating the higher-rank embedding to the lower rank. In cases such as the SVD of a complete matrix, such truncations are reasonable, since the best rank-k approximation is equivalent to the k-prefix of any rank-(k + n) approximation. However, in recommender systems, where the training samples are highly sparse, such truncations usually lead to worse performance, as will be shown in Section 6. Therefore, DiFacto reduces the computational burden of FM by sacrificing performance, while RaFM improves the performance of FM at a lower computational burden.

5.2. MRMA

MRMA is a matrix approximation model whose key idea is also to combine models with different ranks. The major difference is that it stores entire models with different ranks, leading to a large computational and storage burden. Moreover, the large number of parameters may cause severe overfitting, which is why (Li et al., 2017) uses Iterated Conditional Modes (ICM) to train MRMA. In contrast, RaFM significantly reduces the number of parameters by eliminating inactive factors, and is therefore unlikely to suffer from overfitting.

5.3. FFM

RaFM is fundamentally different from FFM, although both use multiple embeddings per feature. On the one hand, FFM uses different embeddings for interactions between different field pairs, whereas the concept of a field does not exist in RaFM. For problems without fields, or with only two fields, FFM fails or degenerates to the original FM, while RaFM still works. On the other hand, the different embeddings in FFM are independent, whereas in RaFM an embedding at a lower rank can be regarded as a projection of the embedding at a higher rank, which is guaranteed by the learning algorithm. This property largely reduces the computational burden and avoids overfitting, while FFM suffers from a large computational burden.

6. Experiments

This section presents experiments with the proposed approach and other benchmarks on several datasets. In Experiment A, we test our approach on 7 public datasets; in Experiment B, we apply RaFM to the news CTR data provided by Tencent to show the effectiveness of the proposed approach in industrial applications. The code of RaFM is available at https://github.com/cxsmarkchan/RaFM.

6.1. Experiment A: Public Datasets

6.1.1. EXPERIMENT SETUP

We consider 3 datasets for regression tasks and 4 for classification tasks. All datasets are randomly split into train (80%), validation (10%), and test (10%) sets. The regression datasets are MovieLens 10M (ML 10M), MovieLens 20M (ML 20M), and the Amazon movie review dataset (AMovie, http://jmcauley.ucsd.edu/data/amazon/), for which the square loss is used as the performance criterion. The classification datasets are Frappe (http://baltrunas.info/research-menu/frappe), MovieLens Tag (ML Tag), Avazu, and Criteo (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/), for which the log loss and the area under the curve (AUC) are used as performance criteria. We use the standard FM (Rendle, 2010) and DiFacto (Li et al., 2016) as baselines for all datasets, and MRMA as a baseline for the datasets containing exactly two fields, i.e., ML 10M, ML 20M, and AMovie. We do not compare with deep models, since the motivation of this work is to show that FMs with different ranks can be efficiently combined and trained while incurring less computational burden; we note, however, that RaFM is a flexible framework into which deep FMs can be embedded.

We adopt L2 regularization for each model and search the L2 coefficient over {1e-6, 5e-6, 1e-5, ..., 1e-1} on the validation set. We search the ranks over {32, 64, 128, 256, 512} for each model, except for some discussions in Section 6.1.3, where we use more values. We compare the performance, the model size, the training time, and the test time. Since RaFM has more hyperparameters, to be fair we tune k_i by the following rule rather than by grid search:

    k_i = argmin_k |log n_i − log D_k|        (17)

where n_i is the number of occurrences of the i-th feature. This rule chooses k_i such that D_k is closest to n_i on a logarithmic scale. Moreover, we use the same L2 coefficient for all FM components of the RaFM, and search it over the same candidate set as the baselines.
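For reference, Eq. (17) is simple enough to state as a short helper; `dims` and the occurrence count `n_i` are hypothetical inputs for illustration, and the released code may organize this differently.

```python
import math

def choose_level(n_i, dims):
    """Eq. (17): assign a feature the rank level whose dimensionality D_k is closest
    to its occurrence count n_i on a log scale.  `dims` = (D_1, ..., D_m); returns
    k_i as a 1-based level index.  Illustrative helper, not taken from the repo."""
    k0 = min(range(len(dims)), key=lambda k: abs(math.log(n_i) - math.log(dims[k])))
    return k0 + 1

# For example, with dims = (32, 512) as in Section 6.1.2, a feature seen 40 times
# gets level 1 (D = 32) and a feature seen 5,000 times gets level 2 (D = 512).
```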
6.1.2. ADVANTAGES OF RAFM OVER FM WITH FIXED RANKS

Fig. 3 shows the comparison between FMs with different ranks and RaFM on the ML Tag dataset. The ranks of the FMs range from 32 to 512, and the hyperparameters of RaFM are m = 2, D_1 = 32, and D_2 = 512. RaFM-low in Fig. 3 denotes B_{1,1}, i.e., the FM model with rank 32, but trained simultaneously with B_{1,2} through Eq. (12b). RaFM significantly outperforms all the FMs. This is because RaFM maintains 32 factors for most of the features, which have limited occurrences, to avoid overfitting, and allocates 512 factors to features with enough occurrences to guarantee expressiveness. Moreover, RaFM-low achieves performance similar to the FM with 32 factors, which means that the embeddings in RaFM have expressiveness similar to FM embeddings of the same rank.

[Figure 3. FM vs. RaFM on the ML Tag dataset: FMs with ranks 32, 64, 128, 256, and 512 compared with RaFM-low and RaFM.]

6.1.3. PERFORMANCE AND COMPLEXITY

Table 2 shows the performance and complexity of RaFMs under different rank settings.

Table 2. Performance and complexity under different settings.

                   log loss    #param    train time
    FM (D=512)     0.2538      46.40M    1
    RaFM (S1)      0.2405      17.89M    0.43
    RaFM (S2)      0.2416       9.63M    0.17
    RaFM (S3)      0.2387       9.23M    0.24
    RaFM (S4)      0.2391      17.75M    1.35

We compare 4 sets of ranks, namely S1 = {32, 512}, S2 = {32, 128, 512}, S3 = {32, 64, 128, 256, 512}, and S4 = {32, 64, 96, 128, 160, ..., 480, 512}.
Therefore, ranks in S2 and S3 vary exponentially, while ranks in S4 vary arithmetically. RaFMs with these 4 settings achieve similar performance, all significantly better than the FM with 512 factors, which is the best of all FMs as shown in Fig. 3. RaFMs with S2 and S3 have fewer parameters due to their exponentially increasing rank settings. RaFM with S4 has a number of parameters comparable to S1, but its training time is much longer. This comparison shows the impact of the speed of rank increase on the computational burden, which is consistent with the discussion in Section 3.3.

6.1.4. RESULTS AND DISCUSSION

Table 3. Results on regression tasks (square loss; #param; train/test time relative to FM).

    ML 10M       square loss        #param   train / test time
      FM         0.8016 ± 0.0010    2.66M    1
      DiFacto    0.7950 ± 0.0011    1.79M    0.82 / 0.95
      MRMA       0.7952 ± 0.0006    4.11M    1.27 / 1.43
      RaFM       0.7870 ± 0.0008    1.57M    0.95 / 1.12

    ML 20M       square loss        #param   train / test time
      FM         0.8002 ± 0.0008    5.45M    1
      DiFacto    0.7948 ± 0.0005    3.22M    0.70 / 0.80
      MRMA       0.7855 ± 0.0011    8.43M    1.19 / 1.38
      RaFM       0.7807 ± 0.0009    3.63M    0.74 / 0.85

    AMovie       square loss        #param   train / test time
      FM         1.0203 ± 0.0046    3.25M    1
      DiFacto    1.0268 ± 0.0051    1.76M    0.75 / 0.75
      MRMA       1.0071 ± 0.0039    5.02M    1.27 / 1.27
      RaFM       0.9986 ± 0.0035    1.76M    0.75 / 0.75

Table 4. Results on classification tasks (log loss; AUC; #param; train/test time relative to FM).

    Frappe       log loss           AUC                #param    train / test time
      FM         0.1702 ± 0.0023    0.9771 ± 0.0008    1.38M     1
      DiFacto    0.1711 ± 0.0023    0.9771 ± 0.0004    0.61M     0.63 / 0.85
      RaFM       0.1447 ± 0.0015    0.9811 ± 0.0002    0.71M     0.73 / 0.85

    ML Tag       log loss           AUC                #param    train / test time
      FM         0.2538 ± 0.0009    0.9503 ± 0.0006    46.40M    1
      DiFacto    0.2529 ± 0.0007    0.9450 ± 0.004     16.97M    0.42 / 0.83
      RaFM       0.2387 ± 0.0005    0.9526 ± 0.0006    9.23M     0.24 / 0.71

    Avazu        log loss           AUC                #param    train / test time
      FM         0.3817 ± 0.0001    0.7761 ± 0.0003    18.81M    1
      DiFacto    0.3823 ± 0.0003    0.7778 ± 0.0003    10.83M    0.82 / 1.79
      RaFM       0.3801 ± 0.0002    0.7826 ± 0.0003    10.17M    0.85 / 1.20

    Criteo       log loss           AUC                #param    train / test time
      FM         0.4471 ± 0.0002    0.8030 ± 0.0002    35.87M    1
      DiFacto    0.4470 ± 0.0002    0.8030 ± 0.0004    19.70M    0.63 / 0.80
      RaFM       0.4451 ± 0.0001    0.8060 ± 0.0002    20.88M    0.67 / 0.84

As evidenced by Tables 3 and 4, our algorithm significantly outperforms the baselines. For example, the relative improvements on the square loss (log loss) criteria are 1%-2% on ML 10M, ML 20M, and AMovie, 15% on Frappe, 6% on ML Tag, and 0.5% on Avazu and Criteo. Empirically, all these improvements are regarded as significant in the research on the corresponding datasets. Taking Criteo as an example, RaFM reduces the log loss of FM by 0.002, while an improvement of 0.001 in log loss is already considered practically significant (Wang et al., 2017).

Moreover, the complexity of RaFM is reduced compared to FM. In fact, RaFM significantly reduces both the model size and the computational time: the model size of RaFM is only 20%-66% of that of FM, and the training time is 24%-95% of that of FM. Although DiFacto can also reduce the computational burden, it cannot guarantee the performance. The major reason is that DiFacto assumes the low-dimensional counterpart of a high-dimensional feature can be obtained by parameter sharing, an assumption that does not usually hold. In contrast, MRMA performs better than FM and DiFacto because it combines models with different ranks; however, its model size and training time are significantly larger than RaFM's, while its performance is a bit worse than that of RaFM due to the overfitting caused by its large number of parameters.

In summary, RaFM not only achieves better performance, but also reduces the model size and computational time. Therefore, the proposed RaFM model also has attractive potential in industrial applications.

6.2. Experiment B: Industrial-Level Click-Through-Rate Dataset

6.2.1. EXPERIMENT SETUP

Here we perform an experiment on the news CTR data provided by Tencent in order to show the potential of RaFM in industrial applications. We use the records of 8 consecutive days, of which the first 7 days are used as the training set and the last day as the test set. The dataset contains 1.7 billion records and 120 million features. We compare the proposed RaFM approach with LR, FM, and DiFacto; the performance criterion is AUC, which is consistent with the requirements of online prediction. For FM/DiFacto/RaFM, the dimensionality of the factors is chosen from {1, 2, 4, 8}. We use the FTRL algorithm to guarantee the sparsity of the learned weights, and use distributed learning to accelerate training.

6.2.2. RESULTS

We show the comparison of AUC and model size in Fig. 4. We use the number of parameters of LR, which is 13.19M in our experiments, as the unit of model size.
The training times of these approaches are not compared here because they are very similar (about 6 hours for training and 30 minutes for testing in our experiments); the training time is dominated by the communication time of distributed learning rather than by the computation time.

[Figure 4. Comparison of AUC and model size (in multiples of the LR parameter count) on the industrial-level CTR dataset, for LR, FM, DiFacto, and RaFM.]

Since the number of records in the training dataset is sufficiently large, it is natural to expect the AUC to increase continuously as the model size increases, and the AUCs in Fig. 4 could be improved further if we allowed a larger rank of factors. However, the model size needed to achieve such an improvement is the main issue, since a larger model brings a heavier burden on the parameter server. According to Fig. 4, RaFM increases the AUC by about 1% compared to LR, while its model size is only 1.55 times that of LR. In comparison, FM needs 7 times the model size of LR to achieve a similar performance. In other words, RaFM needs only 22% of the model size of FM to achieve a similar AUC, which means the proposed RaFM approach achieves a good trade-off between model size and performance, and has attractive potential in industrial applications.

7. Conclusion and Future Work

This paper proposes the RaFM model, which adopts pairwise interactions from embeddings with different ranks. RaFM can be stored and evaluated as efficiently as, or even more efficiently than, FMs with fixed ranks. Moreover, we provide a learning algorithm that efficiently trains all embeddings in one concise model, and prove that the training error of each constituent FM is bounded from above. Experiments demonstrate that RaFM not only achieves better performance on regression and classification datasets whose features have significantly varying frequencies of occurrence, but also reduces the computational burden of FMs. RaFM is a flexible framework, so an interesting direction for future work is to combine it with deep models to achieve better performance. A more effective hyperparameter tuning approach is also an attractive research direction.

References

Blondel, M., Fujino, A., and Ueda, N. Convex factorization machines. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 19-35. Springer, 2015.

Blondel, M., Fujino, A., Ueda, N., and Ishihata, M. Higher-order factorization machines. In Advances in Neural Information Processing Systems, pp. 3351-3359, 2016.

Du, C., Li, C., Zheng, Y., Zhu, J., and Zhang, B. Collaborative filtering with user-item co-autoregressive models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

He, X. and Chua, T.-S. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 355-364. ACM, 2017.

Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1965-1972. AAAI Press, 2017. ISBN 978-0-9992411-0-3. URL http://dl.acm.org/citation.cfm?id=3172077.3172161.
Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 43-50. ACM, 2016.

Lauly, S., Zheng, Y., Allauzen, A., and Larochelle, H. Document neural autoregressive distribution estimation. The Journal of Machine Learning Research, 18(1):4046-4069, 2017.

Li, D., Chen, C., Liu, W., Lu, T., Gu, N., and Chu, S. Mixture-rank matrix approximation for collaborative filtering. In Advances in Neural Information Processing Systems, pp. 477-485, 2017.

Li, L., Zhao, P., Zhou, J., and Li, X. A boosting framework of factorization machine. arXiv preprint arXiv:1804.06027, 2018.

Li, M., Liu, Z., Smola, A. J., and Wang, Y.-X. DiFacto: Distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 377-386. ACM, 2016.

Liu, C., Zhang, T., Zhao, P., Zhou, J., and Sun, J. Locally linear factorization machines. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2294-2300. AAAI Press, 2017.

Rendle, S. Factorization machines. In IEEE 10th International Conference on Data Mining (ICDM), pp. 995-1000. IEEE, 2010.

Rendle, S. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57, 2012.

Wang, R., Fu, B., Fu, G., and Wang, M. Deep & cross network for ad click predictions. CoRR, abs/1708.05123, 2017. URL http://arxiv.org/abs/1708.05123.

Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., and Chua, T.-S. Attentional factorization machines: Learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617, 2017.

Zheng, Y., Liu, C., Tang, B., and Zhou, H. Neural autoregressive collaborative filtering for implicit feedback. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 2-6. ACM, 2016a.

Zheng, Y., Tang, B., Ding, W., and Zhou, H. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 764-773. JMLR.org, 2016b. URL http://dl.acm.org/citation.cfm?id=3045390.3045472.