# Projective Quadratic Regression for Online Learning

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Wenye Ma, Tencent (wenyema@tencent.com)

## Abstract

This paper considers online convex optimization (OCO) problems, the paramount framework for online learning algorithm design. In the OCO setting, the loss function of the learning task is defined over streaming data, which makes OCO a powerful tool for modeling large-scale applications such as online recommender systems. Meanwhile, real-world data are usually extremely high-dimensional due to modern feature engineering techniques, so full quadratic regression is impractical. Factorization Machines and their variants capture feature interactions efficiently with a low-rank matrix model, but they cannot fulfill the OCO setting because of their non-convexity. In this paper, we propose a projective quadratic regression (PQR) model. First, it captures the important second-order feature information. Second, it is a convex model, so the requirements of OCO are fulfilled and the global optimal solution can be achieved. Moreover, existing online optimization methods such as Online Gradient Descent (OGD) and Follow-The-Regularized-Leader (FTRL) can be applied directly. In addition, we show that, with a proper choice of hyper-parameter, PQR has the same order of space and time complexity as the linear model and can therefore handle high-dimensional data. Experimental results demonstrate the accuracy and efficiency of the proposed PQR model in comparison with state-of-the-art methods.

## 1 Introduction

In the setting of online learning, the training data arrive in a streaming fashion and the system must respond immediately. Specifically, online learning is performed over a sequence of consecutive rounds indexed by time t. At round t, the learner updates a learning variable using simple calculations that are usually based on a single sample. Online learning is a powerful tool for a wide range of applications such as online rating, news feeding, and ad-click prediction. The paramount framework for online learning is online convex optimization (OCO) (Hazan 2016; Shalev-Shwartz 2012), which can be solved by many state-of-the-art online optimization algorithms such as Online Gradient Descent (OGD) (Zinkevich 2003) and Follow-The-Regularized-Leader (FTRL) (McMahan et al. 2013). It is worth noting that the study of OCO and the theoretical analysis of these algorithms rest on the convexity requirements of OCO, which are twofold: the loss function must be convex, and the feasible set of learning variables must also be convex.

Meanwhile, in real-world applications the feature dimension can be extremely high, so traditional linear models remain the most popular choice in this scenario. However, linear models fail to exploit interactions between features, so a large amount of manual feature engineering is needed. Quadratic regression captures the first-order information of each input feature as well as second-order pairwise feature interactions, but it is usually unacceptable in terms of space complexity when dealing with high-dimensional data, because the model size is of order O(d^2) for d-dimensional features.
Factorization Machine (FM) (Rendle 2010) is an efficient mechanism for capturing up to second-order feature information with a low-rank matrix in factorized form; it only needs O(md) space with rank m. FM achieves state-of-the-art performance in various applications (Juan, Lefortier, and Chapelle 2017; Li et al. 2016; Lu et al. 2017; Nguyen, Karatzoglou, and Baltrunas 2014; Rendle et al. 2011; Zhong et al. 2016; Chen et al. 2019b; 2019a) and has recently regained significant attention from researchers (Blondel et al. 2016a; 2016b; He et al. 2017; Xu et al. 2016). Unfortunately, the vanilla FM (Rendle 2010) and its variants (Blondel et al. 2016a; Cheng et al. 2014; Juan, Lefortier, and Chapelle 2017; Lu et al. 2017) are non-convex formulations and cannot fulfill the fundamental requirements of OCO. Several convexified FM formulations (Blondel, Fujino, and Ueda 2015; Lin et al. 2018; Yamada et al. 2017) have been proposed to address this weakness, but their computation cost is high, which makes them impractical for high-dimensional problems.

All of the above formulations rely on a low-rank assumption on the feature-interaction matrix to reduce the number of learning variables. This raises two questions: do we really need the low-rank assumption, and is there another way to save parameters? Our answer is that the low-rank assumption is not necessary; we can save parameters through the idea of sharing parameters. This intuition is based on our observation that the frequencies with which different features occur in real-world datasets vary greatly. For instance, Figure 1 shows the numbers of occurrences of different features in two public datasets, MovieLens (https://grouplens.org/datasets/movielens/) and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction/data).

Figure 1: Distribution of Feature Occurrence

As the figure shows, a large number of features scarcely occur in the datasets, while only a few features occur frequently. We therefore separate the features into two categories: high-frequency features and low-frequency features. We assume that the high-frequency features are more important, so their interactions are more valuable. Against this background, we propose a Projective Quadratic Regression (PQR) model. Specifically, for high-frequency features, we adopt the full quadratic form; among low-frequency features, we simply ignore the interactions; and we use a set of shared parameters for interactions between high- and low-frequency features. In fact, we can rewrite the global bias, the feature weight vector, and the feature-interaction matrix as a single augmented symmetric matrix. The intuition of sharing parameters mentioned above is equivalent to restricting the feasible set to a subset of the symmetric matrices. We show that this feasible set is convex and that the loss function can be rewritten as a convex function of the augmented matrix, so both requirements of OCO are fulfilled. Based on this scheme, the resulting PQR formulation seamlessly meets the requirements of the OCO framework. Moreover, due to its convexity, most online optimization algorithms such as FTRL can be applied directly to the PQR model, and theoretical analyses such as regret bounds and convergence rates remain valid. In addition, with a proper choice of hyper-parameter, this class of matrices can be represented by O(d) parameters, so PQR has the same space complexity as linear models.
We conduct extensive experiments on real-world datasets to evaluate the empirical performance of PQR. As shown by the experimental results on both online rating prediction and online binary classification tasks, PQR outperforms state-of-the-art online learning algorithms in terms of both accuracy and efficiency, especially on high-dimensional problems.

### Contribution

- The PQR model captures first-order feature information as well as second-order pairwise feature interactions. It is a convex formulation, so it fulfills the requirements of the online convex optimization setting and the optimal solution is theoretically guaranteed. Optimization algorithms for convex models can be applied directly to PQR, and all of their theoretical analyses remain valid.
- PQR is a general framework and can be applied to many tasks beyond online learning, such as batch and stochastic settings.
- PQR is efficient in terms of space complexity and computation cost. With a proper hyper-parameter, it has the same order of space and time complexity as the linear model.
- We evaluate the proposed PQR model on both online rating prediction and online binary classification tasks. Experimental results show that PQR outperforms state-of-the-art online learning methods.

In this section, we first introduce preliminaries of online learning and then develop the proposed projective quadratic regression (PQR) model for online convex problems. The PQR model is essentially a quadratic regression model with a constraint on the feasible set, built on the idea of sharing parameters. We prove the convexity of PQR so that it fits the online learning setting and state-of-the-art online optimization algorithms can be applied directly.

## 2.1 Online Convex Optimization

Online learning algorithms are built upon the assumption that training instances arrive sequentially rather than being available prior to the learning task. The paramount framework for online learning is Online Convex Optimization (OCO) (Hazan 2016; Shalev-Shwartz 2012). It can be seen as a structured repeated game between a learner and an adversary. At each round t ∈ {1, 2, ..., T}, the learner generates a decision point θ_t from a convex set S ⊆ R^n. The adversary then replies to the learner's decision with a convex loss function f_t : S → R, and the learner suffers the loss f_t(θ_t). The regret is defined by

$$\mathrm{regret}_T = \sum_{t=1}^{T} f_t(\theta_t) - \min_{\theta^* \in S} \sum_{t=1}^{T} f_t(\theta^*). \qquad (1)$$

The goal of the learner is to generate a sequence of decisions {θ_t | t = 1, 2, ..., T} so that the regret with respect to the best fixed decision in hindsight is sub-linear in T, i.e., lim_{T→∞} regret_T / T = 0. Sub-linearity implies that when T is large enough, the learner performs as well as the best fixed decision in hindsight. Based on the OCO framework, many online learning algorithms have been proposed and successfully applied in various applications (DeMarzo, Kremer, and Mansour 2006; Li and Hoi 2014; Zhao and Hoi 2013). Follow-The-Regularized-Leader (FTRL) (McMahan et al. 2013) is one of the most popular online learning algorithms; it is summarized in Algorithm 1 below, and its regret bound is given in Theorem 1 (Hazan 2016).
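As a minimal illustration of the OCO protocol itself, before turning to FTRL in Algorithm 1, the sketch below runs projected online gradient descent over a stream of convex losses and records the decisions that enter the regret in (1). The quadratic losses, step size, and unit-ball feasible set are illustrative choices of ours, not prescribed anywhere above.

```python
import numpy as np

def online_gradient_descent(grads, project, theta0, eta):
    """One pass of the OCO protocol with OGD: predict theta_t, observe the
    gradient of f_t at theta_t, then take a projected gradient step."""
    theta = theta0.copy()
    decisions = []
    for grad in grads:                      # the adversary reveals f_t round by round
        decisions.append(theta.copy())
        theta = project(theta - eta * grad(theta))
    return decisions

# Illustrative run: f_t(theta) = 0.5 * ||theta - a_t||^2 over the unit ball.
rng = np.random.default_rng(0)
targets = [rng.normal(size=3) for _ in range(100)]
grads = [(lambda theta, a=a: theta - a) for a in targets]
project = lambda v: v / max(1.0, np.linalg.norm(v))   # projection onto the unit ball
decisions = online_gradient_descent(grads, project, np.zeros(3), eta=0.1)
```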
Algorithm 1: Follow-The-Regularized-Leader (FTRL)
Input: step size η > 0, regularization function R, and convex set S
Initialize: θ_1 = argmin_{θ ∈ S} R(θ)
For t = 1 to T:
  Predict θ_t.
  Observe the loss function f_t and let g_t = ∇f_t(θ_t).
  Update θ_{t+1} = argmin_{θ ∈ S} { η Σ_{s=1}^{t} g_s^T θ + R(θ) }.

Theorem 1. The FTRL algorithm (Algorithm 1) attains the following bound on the regret:

$$\mathrm{regret}_T \le 2\eta \sum_{t=1}^{T} \|g_t\|_F^2 + \frac{R(\theta^*) - R(\theta_1)}{\eta}.$$

If an upper bound on the local norms is known, i.e., $\|g_t\|_F \le G_R$ for all t, then we can further optimize over the choice of η to obtain

$$\mathrm{regret}_T \le 2 D_R G_R \sqrt{2T}.$$

## 2.2 Quadratic Regression

General quadratic regression combines a linear model with pairwise feature interactions. Given an input feature vector $x \in \mathbb{R}^d$, the prediction $\hat{y}$ is obtained by

$$\hat{y} = b + w^\top x + \sum_{1 \le i \le j \le d} A_{i,j} x_i x_j, \qquad (2)$$

where $b \in \mathbb{R}$ is the bias term, $w \in \mathbb{R}^d$ is the first-order feature weight vector, and $A \in \mathbb{R}^{d \times d}$ is the second-order feature-interaction matrix. Since only the upper triangle of the matrix is needed, we may assume the matrix is symmetric, i.e., $A = A^\top \in \mathbb{R}^{d \times d}$. We then rewrite the bias term b, the first-order feature weight vector w, and the second-order feature-interaction matrix A as a single matrix C:

$$C = \begin{pmatrix} A & w \\ w^\top & b \end{pmatrix}. \qquad (3)$$

Therefore, the prediction $\hat{y}$ can be represented as a quadratic form of the extended feature vector $\hat{x} = (x^\top, 1)^\top$. In fact, we can define the quadratic regression model by

$$\hat{y}(C) = \tfrac{1}{2}\, \hat{x}^\top C \hat{x}, \quad C \in \mathcal{S}, \qquad (4)$$

where $\mathcal{S}$ is the set of all $(d+1) \times (d+1)$ symmetric matrices. Clearly, $\hat{y}$ is linear and thus convex in C. Under this setting, we can show that the composition of any convex function with the quadratic regression form is still convex in C. Specifically, we have:

Proposition 2. Let $K \subseteq \mathbb{R}^{(d+1) \times (d+1)}$ be a convex set, $f : \mathbb{R} \to \mathbb{R}$ a convex function, and $\hat{y}$ the quadratic form defined in (4). Then $f \circ \hat{y} : K \to \mathbb{R}$ is convex in $C \in K$.

Proof. Consider $\lambda \in [0, 1]$ and $C_1, C_2 \in K$. By the convexity of K, $C = \lambda C_1 + (1-\lambda) C_2 \in K$. Then, by the linearity of $\hat{y}$ and the convexity of f, we have

$$f(\hat{y}(C)) = f(\hat{y}(\lambda C_1 + (1-\lambda) C_2)) = f(\lambda \hat{y}(C_1) + (1-\lambda) \hat{y}(C_2)) \le \lambda f(\hat{y}(C_1)) + (1-\lambda) f(\hat{y}(C_2)),$$

which shows that $f \circ \hat{y}$ is convex.

## 2.3 Projective Quadratic Regression

The major problem with quadratic regression is that the feasible set contains all symmetric matrices, which is too large and impractical for real-world applications. Existing formulations usually rely on a low-rank assumption. The vanilla FM simply decomposes the matrix into the product of a low-dimensional matrix and its transpose, but this makes the model non-convex. Several convexified FM formulations (Lin et al. 2018; Yamada et al. 2017) rely on the same assumption and add a low-rank constraint by restricting the nuclear norm of the candidate matrix. However, this introduces operations such as singular value decomposition, so the overall computation cost is high, making these methods impractical for real applications with large, high-dimensional datasets. The main idea of our projective quadratic regression model is instead to restrict the feasible set to a subset of the symmetric matrices.
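As a minimal illustration of the augmented-matrix form, the sketch below builds C of equation (3) from (b, w, A) and evaluates the prediction (4); the function names are our own, and the dense NumPy arrays are used purely for illustration, which would not scale to high-dimensional data.

```python
import numpy as np

def build_augmented_matrix(b, w, A):
    """Stack the bias b, weight vector w, and symmetric interaction matrix A
    into the (d+1) x (d+1) matrix C of equation (3)."""
    d = w.shape[0]
    C = np.zeros((d + 1, d + 1))
    C[:d, :d] = A
    C[:d, d] = w
    C[d, :d] = w
    C[d, d] = b
    return C

def predict(C, x):
    """Quadratic-form prediction of equation (4): y_hat = 0.5 * x_hat^T C x_hat."""
    x_hat = np.append(x, 1.0)
    return 0.5 * x_hat @ C @ x_hat
```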
It is based not on a low-rank assumption but on the intuition of sharing parameters. This rests on the observation that the frequencies with which different features occur vary greatly, as shown in Figure 1. We therefore separate the features into two categories: high-frequency features and low-frequency features. We assume that the high-frequency features are more important, so their interactions are more valuable. Therefore, for interactions among high-frequency features we adopt the full quadratic form; among low-frequency features we simply ignore the interactions; and we use a set of shared parameters for interactions between high- and low-frequency features.

We denote by I = {1, 2, ..., d} the index set of all features, where d is the dimension of the feature space. The separation of high- and low-frequency features is defined as follows.

Definition 1. A feature separation is a bi-partition of the index set I, i.e.,

$$I = H \cup L \quad \text{with} \quad H \cap L = \emptyset, \qquad (6)$$

where H is the set of indices of all high-frequency features and L is that of the low-frequency features.

In practice, there are various ways to separate the high- and low-frequency features. For example, we can sample the training data, calculate the feature counts, and choose the top-k features as the high-frequency features and the rest as low-frequency features. Alternatively, we can set a threshold: any feature whose occurrence count exceeds the threshold is considered a high-frequency feature, and otherwise a low-frequency feature. In this paper, we use the top-k method, where k is a hyper-parameter that we call the order of the model.

After separating the indices, i.e., I = H ∪ L, the key idea of PQR is to capture feature interactions with the following matrix:

$$
A_{i,j} =
\begin{cases}
0 & \text{if } i = j, \\
p_{i,j} & \text{if } i < j \text{ and } i, j \in H, \\
q_i & \text{if } i < j,\ i \in H \text{ and } j \in L, \\
0 & \text{if } i < j \text{ and } i, j \in L,
\end{cases}
\qquad (7)
$$

where the p_{i,j} and q_i are learning parameters and the lower triangle is filled in by symmetry.

Definition 2. Given a separation I = H ∪ L, a matrix of the form (7) is called a PQR matrix with respect to this separation, denoted $A^{(H,L)}$.

Let us assume that the size of H is k, i.e., |H| = k, so |L| = d - k. Without loss of generality, if H = {1, 2, ..., k} and L = {k+1, k+2, ..., d}, then the matrix has the structure shown in Figure 2:

$$
A^{(H,L)} = \begin{pmatrix}
0 & p_{1,2} & \cdots & p_{1,k} & q_1 & q_1 & \cdots & q_1 \\
p_{1,2} & 0 & \cdots & p_{2,k} & q_2 & q_2 & \cdots & q_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
p_{1,k} & p_{2,k} & \cdots & 0 & q_k & q_k & \cdots & q_k \\
q_1 & q_2 & \cdots & q_k & 0 & 0 & \cdots & 0 \\
q_1 & q_2 & \cdots & q_k & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_1 & q_2 & \cdots & q_k & 0 & 0 & \cdots & 0
\end{pmatrix}
$$

Figure 2: $A^{(H,L)}$ with $H = \{1, 2, \ldots, k\}$ and $L = \{k+1, k+2, \ldots, d\}$

It is easy to see that the PQR matrix has $\frac{1}{2}k(k+1)$ learning parameters. Combined with the linear part of the model, the space complexity of the learning parameters is of order O(k² + d). In particular, if we choose k of order O(√d), the PQR model has the same order of computation cost as linear models. The computation cost is summarized in Table 1.

Table 1: Computation complexity

|       | Linear Models | FM Models | PQR         |
|-------|---------------|-----------|-------------|
| Time  | O(nd)         | O(nmd)    | O(nd + nk²) |
| Space | O(d)          | O(md)     | O(d + k²)   |

Now we show that the PQR model fits the OCO framework. For a given separation I = H ∪ L, let

$$\mathcal{T}^{(H,L)} = \left\{ C \;\middle|\; C = \begin{pmatrix} A^{(H,L)} & w \\ w^\top & b \end{pmatrix},\ A^{(H,L)} \text{ a PQR matrix},\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}. \qquad (8)$$

We then have:

Proposition 3. $\mathcal{T}^{(H,L)}$ is a convex set for any separation I = H ∪ L.

Proof. It is easy to verify that $\mathcal{T}^{(H,L)}$ is closed under addition and scalar multiplication, so it is a linear subspace and in particular a convex set.

We now define the projective quadratic regression (PQR) model as a quadratic regression model in which the feature-interaction matrix is restricted to be a PQR matrix. In summary, the proposed PQR model is

$$\hat{y}(C) = \tfrac{1}{2}\, \hat{x}^\top C \hat{x}, \quad \text{s.t.}\ C \in \mathcal{T}^{(H,L)}. \qquad (9)$$

By Propositions 2 and 3, the PQR model (9) fulfills the OCO setting.
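As a minimal sketch of the top-k separation and of the resulting parameter count, the helpers below count feature occurrences on a data sample and report the ½k(k+1) + d + 1 learning parameters of PQR; the function names and the sampling interface are illustrative assumptions of ours.

```python
from collections import Counter

def topk_separation(sampled_instances, d, k):
    """Split feature indices into high-frequency (H) and low-frequency (L) sets
    by counting occurrences on a data sample (Definition 1, top-k variant).

    sampled_instances : iterable of iterables of active feature indices
    d                 : total number of features
    k                 : the order of the model, |H| = k
    """
    counts = Counter()
    for active_indices in sampled_instances:
        counts.update(active_indices)
    high = {i for i, _ in counts.most_common(k)}
    low = set(range(d)) - high
    return high, low

def num_pqr_parameters(d, k):
    """Learning parameters of the PQR model: k(k+1)/2 interaction weights
    (p_ij for i < j in H, plus one shared q_i per i in H), d linear weights,
    and one bias term, i.e., O(k^2 + d) overall."""
    return k * (k + 1) // 2 + d + 1
```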
### Examples

Two important examples are online rating prediction and online binary classification. In the online rating prediction task, at each round the model receives a user-item pair and predicts the value of the incoming rating. Denote the instance arriving at round t as (u_t, i_t, y_t), where u_t, i_t, and y_t represent the user, the item, and the rating (a number from 1 to 5) given by user u_t to item i_t. The user feature and item feature (still denoted by u_t and i_t to save notation) can be represented as concatenations of one-hot vectors, and the input feature vector x_t is the concatenation of u_t and i_t. Upon the arrival of each instance, the model predicts the rating as $\hat{y}_t = \tfrac{1}{2}\hat{x}_t^\top C_t \hat{x}_t$, where $\hat{x}_t = (x_t^\top, 1)^\top$. The convex loss incurred at each round is the squared loss

$$f_t(\hat{y}(C_t)) = \tfrac{1}{2}\, \|\hat{y}(C_t) - y_t\|_2^2. \qquad (10)$$

The online binary classification task typically arises in click-through-rate (CTR) prediction. Here the instances are denoted (x_t, y_t) for t ∈ {1, 2, ..., T}, where x_t is the input feature vector and y_t ∈ {-1, +1} is the class label. At round t, the model predicts the label as $\mathrm{sign}(\hat{y}_t) = \mathrm{sign}(\tfrac{1}{2}\hat{x}_t^\top C_t \hat{x}_t)$, where $\hat{x}_t = (x_t^\top, 1)^\top$. The loss function is the logistic loss

$$f_t(\hat{y}(C_t)) = \log\!\left(1 + e^{-y_t\, \hat{y}(C_t)}\right). \qquad (11)$$

### Implementation Details

We first compute the separation I = H ∪ L, for instance by sampling the data, counting feature occurrences, and selecting the top-k features as H and the rest as L. The PQR model is then, in effect, a set of rules for crossing features: given a feature vector x, we expand it into a new vector, denoted $\tilde{x}$, defined by

$$
\tilde{x}_i = x_i \ \ (i \in I), \qquad
\tilde{x}_{d+ki+j} = x_i x_j \ \ (i, j \in H), \qquad
\tilde{x}_{d+k^2+i} = x_i \sum_{j \in L} x_j \ \ (i \in H). \qquad (12)
$$

Definition 3. Given a feature vector x and a separation I = H ∪ L, the vector $\tilde{x}$ defined in (12) is called the PQR expansion of x.

Therefore, training the PQR model on a dataset is essentially equivalent to training a linear model on the associated PQR expansions of that dataset. Algorithm 2 lists the implementation details for online rating prediction and online binary classification, following the FTRL-Proximal algorithm of (McMahan et al. 2013).

Algorithm 2: FTRL-Proximal for the PQR model
Pre-compute the separation I = H ∪ L.
Parameters: α > 0, β > 0, λ_1 > 0, λ_2 > 0
Initialize: z = n = 0
For t = 1 to T:
  Receive the feature vector x_t and compute its PQR expansion x̃_t by (12).
  Let I_t = {i | x̃_{t,i} ≠ 0}.
  For all i ∈ I_t:
    w_i = 0 if |z_i| < λ_1; otherwise w_i = -(z_i - sgn(z_i) λ_1) / ((β + √n_i)/α + λ_2).
  Compute p_t = w · x̃_t for rating prediction, or p_t = sigmoid(w · x̃_t) for binary classification.
  Observe the label y_t.
  For all i ∈ I_t:
    g_i = (p_t - y_t) x̃_{t,i}
    σ_i = (√(n_i + g_i²) - √n_i) / α
    z_i ← z_i + g_i - σ_i w_i
    n_i ← n_i + g_i²

### More Explanation on PQR

Like the vanilla FM, the idea behind the PQR model can be viewed as matrix factorization. In fact, the PQR matrix with respect to I = H ∪ L can be written as a decomposition of the form

$$A^{(H,L)} = P^\top M P, \qquad (13)$$

where P is a (k+1) × d matrix and M is a (k+1) × (k+1) matrix. M is symmetric and its diagonal entries are zero, i.e., $M_{i,j} = M_{j,i}$ and $M_{i,i} = 0$, so it has $\frac{1}{2}k(k+1)$ parameters, which are the feature-interaction learning parameters. The matrix P is a projection matrix that maps the original feature vector into a lower-dimensional space. Specifically, Px is a permutation of $[x_{i_1}, x_{i_2}, \ldots, x_{i_k}, x_L]$, where $i_l \in H$ for $l = 1, 2, \ldots, k$ and $x_L = \sum_{j \in L} x_j$.
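Below is a minimal sketch of the PQR expansion (12) for a sparse input, with the shared high-low interactions entering through the aggregate $\sum_{j \in L} x_j$, as in the projection Px above. The dictionary-based sparse representation and the symbolic slot keys (used here in place of the integer offsets of (12)) are our own simplifications.

```python
def pqr_expand(x_sparse, high_set, low_set):
    """PQR expansion of a sparse feature vector (Definition 3).

    x_sparse : dict mapping feature index -> nonzero value
    high_set : set of high-frequency feature indices H
    low_set  : set of low-frequency feature indices L
    Returns the nonzero entries of the expanded vector, keyed by
    ('lin', i) for linear terms, ('hh', i, j) with i < j in H for
    high-high interactions, and ('hl', i) with i in H for the shared
    interaction between x_i and the sum of the low-frequency features.
    """
    expanded = {('lin', i): v for i, v in x_sparse.items()}
    high_active = sorted((i, v) for i, v in x_sparse.items() if i in high_set)
    low_sum = sum(v for i, v in x_sparse.items() if i in low_set)
    for a, (i, vi) in enumerate(high_active):
        for j, vj in high_active[a + 1:]:
            expanded[('hh', i, j)] = vi * vj          # crossed with weight p_ij
        if low_sum != 0.0:
            expanded[('hl', i)] = vi * low_sum        # crossed with shared weight q_i
    return expanded

# Example: features 0 and 2 are high-frequency, 5 and 7 are low-frequency.
x = {0: 1.0, 2: 1.0, 5: 1.0, 7: 1.0}
print(pqr_expand(x, high_set={0, 2}, low_set={5, 7}))
```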
The key idea is that the projection matrix P is determined by the separation I = H ∪ L, so it does not need to be learned. That is also why we name the proposed model the projective quadratic regression model.

## 3 Related Work

### Vanilla FM

The vanilla Factorization Machine (Rendle 2010) defines the interaction matrix as $A = V V^\top$, where V is a d × m matrix with m ≪ d, and optimizes over the thin matrix V. It considers both the first- and second-order information of features and learns the feature interactions automatically. It yields state-of-the-art performance in various learning tasks, and its space complexity and computation cost are acceptable. However, it is a non-convex formulation and cannot fit into the online convex optimization setting. Moreover, the global optimal solution is not guaranteed by gradient-based optimization algorithms; bad local minima and saddle points may hurt performance, and it is difficult to study the choice of a good initial point theoretically.

### CFM

The Convex FM (CFM) introduced by Yamada et al. (2017) is a convex variant of the widely used Factorization Machine. It employs a linear-plus-quadratic model and regularizes the linear term with the ℓ2-norm and the quadratic term with the nuclear norm. It outperforms the vanilla FM on some applications, but it is impractical on high-dimensional datasets due to its high computation cost.

### CCFM

The Compact Convexified FM (CCFM) of Lin et al. (2018) is another convex model. By rewriting the bias term b, the first-order feature weight vector w, and the second-order feature-interaction matrix A as a single matrix C, and restricting the feasible set by intersecting it with a nuclear-norm ball, this formulation fits the OCO setting. However, its computation cost is very high, since a singular value decomposition is required at each iteration, which is unbearable in real applications.

### DiFacto

DiFacto, introduced by Li et al. (2016), is a variant of the Factorization Machine. It is not a convex model, but it also uses information about feature occurrence frequency: features are allocated embedding vectors of different ranks based on their occurrence frequency, with higher-frequency features receiving larger-rank embeddings. The pairwise interaction between features of different ranks is obtained by simply truncating the higher-rank embedding to the lower rank. Such truncations usually degrade performance; in effect, DiFacto reduces the computational burden of FM by sacrificing accuracy.

## 4 Experiments

In this section, we evaluate the performance of the proposed PQR model on two popular machine learning tasks: online rating prediction and online binary classification.

## 4.1 Experimental Setup

We compare the empirical performance of PQR with state-of-the-art variants of FM in the online learning setting. We construct the baselines by applying online learning approaches to the existing formulations of FM: vanilla FM (Rendle 2010), CFM (Yamada et al. 2017), and DiFacto (Li et al. 2016).
We also compare with the LR model (linear regression for online rating prediction and logistic regression for online binary classification), since it remains popular on high-dimensional sparse datasets. We skip CFM for the high-dimensional datasets because of its high computation cost. For the optimization method, we apply the state-of-the-art FTRL-Proximal algorithm (McMahan et al. 2013) with ℓ1 + ℓ2 regularization for LR and PQR; the implementation of PQR-FTRL is given in Algorithm 2. For the vanilla FM, we apply online gradient descent with ℓ1 regularization. In summary, the compared models are LR, vanilla FM, CFM, DiFacto, and PQR.

The vanilla FM and DiFacto are non-convex models, so we randomly initialize their learning parameters to escape poor local minima, and the results differ from run to run. We therefore repeat each experiment 20 times and report the average and standard deviation of the 20 scores (RMSE, AUC, or Log Loss). On the other hand, LR, CFM, and PQR are convex models, so we can initialize their learning parameters to zero (as in Algorithms 1 and 2); as long as the input train/test data are the same, the output scores are the same, and no repetition is needed. Moreover, for the PQR model, we also test different orders (values of k) on each dataset.

### Datasets

For the online rating prediction task, we use the standard MovieLens datasets: MovieLens-100K (ML100K), MovieLens-1M (ML1M), and MovieLens-10M (ML10M). For the online binary classification task, we select high-dimensional sparse datasets: Avazu, Criteo (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/), and KDD2012 (https://www.kaggle.com/c/kddcup2012-track1/data). These three high-dimensional datasets are preprocessed and can be downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The statistics of the datasets are summarized in Table 2. All datasets are randomly split into training (80%), validation (10%), and test (10%) sets. In our experiments, the training instances are fed to the model one by one, sequentially.

Table 2: Statistics of datasets

| Dataset | #Features  | #Instances  | Label     |
|---------|------------|-------------|-----------|
| ML100K  | 100,000    | 100,000     | Numerical |
| ML1M    | 100,000    | 1,000,209   | Numerical |
| ML10M   | 200,000    | 10,000,054  | Numerical |
| Avazu   | 1,000,000  | 40,528,967  | Binary    |
| Criteo  | 1,000,000  | 51,882,752  | Binary    |
| KDD2012 | 54,686,452 | 149,009,105 | Binary    |
Table 3: RMSE on MovieLens datasets

| RMSE   | LR     | FM              | CFM    | DiFacto         | PQR (k=500) | PQR (k=1000) | PQR (k=2000) |
|--------|--------|-----------------|--------|-----------------|-------------|--------------|--------------|
| ML100K | 1.0429 | 1.0316 ± 1.2e-3 | 1.0326 | 1.0312 ± 9.0e-4 | 1.0225      | 1.0215       | 1.0215       |
| ML1M   | 1.0487 | 1.0434 ± 5.0e-4 | 1.0403 | 1.0411 ± 9.0e-4 | 1.0321      | 1.0250       | 1.0190       |
| ML10M  | 0.9651 | 0.9605 ± 1.7e-3 | 0.9616 | 0.9572 ± 1.1e-3 | 0.9547      | 0.9541       | 0.9536       |

Table 4: AUC, Log Loss, Training Time, and Model Size on the high-dimensional datasets

| Avazu         | LR     | FM              | DiFacto         | PQR (k=2000) | PQR (k=4000) | PQR (k=8000) |
|---------------|--------|-----------------|-----------------|--------------|--------------|--------------|
| AUC           | 0.7562 | 0.7752 ± 1.8e-4 | 0.7743 ± 1.1e-4 | 0.7757       | 0.7777       | 0.7785       |
| Log Loss      | 0.3939 | 0.3840 ± 1.1e-4 | 0.3840 ± 1.2e-4 | 0.3830       | 0.3818       | 0.3812       |
| Training Time | 1      | 58.5            | 58.0            | 3.3          | 3.6          | 3.8          |
| Model Size    | 16.37K | 1.17M           | 0.29M           | 0.68M        | 1.32M        | 1.97M        |

| Criteo        | LR     | FM              | DiFacto         | PQR (k=200) | PQR (k=500) | PQR (k=1000) |
|---------------|--------|-----------------|-----------------|-------------|-------------|--------------|
| AUC           | 0.7151 | 0.7218 ± 1.6e-4 | 0.7216 ± 1.6e-4 | 0.7207      | 0.7220      | 0.7221       |
| Log Loss      | 0.5116 | 0.5068 ± 2.0e-4 | 0.5070 ± 8.0e-5 | 0.5076      | 0.5067      | 0.5066       |
| Training Time | 1      | 20.1            | 20.6            | 17.1        | 18.3        | 19.2         |
| Model Size    | 1.09K  | 17K             | 14.6K           | 18.30K      | 77.68K      | 98.61K       |

| KDD2012       | LR     | FM              | DiFacto         | PQR (k=20) | PQR (k=50) | PQR (k=100) |
|---------------|--------|-----------------|-----------------|------------|------------|-------------|
| AUC           | 0.7944 | 0.7968 ± 1.9e-4 | 0.7955 ± 3.0e-4 | 0.8013     | 0.8015     | 0.8016      |
| Log Loss      | 0.1541 | 0.1535 ± 5.1e-5 | 0.1534 ± 6.5e-5 | 0.1524     | 0.1524     | 0.1523      |
| Training Time | 1      | 10.4            | 11.8            | 3.1        | 3.1        | 3.2         |
| Model Size    | 9.11M  | 91.6M           | 72.2M           | 9.70M      | 9.71M      | 9.71M       |

### Evaluation Metrics

To evaluate the two tasks properly, we select different metrics: the Root Mean Square Error (RMSE) for the rating prediction task, and the AUC (Area Under Curve) and Log Loss for the binary classification task. For the latter, we also compare the training time and the model size (the number of nonzero parameters) to demonstrate the efficiency of the compared algorithms.

## 4.2 Online Rating Prediction

We use the MovieLens datasets for this task. For ML100K and ML1M, we use gender, age, occupation, and zip code as the user features and the movie genre as the item feature. We perform the standard discretization, use a one-hot representation for each feature, and concatenate them. For ML10M, since no demographic information about users is available, we use the user ids as the user feature instead. The rank of FM is selected from {2, 4, 8, 16, 32, 64, 128, 256, 512}. The coefficient of the regularization term and the learning rate are tuned in the range [0.0001, 10]. The RMSE of all compared algorithms is listed in Table 3. PQR achieves higher prediction accuracy than all the baselines, which illustrates its advantage.

## 4.3 Online Binary Classification

In many real applications, feature vectors are extremely high-dimensional, so a sparse representation is used instead, i.e., only nonzero key-value pairs are stored. We demonstrate the performance of PQR and the other approaches in this setting using the high-dimensional sparse datasets Avazu, Criteo, and KDD2012. Due to the high dimensionality, CFM is impractical, so we compare PQR with LR, vanilla FM, and DiFacto in two respects: accuracy (measured by AUC and Log Loss) and efficiency (measured by training time and model size). The rank of vanilla FM and DiFacto is selected from {4, 8, 16, 32, 64}. The coefficient of the regularization term and the learning rate are selected in the range [0.0001, 10]. The results are listed in Table 4. In most cases, PQR outperforms the baseline algorithms in both accuracy and efficiency.
The accuracy results demonstrate that PQR has better expressiveness than the linear model and low-rank FM models, and the efficiency results reflect the theoretical computation cost of the PQR model. Finally, for FM there is no theory on how to choose a proper rank, whereas the order of PQR has a clear meaning, which helps with parameter tuning.

## 5 Conclusion and Future Work

In this paper, we propose a projective quadratic regression (PQR) model for the online learning setting. It meets the requirements of the online convex optimization framework, and the global optimal solution can be achieved thanks to its convexity. In addition, we show that the computation cost of PQR can be kept low by choosing a suitable order. Finally, we demonstrate its effectiveness by comparison with state-of-the-art approaches. For future work, a more principled approach to selecting the order of PQR is an attractive direction.

## References

Blondel, M.; Fujino, A.; Ueda, N.; and Ishihata, M. 2016a. Higher-order factorization machines. In NIPS, 3351-3359.

Blondel, M.; Ishihata, M.; Fujino, A.; and Ueda, N. 2016b. Polynomial networks and factorization machines: New insights and efficient training algorithms. In ICML, 850-858.

Blondel, M.; Fujino, A.; and Ueda, N. 2015. Convex factorization machines. In ECML-PKDD, 19-35.

Chen, X.; Zheng, Y.; Wang, J.; Ma, W.; and Huang, J. 2019a. RaFM: Rank-aware factorization machines. In ICML.

Chen, X.; Zheng, Y.; Zhao, P.; Wang, J.; Ma, W.; and Huang, J. 2019b. A generalized locally linear factorization machine with supervised variational encoding. IEEE TKDE.

Cheng, C.; Xia, F.; Zhang, T.; King, I.; and Lyu, M. R. 2014. Gradient boosting factorization machines. In Proceedings of the 8th ACM Conference on Recommender Systems, 265-272.

DeMarzo, P.; Kremer, I.; and Mansour, Y. 2006. Online trading algorithms and robust option pricing. In STOC, 477-486.

Hazan, E. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization, 157-225.

He, X.; Zhang, H.; Wu, F.; Chua, T.-S.; Xiao, J.; and Ye, H. 2017. Attentional factorization machines: Learning the weight of feature interactions via attention networks. In IJCAI, 3119-3125.

Juan, Y.; Lefortier, D.; and Chapelle, O. 2017. Field-aware factorization machines in a real-world online advertising system. In Proceedings of the 26th International Conference on World Wide Web Companion, 680-688.

Li, B., and Hoi, S. C. 2014. Online portfolio selection: A survey. ACM Computing Surveys.

Li, M.; Liu, Z.; Smola, A. J.; and Wang, Y.-X. 2016. DiFacto: Distributed factorization machines. In WSDM, 377-386.

Lin, X.; Zhang, W.; Zhang, M.; Zhu, W.; Pei, J.; Zhao, P.; and Huang, J. 2018. Online compact convexified factorization machine. In Proceedings of the 2018 World Wide Web Conference, 1633-1642.

Lu, C.-T.; He, L.; Shao, W.; Cao, B.; and Yu, P. S. 2017. Multilinear factorization machines for multi-task multi-view learning. In WSDM, 701-709.

McMahan, H. B.; Holt, G.; Sculley, D.; Young, M.; Ebner, D.; Grady, J.; Nie, L.; Phillips, T.; Davydov, E.; Golovin, D.; Chikkerur, S.; Liu, D.; Wattenberg, M.; Hrafnkelsson, A. M.; Boulos, T.; and Kubica, J. 2013. Ad click prediction: A view from the trenches. In KDD.

Nguyen, T. V.; Karatzoglou, A.; and Baltrunas, L. 2014. Gaussian process factorization machines for context-aware recommendations. In SIGIR, 63-72.

Rendle, S.; Gantner, Z.; Freudenthaler, C.; and Schmidt-Thieme, L. 2011. Fast context-aware recommendations with factorization machines. In SIGIR, 635-644.

Rendle, S. 2010. Factorization machines. In ICDM, 995-1000.
Shalev-Shwartz, S. 2012. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 107-194.

Xu, J.; Lin, K.; Tan, P. N.; and Zhou, J. 2016. Synergies that matter: Efficient interaction selection via sparse factorization machine. In ICDM, 108-116.

Yamada, M.; Lian, W.; Goyal, A.; Chen, J.; Wimalawarne, K.; Khan, S. A.; Kaski, S.; Mamitsuka, H.; and Chang, Y. 2017. Convex factorization machine for toxicogenomics prediction. In KDD, 1215-1224.

Zhao, P., and Hoi, S. C. 2013. Cost-sensitive online active learning with application to malicious URL detection. In KDD.

Zhong, E.; Shi, Y.; Liu, N.; and Rajan, S. 2016. Scaling factorization machines with parameter server. In CIKM, 1583-1592.

Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 928-935.