# Multi-way Interacting Regression via Factorization Machines

Mikhail Yurochkin, Department of Statistics, University of Michigan. moonfolk@umich.edu
Xuan Long Nguyen, Department of Statistics, University of Michigan. xuanlong@umich.edu
Nikolaos Vasiloglou, LogicBlox. nikolaos.vasiloglou@logicblox.com

Abstract

We propose a Bayesian regression method that accounts for multi-way interactions of arbitrary orders among the predictor variables. Our model makes use of a factorization mechanism for representing the regression coefficients of interactions among the predictors, while the interaction selection is guided by a prior distribution on random hypergraphs, a construction which generalizes the Finite Feature Model. We present a posterior inference algorithm based on Gibbs sampling, and establish posterior consistency of our regression model. Our method is evaluated with extensive experiments on simulated data and demonstrated to be able to identify meaningful interactions in applications in genetics and retail demand forecasting.¹

¹ Code is available at https://github.com/moonfolk/MiFM.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

A fundamental challenge in supervised learning, particularly in regression, is the need for learning functions which produce accurate prediction of the response, while retaining the explanatory power for the role of the predictor variables in the model. The standard linear regression method is favored for the latter requirement, but it fails the former when there are complex interactions among the predictor variables in determining the response. The challenge becomes even more pronounced in a high-dimensional setting, where there are exponentially many potential interactions among the predictors, for which it is simply not computationally feasible to resort to standard variable selection techniques (cf. Fan & Lv (2010)). There are numerous examples where accounting for the predictors' interactions is of interest, including problems of identifying epistasis (gene-gene) and gene-environment interactions in genetics (Cordell, 2009), and modeling problems in political science (Brambor et al., 2006) and economics (Ai & Norton, 2003). In the business analytics of retail demand forecasting, a strong prediction model that also accurately accounts for the interactions of relevant predictors, such as seasons, product types, geography, promotions, etc., plays a critical role in the decision making of marketing design.

A simple way to address the aforementioned issue in the regression problem is to restrict our attention to lower order interactions (i.e., 2- or 3-way) among the predictor variables. This can be achieved, for instance, via a support vector machine (SVM) using polynomial kernels (Cristianini & Shawe-Taylor, 2000), which pre-determine the maximum order of predictor interactions. In practice, for computational reasons, the degree of the polynomial kernel tends to be small. Factorization machines (Rendle, 2010) can be viewed as an extension of SVM to sparse settings where most interactions are observed only infrequently, subject to a constraint that the interaction order (a.k.a. interaction depth) is given.
Neither SVM nor FM can perform any selection of predictor interactions, but several authors have extended the SVM with an ℓ1 penalty for the purpose of feature selection (Zhu et al., 2004), and FM with gradient boosting (Cheng et al., 2014), to select interacting features. It is also an option to perform linear regression on as many interactions as we can enumerate and combine it with regularization procedures for selection (e.g., LASSO (Tibshirani, 1996) or Elastic net (Zou & Hastie, 2005)). Such methods, however, are still not computationally feasible for accounting for interactions that involve a large number of predictor variables.

In this work we propose a regression method capable of adaptive selection of multi-way interactions of arbitrary order (MiFM for short), while avoiding the combinatorial complexity growth encountered by the methods described above. MiFM extends the basic factorization mechanism for representing the regression coefficients of interactions among the predictors, while the interaction selection is guided by a prior distribution on random hypergraphs. The prior, which does not insist on an upper bound on the order of interactions among the predictor variables, is motivated by, but also generalizes, the Finite Feature Model, a parametric form of the well-known Indian Buffet Process (IBP) (Ghahramani & Griffiths, 2005). We introduce a notion of the hypergraph of interactions and show how a parametric distribution over binary matrices can be utilized to express interactions of unbounded order. In addition, our generalized construction allows us to exert extra control on the tail behavior of the interaction order. The IBP was initially used for infinite latent feature modeling and later utilized in the modeling of a variety of domains (see the review paper by Griffiths & Ghahramani (2011)).

In developing MiFM, our contributions are the following: (i) we introduce a Bayesian multi-linear regression model which aims to account for the multi-way interactions among predictor variables; part of our model construction includes a prior specification on the hypergraph of interactions; in particular, we show how our prior can be used to model the incidence matrix of interactions in several ways; (ii) we propose a procedure to estimate coefficients of arbitrary interaction structures; (iii) we establish posterior consistency of the resulting MiFM model, i.e., the property that the posterior distribution on the true regression function represented by the MiFM model contracts toward the truth under some conditions, without requiring an upper bound on the order of the predictor interactions; and (iv) we present a comprehensive simulation study of our model and analyze its performance for retail demand forecasting and case-control genetics datasets with epistasis. The unique strength of the MiFM method is the ability to recover meaningful interactions among the predictors while maintaining a competitive prediction quality compared to existing methods that target prediction only.

The paper proceeds as follows. Section 2 introduces the problem of modeling interactions in regression and gives a brief background on Factorization Machines. Sections 3 and 4 carry out the contributions outlined above. Section 5 presents results of the experiments. We conclude with a discussion in Section 6.

2 Background and related work

Our starting point is a model which regresses a response variable $y \in \mathbb{R}$ onto observed covariates (predictor variables) $x \in \mathbb{R}^D$ by a non-linear functional relationship.
In particular, we consider a multi-linear structure to account for the interactions among the covariates in the model:

$$\mathbb{E}(Y \mid x) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{j=1}^{J} \beta_j \prod_{i \in Z_j} x_i. \quad (1)$$

Here, $w_i$ for $i = 0, \ldots, D$ are the bias and linear weights as in the standard linear regression model, $J$ is the number of multi-way interactions, and $Z_j, \beta_j$ for $j = 1, \ldots, J$ represent the interactions, i.e., the sets of indices of interacting covariates and the corresponding interaction weights, respectively. Fitting such a model is very challenging even if the dimension $D$ is of the magnitude of a dozen, since there are $2^D - 1$ possible interactions to choose from in addition to the other parameters. The goal of our work is to perform interaction selection and estimate the corresponding weights. Before doing so, let us first discuss a model that puts a priori assumptions on the number and the structure of interactions.

2.1 Factorization Machines

Factorization Machines (FM) (Rendle, 2010) are a special case of the general interactions model defined in Eq. (1). Let $J = \sum_{l=2}^{d} \binom{D}{l}$ and $Z := \bigcup_{j=1}^{J} Z_j = \bigcup_{l=2}^{d} \{(i_1, \ldots, i_l) \mid i_1 < \ldots < i_l;\ i_1, \ldots, i_l \in \{1, \ldots, D\}\}$, i.e., we restrict the set of interactions to all 2-, ..., d-way interactions, so that (1) becomes:

$$\mathbb{E}(Y \mid x) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{l=2}^{d} \sum_{i_1=1}^{D} \sum_{i_2=i_1+1}^{D} \cdots \sum_{i_l=i_{l-1}+1}^{D} \beta_{i_1,\ldots,i_l} \prod_{t=1}^{l} x_{i_t}, \quad (2)$$

where the coefficients $\beta_j := \beta_{i_1,\ldots,i_l}$ quantify the interactions. In order to reduce model complexity and handle sparse data more effectively, Rendle (2010) suggested factorizing the interaction weights using PARAFAC (Harshman, 1970): $\beta_{i_1,\ldots,i_l} := \sum_{f=1}^{k_l} \prod_{t=1}^{l} v^{(l)}_{i_t, f}$, where $V^{(l)} \in \mathbb{R}^{D \times k_l}$, $k_l \in \mathbb{N}$ and $k_l \ll D$ for $l = 2, \ldots, d$. Advantages of FM over SVM are discussed in detail by Rendle (2010). FMs turn out to be successful in recommendation systems setups, since they utilize various context information (Rendle et al., 2011; Nguyen et al., 2014). Parameter estimation is typically achieved via stochastic gradient descent, or, in the case of Bayesian FM (Freudenthaler et al., 2011), via MCMC. In practice only $d = 2$ or $d = 3$ are typically used, since the number of interactions, and hence the computational complexity, grows exponentially. We are interested in methods that can adapt to fewer interactions but of arbitrarily varying orders.

3 MiFM: Multi-way Factorization Machine

We start by defining a mathematical object that can encode the sets of interacting variables $Z_1, \ldots, Z_J$ of Eq. (1) and selecting an appropriate prior to model it.

3.1 Modeling the hypergraph of interactions

Multi-way interactions are naturally represented by hypergraphs, which are defined as follows.

Definition 1. Given $D$ vertices indexed by $S = \{1, \ldots, D\}$, let $Z = \{Z_1, \ldots, Z_J\}$ be a set of $J$ subsets of $S$. Then we say that $G = (S, Z)$ is a hypergraph with $D$ vertices and $J$ hyperedges.

A hypergraph can be equivalently represented by a binary incidence matrix. Therefore, with a slight abuse of notation, we recast $Z$ as the matrix of interactions, i.e., $Z \in \{0, 1\}^{D \times J}$, where $Z_{i_1 j} = Z_{i_2 j} = 1$ iff $i_1$ and $i_2$ are part of the hyperedge indexed by column/interaction $j$. Placing a prior on multi-way interactions is the same as specifying a prior distribution on the space of binary matrices. We will at first adopt the Finite Feature Model (FFM) prior (Ghahramani & Griffiths, 2005), which is based on the Beta-Bernoulli construction: $\pi_j \mid \gamma_1, \gamma_2 \overset{iid}{\sim} \mathrm{Beta}(\gamma_1, \gamma_2)$ and $Z_{ij} \mid \pi_j \overset{iid}{\sim} \mathrm{Bernoulli}(\pi_j)$. This simple prior has the attractive feature of treating the variables involved in each interaction (hyperedge) in a symmetric fashion, and it admits exchangeability among the variables inside interactions.
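The Beta-Bernoulli construction above is straightforward to simulate. The following is a minimal sketch (our own illustrative code, not the authors' released implementation; all names are ours) that draws an incidence matrix Z from the FFM prior:

```python
import numpy as np

def sample_ffm_prior(D, J, gamma1, gamma2, seed=None):
    """Draw a D x J binary interaction matrix Z from the FFM prior:
    pi_j ~ Beta(gamma1, gamma2), then Z_ij ~ Bernoulli(pi_j)."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(gamma1, gamma2, size=J)            # per-column inclusion probability
    return (rng.random((D, J)) < pi).astype(np.int8)

# Column sums of Z are the a priori interaction depths.
Z = sample_ffm_prior(D=10, J=5, gamma1=0.2, gamma2=1.0, seed=0)
print(Z.sum(axis=0))
```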
In Section 4 we will present an extension of the FFM which allows us to incorporate extra information about the distribution of the interaction degrees, and we will explain the choice of this parametric construction.

3.2 Modeling regression with multi-way interactions

Now that we know how to model unknown interactions of arbitrary order, we combine it with the Bayesian FM to arrive at a complete specification of MiFM, the Multi-way interacting Factorization Machine. We start with the specification of the hyperpriors:

$$\sigma \sim \Gamma(\alpha_1/2, \beta_1/2),\quad \lambda \sim \Gamma(\alpha_0/2, \beta_0/2),\quad \mu \sim N(\mu_0, 1/\gamma_0),\quad \lambda_k \sim \Gamma(\alpha_0/2, \beta_0/2),\ \mu_k \sim N(\mu_0, 1/\gamma_0) \ \text{for}\ k = 1, \ldots, K.$$

Interactions and their weights:

$$w_i \mid \mu, \lambda \sim N(\mu, 1/\lambda)\ \text{for}\ i = 0, \ldots, D,\quad Z \sim \mathrm{FFM}(\gamma_1, \gamma_2),\quad v_{ik} \mid \mu_k, \lambda_k \sim N(\mu_k, 1/\lambda_k)\ \text{for}\ i = 1, \ldots, D;\ k = 1, \ldots, K.$$

Likelihood specification given data pairs $(y_n, x_n = (x_{n1}, \ldots, x_{nD}))_{n=1}^{N}$:

$$y_n \mid \Theta \sim N(y(x_n, \Theta), \sigma), \ \text{where}\ y(x, \Theta) := w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{j=1}^{J} \sum_{k=1}^{K} \prod_{i \in Z_j} x_i v_{ik}, \quad (3)$$

for $n = 1, \ldots, N$, and $\Theta = \{Z, V, \sigma, w_{0,\ldots,D}\}$. Note that while the specification above utilizes Gaussian distributions, the main innovation of MiFM is the idea of coupling the incidence matrix $Z$ of the hypergraph of interactions with a low-rank matrix $V$ to model the mean response as in Eq. (1). Therefore, within the MiFM framework, different distributional choices can be made according to the problem at hand, e.g., a Poisson likelihood and Gamma priors for count data, or logistic regression for classification. Additionally, if selection of linear terms is desired, $\sum_{i=1}^{D} w_i x_i$ can be removed from the model, since the FFM can select linear interactions besides higher order ones.

3.3 MiFM for Categorical Variables

In numerous real-world scenarios such as retail demand forecasting, recommender systems, and genotype structures, most predictor variables may be categorical (e.g., color, season). Categorical variables with multiple attributes are often handled by so-called one-hot encoding, via vectors of binary variables (e.g., IS_blue; IS_red) which must be mutually exclusive. The FFM cannot immediately be applied to such structures, since it assigns positive probability to interactions between attributes of the same category. To this end, we model interactions between categories in $Z$, while with $V$ we model coefficients of interactions between attributes. For example, for an interaction between product type and season in $Z$, $V$ will have individual coefficients for jacket-summer and jacket-winter, leading to a more refined predictive model of jacket sales (see examples in Section 5.2).

We proceed to describe MiFM for the case of categorical variables as follows. Let $U$ be the number of categories and $d_u$ the set of attributes of category $u$, for $u = 1, \ldots, U$. Then $D = \sum_{u=1}^{U} \mathrm{card}(d_u)$ is the number of binary variables in the one-hot encoding and $\bigsqcup_{u=1}^{U} d_u = \{1, \ldots, D\}$. In this representation the input data of predictors is $X$, an $N \times U$ matrix, where $x_{nu}$ is the active attribute of category $u$ of observation $n$. The coefficient matrix is $V \in \mathbb{R}^{D \times K}$ and the interactions are $Z \in \{0, 1\}^{U \times J}$. All priors and hyperpriors are as before, while the mean response (3) is replaced by:

$$y(x, \Theta) := w_0 + \sum_{u=1}^{U} w_{x_u} + \sum_{j=1}^{J} \sum_{k=1}^{K} \prod_{u \in Z_j} v_{x_u k}. \quad (4)$$

Note that this model specification is easy to combine with continuous variables, allowing MiFM to handle data with mixed variable types.
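As a concrete reading of Eq. (3), the sketch below (illustrative code with our own names, assuming Z and V are given) evaluates the MiFM mean response for continuous predictors; thanks to the factorization, the interaction coefficient $\beta_j = \sum_k \prod_{i \in Z_j} v_{ik}$ never has to be stored explicitly:

```python
import numpy as np

def mifm_mean_response(x, w0, w, Z, V):
    """Evaluate Eq. (3): w0 + sum_i w_i x_i + sum_j sum_k prod_{i in Z_j} x_i v_ik.

    x: (D,) inputs; w0: scalar bias; w: (D,) linear weights;
    Z: (D, J) binary incidence matrix; V: (D, K) factor matrix."""
    y = w0 + w @ x
    for j in range(Z.shape[1]):
        idx = np.flatnonzero(Z[:, j])                  # variables in interaction j
        if idx.size:
            beta_j = np.prod(V[idx, :], axis=0).sum()  # sum_k prod_{i in Z_j} v_ik
            y += beta_j * np.prod(x[idx])
    return y
```

For the categorical variant of Eq. (4), the same loop would instead multiply the rows of V indexed by the active attributes $x_u$ of the categories participating in $Z_j$.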
3.4 Posterior Consistency of the MiFM

In this section we establish posterior consistency of the MiFM model, namely: the posterior distribution $\Pi$ of the conditional distribution $P(Y \mid X)$, given the training $N$-data pairs, contracts in a weak sense toward the truth as the sample size $N$ increases. Suppose that the data pairs $(x_n, y_n)_{n=1}^{N} \in \mathbb{R}^D \times \mathbb{R}$ are i.i.d. samples from the joint distribution $P^*(X, Y)$, according to which the marginal distribution for $X$ and the conditional distribution of $Y$ given $X$ admit density functions $f^*(x)$ and $f^*(y \mid x)$, respectively, with respect to Lebesgue measure. In particular, $f^*(y \mid x)$ is defined by

$$Y = y_n \mid X = x_n, \Theta^* \sim N(y(x_n, \Theta^*), \sigma),\ \text{where}\ \Theta^* = \{\beta^*_1, \ldots, \beta^*_J, Z^*_1, \ldots, Z^*_J\},\quad y(x, \Theta^*) := \sum_{j=1}^{J} \beta^*_j \prod_{i \in Z^*_j} x_i, \quad (5)$$

and $x_n \in \mathbb{R}^D$, $y_n \in \mathbb{R}$, $\beta^*_j \in \mathbb{R}$, $Z^*_j \subseteq \{1, \ldots, D\}$, for $n = 1, \ldots, N$, $j = 1, \ldots, J$. In the above, $\Theta^*$ represents the true parameter of the conditional density $f^*(y \mid x)$ that generates a data sample $y_n$ given $x_n$, for $n = 1, \ldots, N$. A key step in establishing posterior consistency for MiFM (here we omit linear terms since, as mentioned earlier, they can be absorbed into the interaction structure) is to show that our PARAFAC-type structure can approximate arbitrarily well the true coefficients $\beta^*_1, \ldots, \beta^*_J$ of the model given by (1).

Lemma 1. Given a natural number $J \geq 1$, $\beta_j \in \mathbb{R} \setminus \{0\}$ and $Z_j \subseteq \{1, \ldots, D\}$ for $j = 1, \ldots, J$, there exists $K_0 < J$ such that for all $K \geq K_0$ the system of polynomial equations $\beta_j = \sum_{k=1}^{K} \prod_{i \in Z_j} v_{ik}$, $j = 1, \ldots, J$, has at least one solution in terms of $v_{11}, \ldots, v_{DK}$.

The upper bound $K_0 = J - 1$ is only required when all interactions are of depth $D - 1$. This is typically not expected to be the case in practice; therefore smaller values of $K$ are often sufficient.
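As a quick numerical sanity check of Lemma 1 (an illustrative sketch of ours, not the paper's code; the dimensions, interaction sets, and target coefficients below are made up), one can search for a solution V of the polynomial system by nonlinear least squares and verify that the residual vanishes:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy check of Lemma 1: given interaction sets Z_j and targets beta_j,
# find V in R^{D x K} with beta_j = sum_k prod_{i in Z_j} v_ik.
D, K = 4, 3
Zs = [(0, 1), (0, 2, 3), (1, 2)]       # interactions as 0-based index tuples
beta = np.array([2.0, -1.5, 0.7])      # target interaction coefficients

def residuals(v):
    V = v.reshape(D, K)
    fitted = np.array([np.prod(V[list(z), :], axis=0).sum() for z in Zs])
    return fitted - beta

rng = np.random.default_rng(1)
sol = least_squares(residuals, x0=rng.normal(size=D * K))
print(np.abs(residuals(sol.x)).max())  # ~0 for generic starts: a solution exists
```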
By conditioning on the training data pairs $(x_n, y_n)$ to account for the likelihood induced by the PARAFAC representation, the statistician obtains the posterior distribution on the parameters of interest, namely $\Theta := (Z, V)$, which in turn induces the posterior distribution on the conditional density, to be denoted by $f(y \mid x)$, according to the MiFM model (3) without linear terms. The main result of this section is to show that under some conditions this posterior distribution $\Pi$ will place most of its mass on the true conditional density $f^*(y \mid x)$ as $N \to \infty$. To state the theorem precisely, we need to adopt a suitable notion of weak topology on the space of conditional densities, namely the set of $f(y \mid x)$, which is induced by the weak topology on the space of joint densities on $X, Y$, that is, the set of $f(x, y) = f^*(x) f(y \mid x)$, where $f^*(x)$ is the true (but unknown) marginal density on $X$ (see Ghosal et al. (1999), Sec. 2 for a formal definition).

Theorem 1. Given any true conditional density $f^*(y \mid x)$ given by (5), and assuming that the support of $f^*(x)$ is bounded, there is a constant $K_0 < J$ such that by setting $K \geq K_0$, the following statement holds: for any weak neighborhood $U$ of $f^*(y \mid x)$, under the MiFM model, the posterior probability $\Pi(U \mid (X_n, Y_n)_{n=1}^{N}) \to 1$ with $P^*$-probability one, as $N \to \infty$.

A sketch of the proof of this theorem is given in the Supplement.

4 Prior constructions for interactions: FFM revisited and extended

The adoption of the FFM prior on the hypergraph of interactions carries a distinct behavior, in contrast to the typical latent feature modeling setting. In a standard latent feature modeling setting (Griffiths & Ghahramani, 2011), each row of $Z$ describes one of the data points in terms of its feature representation; controlling row sums is desired to induce sparsity of the features. By contrast, for us a column of $Z$ is identified with an interaction; its sum represents the interaction depth, which we want to control a priori.

Interaction selection using the MCMC sampler. One interesting issue of practical consequence arises in the aggregation of the MCMC samples (details of the sampler are in the Supplement). When aggregating MCMC samples in the context of latent feature modeling, one would always obtain exactly $J$ latent features. However, in interaction modeling, different samples might have no interactions in common (i.e., no exactly matching columns), meaning that the support of the resulting posterior estimate can have up to $\min\{2^D - 1, IJ\}$ unique interactions, where $I$ is the number of MCMC samples. In practice, we can obtain marginal distributions of all interactions across MCMC samples and use those marginals for selection. One approach is to pick the $J$ interactions with the highest marginals; another is to consider the interactions whose marginal is above some threshold (e.g., 0.5). We resort to the second approach in our experiments in Section 5, as it seems to be in better agreement with the concept of "selection". Lastly, we note that while a data instance may a priori possess an unbounded number of features, the number of possible interactions in the data is bounded by $2^D - 1$; therefore taking $J \to \infty$ might not be appropriate. In any case, we do not want to encourage the number of interactions to be too high for regression modeling, as this would lead to overfitting. The above considerations led us to opt for a parametric prior such as the FFM for the interaction structure $Z$, as opposed to going fully nonparametric. $J$ can then be chosen using model selection procedures (e.g., cross-validation), or simply taken as a model input parameter.

Generalized construction and induced distribution of interaction depths. We now proceed to introduce a richer family of prior distributions on hypergraphs, of which the FFM is one instance. Our construction is motivated by the induced distribution on the column sums and the conditional probability updates that arise in the original FFM. Recall that under the FFM prior, interactions are a priori independent. Fix an interaction $j$; for the remainder of this section let $Z_i$ denote the indicator of whether variable $i$ is present in interaction $j$ or not (the subscript $j$ is dropped from $Z_{ij}$ to simplify notation). Let $M_i = Z_1 + \ldots + Z_i$ denote the number of variables among the first $i$ that are present in the corresponding interaction. By Beta-Bernoulli conjugacy, one obtains

$$P(Z_i = 1 \mid Z_1, \ldots, Z_{i-1}) = \frac{M_{i-1} + \gamma_1}{i - 1 + \gamma_1 + \gamma_2}.$$

This highlights the rich-gets-richer effect of the FFM prior, which encourages the existence of very deep interactions while most other interactions have very small depths. In some situations we may prefer a relatively larger number of interactions of depths in the medium range. An intuitive but somewhat naive alternative sampling process is to allow a variable to be included into an interaction according to its present "shallowness", quantified by $(i - 1 - M_{i-1})$ (instead of $M_{i-1}$ in the FFM). It can be verified that this construction leads to a distribution of interaction depths which concentrates most of its mass around $D/2$; moreover, exchangeability among the $Z_i$ would be lost. To maintain exchangeability, we define the sampling process for the sequence $Z = (Z_1, \ldots, Z_D) \in \{0, 1\}^D$ as follows: let $\sigma(\cdot)$ be a random uniform permutation of $\{1, \ldots, D\}$ and let $\sigma_1 = \sigma^{-1}(1), \ldots, \sigma_D = \sigma^{-1}(D)$. Note that $\sigma_1, \ldots, \sigma_D$ are discrete random variables and $P(\sigma_k = i) = 1/D$ for any $i, k = 1, \ldots, D$. For $i = 1, \ldots, D$, set

$$P(Z_{\sigma_i} = 1 \mid Z_{\sigma_1}, \ldots, Z_{\sigma_{i-1}}) = \frac{\alpha M_{i-1} + (1 - \alpha)(i - 1 - M_{i-1}) + \gamma_1}{i - 1 + \gamma_1 + \gamma_2}, \quad P(Z_{\sigma_i} = 0 \mid Z_{\sigma_1}, \ldots, Z_{\sigma_{i-1}}) = \frac{(1 - \alpha) M_{i-1} + \alpha (i - 1 - M_{i-1}) + \gamma_2}{i - 1 + \gamma_1 + \gamma_2}, \quad (6)$$

where $\gamma_1 > 0$, $\gamma_2 > 0$, $\alpha \in [0, 1]$ are given parameters and $M_i = Z_{\sigma_1} + \ldots + Z_{\sigma_i}$. A collection of $Z$ generated by this process is said to follow FFMα. When $\alpha = 1$ we recover the original FFM prior. When $\alpha = 0$, we get the other extremal behavior mentioned at the beginning of the paragraph. Allowing $\alpha \in [0, 1]$ yields a richer spectrum spanning these two distinct extremal behaviors. Details of the process and some of its properties are given in the Supplement.
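The sequential scheme of Eq. (6) is easy to simulate. The following sketch (our own illustrative code, not the authors' implementation) samples one interaction column and can be used to reproduce the qualitative depth behavior discussed below:

```python
import numpy as np

def sample_ffm_alpha(D, gamma1, gamma2, alpha, seed=None):
    """Sample one interaction column Z in {0,1}^D from FFM_alpha via Eq. (6).

    The uniform random permutation keeps the construction exchangeable;
    alpha = 1 recovers the original FFM."""
    rng = np.random.default_rng(seed)
    Z = np.zeros(D, dtype=np.int8)
    M = 0                                             # running depth M_{i-1}
    for step, idx in enumerate(rng.permutation(D)):   # step plays the role of i-1
        p1 = (alpha * M + (1 - alpha) * (step - M) + gamma1) / (step + gamma1 + gamma2)
        Z[idx] = rng.random() < p1
        M += Z[idx]
    return Z

# Depth M_D concentrates near D/2 for alpha = 0 and near D*gamma1/(gamma1+gamma2)
# (with heavy tails) for alpha = 1, matching the two extremes described above.
for a in (0.0, 0.7, 1.0):
    depths = [sample_ffm_alpha(30, 0.2, 1.0, a, seed=s).sum() for s in range(2000)]
    print(a, round(float(np.mean(depths)), 1))
```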
Here we briefly describe how FFMα a priori ensures the "poor gets richer" behavior and offers extra flexibility in modeling interaction depths compared to the original FFM. The depth of an interaction of $D$ variables is described by the distribution of $M_D$. Consider the conditionals obtained for a Gibbs sampler where the index of the variable to be updated is random and based on $P(\sigma_D = i \mid Z)$ (it is simply $1/D$ for FFM1). Suppose we want to assess how likely it is that a variable is added to an existing interaction, via the expression $\sum_{i : Z_i^{(k)} = 0} P(Z_i^{(k+1)} = 1, \sigma_D = i \mid Z^{(k)})$, where $k + 1$ indexes the next iteration of the Gibbs sampler's conditional update. This probability is a function of $M_D^{(k)}$; for small values of $M_D^{(k)}$ it quantifies the tendency for the "poor gets richer" behavior. For FFM1 it is given by

$$\frac{D - M_D^{(k)}}{D} \cdot \frac{M_D^{(k)} + \gamma_1}{D - 1 + \gamma_1 + \gamma_2}.$$

In Fig. 1(a) we show that FFM1's behavior is the opposite of "poor gets richer", while $\alpha \leq 0.7$ appears to ensure the desired property. Next, in Fig. 1(b-f) we show the distribution of $M_D$ for various $\alpha$, which exhibits a broad spectrum of behaviors.

[Figure 1: $D = 30$, $\gamma_1 = 0.2$, $\gamma_2 = 1$. (a) Probability of increasing the interaction depth as a function of the current depth, for $\alpha \in \{0.0, 0.5, 0.7, 0.9, 1.0\}$; (b-f) distributions of $M_D$ under FFMα for different $\alpha$, with means 15.0, 13.5, 11.9, 8.3, 5.0 and variances 2.6, 7.4, 15.4, 38.7, 60.0, respectively.]
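For concreteness, the FFM1 expression above can be evaluated directly (an illustrative sketch of ours; the analogous curves for $\alpha \neq 1$ additionally require the non-uniform index distribution $P(\sigma_D = i \mid Z)$ detailed in the Supplement):

```python
import numpy as np

# FFM_1 probability that a random-scan Gibbs update adds a variable to an
# interaction of current depth M: (D - M)/D * (M + gamma1)/(D - 1 + gamma1 + gamma2).
def prob_increase_depth_ffm1(M, D, gamma1, gamma2):
    return (D - M) / D * (M + gamma1) / (D - 1 + gamma1 + gamma2)

D, g1, g2 = 30, 0.2, 1.0
curve = np.array([prob_increase_depth_ffm1(M, D, g1, g2) for M in range(D + 1)])
print(np.round(curve[:6], 4))  # increasing in M: the opposite of "poor gets richer"
```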
5 Experimental Results

5.1 Simulation Studies

We compare the MiFM methods against a variety of other regression techniques from the literature, including Bayesian Factorization Machines (FM), lasso-type regression, Support Vector Regression (SVR), and a multilayer perceptron neural network (MLP).² The comparisons are done on the basis of prediction accuracy of responses (root mean squared error on held-out data), quality of the regression coefficient estimates, and the interactions recovered.

² Random Forest Regression and optimization-based FM showed worse results than the other methods.

5.1.1 Predictive Performance

In this set of experiments we demonstrate that MiFMs with either α = 0.7 or α = 1 have dominant predictive performance when high order interactions are in play. In Fig. 2(a) we analyzed 70 random interactions of varying orders. We see that MiFM can handle arbitrary complexity of the interactions, while other methods are competitive only when the interaction structure is simple (i.e., linear or 2-way, on the right of Fig. 2(a)).

[Figure 2: RMSE for the simulation experiments: (a) interaction depths (x-axis: proportion of 1- and 2-way interactions); (b) data with different ratios of continuous to categorical variables; (c) quality of the MiFM1 and MiFM0.7 coefficients, compared against OLS and Elastic net refits; (d) exact recovery proportion of the interactions by MiFMα for different α and data scenarios.]

Next, to assess the effectiveness of MiFM in handling categorical variables (cf. Section 3.3), we vary the number of continuous variables from 1 (with 29 attributes across categories) to 30 (no categorical variables). Results in Fig. 2(b) demonstrate that our models can handle both variable types in the data (including continuous-categorical interactions) and still exhibit competitive RMSE performance.

5.1.2 Interactions Quality

Coefficients of the interactions. This experiment verifies the posterior consistency result of Theorem 1 and validates our factorization model for coefficient approximation. In Fig. 2(c) we compare MiFMs against OLS fitted with the corresponding sets of chosen interactions. Additionally, we benchmark against Elastic net (Zou & Hastie, 2005) based on the expanded data matrix with interactions of all depths included, that is, $2^D - 1$ columns, and a corresponding OLS with only the selected interactions.

Selection of the interactions. In this experiment we assess how well MiFM can recover true interactions. We consider three interaction structures: a realistic one with five linear, five 2-way, three 3-way, and one each of 4-, ..., 8-way interactions, and two artificial ones with 15 interactions that are either all 4-way or all 6-way, to challenge our model. Both binary and continuous variables are explored. Fig. 2(d) shows that MiFM can exactly recover up to 83% of the interactions, and with α = 0.8 it recovers 75% of the interactions in 4 out of 6 scenarios. The situation with 6-way interactions is more challenging: 36% are recovered for binary data and almost half for continuous data. It is interesting to note that lower values of α handle binary data better, while higher values are more appropriate for continuous data, which is especially noticeable in the "only 6-way" case. We think this might be related to the fact that high order interactions between binary variables are very rare in the data (i.e., a product of 6 binary variables is equal to 0 most of the time), so we need a prior eager to explore (α = 0) to find them.

5.2 Real world applications

5.2.1 Finding epistasis

Identifying epistasis (i.e., interactions between genes) is one of the major questions in the field of human genetics. Interactions between multiple genes and environmental factors can often tell a lot more about the presence of a certain disease than any of the genes individually (Templeton, 2000). Our analysis of epistasis is based on the data from Himmelstein et al. (2011). These authors show that interactions between single nucleotide polymorphisms (SNPs) are often powerful predictors of various diseases, while individually the SNPs might not contain important information at all.
They developed a model-free approach to simulate data mimicking relationships between complex gene interactions and the presence of a disease. We used datasets with five SNPs and either 3-, 4-, and 5-way interactions or only 5-way interactions. For this experiment we compared MiFM1 and MiFM0; refitted logistic regressions for each of our models based on the selected interactions (LMiFM1 and LMiFM0); a multilayer perceptron with 3 layers; and Random Forest.³ Results in Table 1 demonstrate that MiFM produces competitive performance compared to the very best black-box techniques on this data set, while it also selects the interacting genes (i.e., finds epistasis). We don't know which of the 3- and 4-way interactions are present in the data, but since there is only one possible 5-way interaction we can check whether it was identified or not: both MiFM1 and MiFM0 had a 5-way interaction in at least 95% of the posterior samples.

³ FM, SVM and logistic regression had low accuracy of around 50% and are not reported.

Table 1: Prediction Accuracy on the Held-out Samples for the Gene Data

                 MiFM1   MiFM0   LMiFM1   LMiFM0   MLP     RF
3-, 4-, 5-way    0.775   0.771   0.883    0.860    0.870   0.887
only 5-way       0.649   0.645   0.628    0.623    0.625   0.628

[Figure 3: MiFM1 store-month-year interaction: (a) store in Merignac; (b) store in Perols. MiFM0 city-store-day of week-week of year interaction: (c) store in Merignac; (d) store in Perols.]

5.2.2 Understanding retail demand

We finally report the analysis of data obtained from a major retailer with stores in multiple locations all over the world. This dataset has 430k observations and 26 variables, spanning over 1100 binary variables after the one-hot encoding. Sales of a variety of products on different days and in different stores are provided as the response. We compare MiFM1 and MiFM0, both fitted with $K = 12$ and $J = 150$, against Factorization Machines in terms of the adjusted mean absolute percent error $\mathrm{AMAPE} = 100 \cdot \frac{\sum_n |\hat{y}_n - y_n|}{\sum_n y_n}$, a common metric for evaluating sales forecasts. FM is currently the method of choice at the company for this data set, partly because the data is sparse and similar in nature to recommender systems data. AMAPE for MiFM1 is 92.4; for MiFM0, 92.45; for FM, 92.0.
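For reference, the metric as written above can be computed as follows (a small sketch with our own naming; it assumes a strictly positive total of true sales so the denominator is nonzero):

```python
import numpy as np

def amape(y_true, y_pred):
    """Adjusted mean absolute percent error as defined above:
    100 * sum_n |yhat_n - y_n| / sum_n y_n."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.abs(y_pred - y_true).sum() / y_true.sum()

print(amape([10.0, 0.0, 5.0], [8.0, 1.0, 5.0]))  # 20.0
```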
Posterior analysis of predictor interactions. The unique strength of MiFM is its ability to provide valuable insights about the data through posterior analysis. MiFM1 recovered 62 non-linear interactions, among which there are five 3-way and three 4-way. MiFM0 selected 63 non-linear interactions, including nine 3-way and four 4-way. We note that the choice α = 0 was made to explore deeper interactions, and indeed MiFM0 has more deep interactions than MiFM1. Coefficients of a 3-way interaction of MiFM1 for two stores in France across years and months are shown in Fig. 3(a,b). We observe different behavior across the stores, which would not be captured by a lower order interaction. In Fig. 3(c,d) we plot coefficients of a 4-way MiFM0 interaction for the same two stores in France. It is interesting to note the negative correlation between the Saturday and Sunday coefficients for the store in Merignac, while the store in Perols is not affected by this interaction; this is an example of how MiFM can select interactions between attributes across categories.

6 Discussion

We have proposed a novel regression method which is capable of learning interactions of arbitrary orders among the regression predictors. Our model extends the Finite Feature Model and utilizes the extension to specify a hypergraph of interactions, while adopting a factorization mechanism for representing the corresponding coefficients. We found that MiFM performs very well when there are important interactions among a relatively high number (higher than two) of predictor variables. This is a situation which existing modeling techniques may be ill-equipped to describe and recover. There are several future directions that we would like to pursue. A thorough understanding of the fully nonparametric version of FFMα is of interest, that is, when the number of columns is taken to infinity. Such understanding may lead to an extension of the IBP and new modeling approaches in various domains.

Acknowledgments

This research is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, a research gift from Adobe Research, and a Margaret and Herman Sokol Faculty Award.

References

Ai, Chunrong and Norton, Edward C. Interaction terms in logit and probit models. Economics Letters, 80(1):123–129, 2003.

Brambor, Thomas, Clark, William Roberts, and Golder, Matt. Understanding interaction models: Improving empirical analyses. Political Analysis, 14(1):63–82, 2006.

Cheng, Chen, Xia, Fen, Zhang, Tong, King, Irwin, and Lyu, Michael R. Gradient boosting factorization machines. In Proceedings of the 8th ACM Conference on Recommender Systems, pp. 265–272. ACM, 2014.

Cordell, Heather J. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics, 10(6):392–404, 2009.

Cristianini, Nello and Shawe-Taylor, John. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

Fan, Jianqing and Lv, Jinchi. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101, 2010.

Freudenthaler, Christoph, Schmidt-Thieme, Lars, and Rendle, Steffen. Bayesian factorization machines. 2011.

Ghahramani, Zoubin and Griffiths, Thomas L. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pp. 475–482, 2005.

Ghosal, Subhashis, Ghosh, Jayanta K, and Ramamoorthi, RV. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.

Griffiths, Thomas L and Ghahramani, Zoubin. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.

Harshman, Richard A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.

Himmelstein, Daniel S, Greene, Casey S, and Moore, Jason H. Evolving hard problems: generating human genetics datasets with a complex etiology. BioData Mining, 4(1):1, 2011.

Nguyen, Trung V, Karatzoglou, Alexandros, and Baltrunas, Linas. Gaussian process factorization machines for context-aware recommendations. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 63–72. ACM, 2014.

Rendle, Steffen. Factorization machines. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 995–1000. IEEE, 2010.
Rendle, Steffen, Gantner, Zeno, Freudenthaler, Christoph, and Schmidt-Thieme, Lars. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 635–644. ACM, 2011.

Templeton, Alan R. Epistasis and complex traits. In Epistasis and the Evolutionary Process, pp. 41–57, 2000.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

Zhu, Ji, Rosset, Saharon, Hastie, Trevor, and Tibshirani, Rob. 1-norm support vector machines. In Advances in Neural Information Processing Systems, 16(1):49–56, 2004.

Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.