# Polyhedron Attention Module: Learning Adaptive-order Interactions

Tan Zhu (University of Connecticut, tan.zhu@uconn.edu), Fei Dou (University of Georgia, fei.dou@uga.edu), Xinyu Wang (University of Connecticut, xinyu.wang@uconn.edu), Jin Lu (University of Georgia, jin.lu@uga.edu), Jinbo Bi (University of Connecticut, jinbo.bi@uconn.edu)

Abstract

Learning feature interactions can be the key to multivariate predictive modeling. ReLU-activated neural networks create piecewise linear prediction models. Other nonlinear activation functions lead to models with only high-order feature interactions, thus lacking interpretability. Recent methods either incorporate candidate polynomial terms of fixed orders into deep learning, which is subject to combinatorial explosion, or learn interaction orders that are difficult to adapt to different regions of the feature space. We propose a Polyhedron Attention Module (PAM) to create piecewise polynomial models: the input space is split into polyhedrons that define the different pieces, and on each piece the hyperplanes that define the polyhedron boundary multiply to form the interactive terms, resulting in interactions whose order adapts to each piece. PAM is interpretable, identifying important interactions in predicting a target. Theoretical analysis shows that PAM has stronger expression capability than ReLU-activated networks. Extensive experimental results demonstrate the superior classification performance of PAM on massive click-through rate prediction datasets, and PAM can learn meaningful interaction effects in a medical problem.

1 Introduction

Learning feature interactions provides insights into how predictors (features in $x$) interact to vary a dependent variable $y$. It can significantly improve prediction performance in a wide range of research areas, such as recommendation systems [1, 2, 3, 4], genetic analysis [5, 6] and neuroscience [7, 8], and it helps explain the decision-making of a complex model. Interactions among features can significantly improve a model's expression capability. For a simple example, by incorporating the two-way interaction effect between gender (0/1: female and male) and age into the linear regression model $\text{height} \approx w_1\,\text{gender} + w_2\,\text{age} + w_3\,\text{gender}\cdot\text{age} + w_0$ (where $w_0, w_1, w_2, w_3$ are trainable parameters), the effect of age on height for females differs from that for males ($w_2$ vs. $w_2 + w_3$). A predictive model $f(x)$ has a $k$-way interaction effect if it satisfies [9]:

Definition 1 ($k$-way interaction effect) Let $x$ be the input of a function $f(x): \mathbb{R}^p \rightarrow \mathbb{R}$, and $x_i$ be the $i$th feature of $x$. Let $I \subseteq \{1, 2, ..., p\}$ be a set of feature indices and $k$ ($k \geq 2$) be the cardinality of $I$. If
$$\mathbb{E}_x\left[\frac{\partial^{|I|} f(x)}{\partial x_{i_1}\,\partial x_{i_2} \cdots \partial x_{i_k}}\right]^2 > 0, \tag{1}$$
then $f(x)$ exhibits a $k$-way interaction effect among the features indexed by $i_1, i_2, ..., i_k \in I$.

A $k$-th order polynomial $f(x)$ can define $k$-order interactions which may not be $k$-way. A $k$-way interaction must occur among $k$ different features. For instance, a 2-way interaction between $x_i$ and $x_j$ may occur as $x_i x_j$ or $x_i^2 x_j^3$, so it can be generally written as $\sum_{m_i, m_j} w_{m_i, m_j} x_i^{m_i} x_j^{m_j}$ for some nonzero $m_i$ and $m_j$ if the coefficients $w$ are nonzero. However, high-dimensional features can generate enormous numbers of candidate interactions due to combinatorial explosion, causing the curse of dimensionality. For example, $k$ binary features can have $2^k$ possible $k$-way interactions.
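To make Definition 1 concrete, here is a minimal numerical sketch (not from the paper; toy weights chosen for illustration) that estimates the mixed partial derivative in Eq. 1 by central finite differences for the gender-age example above; a nonzero value flags a 2-way interaction.

```python
import numpy as np

# Toy model from the height example: f = w1*x1 + w2*x2 + w3*x1*x2 + w0.
w0, w1, w2, w3 = 1.0, 0.5, 0.8, 0.3

def f(x):
    x1, x2 = x
    return w1 * x1 + w2 * x2 + w3 * x1 * x2 + w0

def mixed_partial(fun, x, i, j, h=1e-4):
    """Central finite-difference estimate of d^2 fun / (dx_i dx_j) at x."""
    x = np.asarray(x, dtype=float)
    ei = np.zeros_like(x); ei[i] = h
    ej = np.zeros_like(x); ej[j] = h
    return (fun(x + ei + ej) - fun(x + ei - ej)
            - fun(x - ei + ej) + fun(x - ei - ej)) / (4 * h * h)

# Definition 1 with I = {1, 2}: a nonzero squared expected mixed partial
# signals a 2-way interaction between the two features.
xs = np.random.default_rng(0).random((1000, 2))
score = np.mean([mixed_partial(f, x, 0, 1) for x in xs]) ** 2
print(score)  # ~ w3**2 = 0.09 > 0, so f has a 2-way interaction effect
```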
This challenge in learning feature interactions has attracted tremendous research interest across models ranging from logistic regression to deep neural networks (DNN). Early methods learn interaction effects by multiplying features together and employing the products as input to an additive model. Two lines of methods have been used to reduce the number of candidate interactions. Penalization methods (e.g., Hierarchical LASSO [10, 11] and Group-LASSO [12]) assume a hierarchy of interactions, so high-way interactions may be eliminated by removing low-way interactive terms. Factorization Machines (FM) require fewer parameters to fit all predefined interaction terms, which implicitly selects among candidate interactions [13, 14, 15]. Only 2- and/or 3-way interactions were considered by these methods. To learn interactions among more features (higher-way), models such as FM-supported NN [16], product-based NN [17], AutoInt+ [18], DESTINE [19] and AFN [20] predict the target with sigmoidal-activated (such as the softmax or sigmoid function) DNNs. However, sigmoidal-activated DNN models only capture extremely high-order interactions among all features according to Definition 1, because sigmoidal activation functions have continuous (non-zero) derivatives up to infinite order with respect to all input features.

To capture interaction effects among a subset of features, an increasing number of works adopt a ReLU-activated plain DNN, which fits a piecewise linear function [21], as the backbone of interaction learning models. Models such as MaskNet [22] first form interaction effects and then use them as the input of a ReLU-activated DNN, also known as the stacked structure. Recently, more state-of-the-art performance has been achieved with models such as DeepIM [23], DeepFM [24], DCN [37], xDeepFM [26], AOANet [4], DCN-V2 [25], EDCN [27] and FinalMLP [28] in a parallel structure, which combines the outputs of the proposed feature interaction learning models with those of a ReLU-activated plain DNN by operators such as summation, concatenation or the Hadamard product. By fusing ReLU-activated DNNs with models that explicitly incorporate interaction effects, piecewise polynomial functions can be formed. Rather than predicting the target with a highly non-linear function infinitely differentiable on the input space, these methods divide the input space into pieces (e.g., polyhedrons), map instances to different pieces, and predict the target with a piece-specific polynomial function. For all existing models, the piece-specific polynomial functions fit the same format of interactions (fixed $\prod x_{i_1}^{m_1} x_{i_2}^{m_2} \cdots x_{i_k}^{m_k}$ terms for a specific $k$) across all pieces once the pieces are identified. However, data instances belonging to different pieces may endorse interactions in different ways. To address this problem, we propose a new type of attention module called the Polyhedron Attention Module (PAM). The main contributions of this paper are summarized as follows:

- PAM splits the input space into polyhedrons and predicts the target using a piecewise polynomial function which contains an attention term for each piece. The attention captures interaction effects of adaptive orders for different polyhedrons according to their local structure.
- We propose a model interpretation approach for PAM to identify important interaction effects for any given data instance.
- We prove a universal approximation theorem for PAM, and show that DNNs incorporating PAMs need fewer parameters than ReLU-activated plain DNNs to maintain the same level of approximation error when fitting functions in the Sobolev space.
- Empirical studies validate PAM on benchmark datasets that are previously recognized to require feature interaction learning, such as the massive Criteo (33 million samples, 2.1 million features) and Avazu (28.3 million samples, 1.5 million features) click-through-rate datasets. Using a medical database with known feature meanings, we explore the interpretation of PAM.

2 Motivation - Reinterpreting ReLU-based DNN

We argue that a plain fully-connected DNN with the ReLU activation function may be interpreted as an attention mechanism. Such a neural network contains $L$ consecutive blocks, each consisting of a fully-connected layer followed by a ReLU activation layer. Let $x^\ell$ be the input of the $\ell$-th block, which outputs the nested function $\mathrm{ReLU}(W^\ell x^\ell + b^\ell)$. $\mathrm{ReLU}(z)$ returns $z$ when activated (if $z \geq 0$) and $0$ otherwise. The $L$ consecutive blocks together create piecewise linear functions (each output in layer $L$ defines such a function). For a given raw input $x$, let $I_\ell$ contain the indices of the activated ReLU functions in layer $\ell$ when $x$ passes through the layer, and let $W_{I_\ell}$ and $b_{I_\ell}$ contain the original weights of $W^\ell$ and $b^\ell$, respectively, for the rows indexed in $I_\ell$ and zero vectors for the rows in the complement set of $I_\ell$ (denoted by $\bar{I}_\ell$). Then, at a layer $\ell$, $W^{(\ell)} x + b^{(\ell)} \geq 0$ where $W^{(\ell)} = \prod_{j=1}^{\ell} W_{I_j}$ and $b^{(\ell)} = \sum_{j=1}^{\ell} \big(\prod_{k=j+1}^{\ell} W_{I_k}\big) b_{I_j}$. More precisely, it means that $x$ stays in a polyhedron that is defined by the intersection of all the half-spaces $\{x : W^{(\ell)} x + b^{(\ell)} \geq 0\}$ and $\{x : W^{(\bar{\ell})} x + b^{(\bar{\ell})} < 0\}$ for all $\ell = 1, \dots, L-1$, where $W^{(\bar{\ell})} = W_{\bar{I}_\ell} \prod_{j=1}^{\ell-1} W_{I_j}$ and $b^{(\bar{\ell})} = b_{\bar{I}_\ell} + \sum_{j=1}^{\ell-1} W_{\bar{I}_\ell} \big(\prod_{k=j+1}^{\ell-1} W_{I_k}\big) b_{I_j}$ specify the affine functions not activated in layer $\ell$. Here, $W_{\bar{I}_\ell}$ and $b_{\bar{I}_\ell}$ contain the original weights of $W^\ell$ and $b^\ell$ for the rows indexed in $\bar{I}_\ell$ and zero vectors for the rows in $I_\ell$. For all $x$ in this polyhedron, the $L$-th layer outputs the affine functions $W^{(L)} x + b^{(L)}$.

Figure 1: An example of a 2-layer ReLU-activated plain DNN with 1 output (the green shading shows the function value). Black lines are fitted in layer 1 (their intersections define the polyhedrons) and the red line is fitted in layer 2 and varies on different polyhedrons.

Precisely, the input space is divided into non-overlapping polyhedrons and each polyhedron is defined by a combination of activated ReLU functions across the layers (a formal proof is given in Appendix A). To ease the notation, we simply use $\Delta$ to denote the set of indices of the activated ReLUs across all layers that identify the polyhedron, and $\bar{\Delta}$ denotes the index set of the inactivated ReLUs in the layers. We use an indicator function $\mathbb{1}(x \in \Delta)$ to define the polyhedron, which returns 1 for all vectors $x$ that satisfy $W^{(\ell)}_\Delta x + b^{(\ell)}_\Delta \geq 0$ and $W^{(\bar{\ell})}_\Delta x + b^{(\bar{\ell})}_\Delta < 0$ for $\ell = 1, \dots, L-1$, and 0 otherwise. The $i$-th activated ReLU in the $L$-th layer corresponds to a piecewise linear function that computes $W^{(L)}_{\Delta,i} x + b^{(L)}_{\Delta,i}$ on each piece $\Delta$. Note that $W^{(L)}_{\Delta,i}$ and $b^{(L)}_{\Delta,i}$ vary for different polyhedrons due to the differences in the ReLU activations in the earlier layers (illustrated in Fig. 1). Denote the hyperplane $\{x : W^{(L)}_{\Delta,i} x + b^{(L)}_{\Delta,i} = 0\}$ by $H_{\Delta,i,L}$, which further splits $\Delta$ into two polyhedrons $\Delta_1 = \{x \in \Delta : W^{(L)}_{\Delta,i} x + b^{(L)}_{\Delta,i} \geq 0\}$ and $\Delta_2 = \{x \in \Delta : W^{(L)}_{\Delta,i} x + b^{(L)}_{\Delta,i} < 0\}$. A small numeric sketch of this activation-pattern view is given below.
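The following sketch (assumed random weights, numpy only; not the authors' code) extracts, for a fixed $x$, the masked weights $W_{I_\ell}, b_{I_\ell}$ of a 2-layer ReLU MLP and verifies that the induced local affine map reproduces the network output on $x$'s polyhedron.

```python
import numpy as np

rng = np.random.default_rng(0)
# A plain 2-layer ReLU MLP with a single output: W1 is 3x2, W2 is 1x3.
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)      # ReLU block 1
    return np.maximum(W2 @ h + b2, 0.0)   # ReLU block 2

def local_affine(x):
    """The affine map W^(2) x + b^(2) that the DNN applies on x's polyhedron."""
    m1 = (W1 @ x + b1 >= 0).astype(float)          # activation pattern I_1
    W1m, b1m = m1[:, None] * W1, m1 * b1           # W_{I_1}, b_{I_1}: zero out inactive rows
    W, b = W2 @ W1m, W2 @ b1m + b2                 # composition through the active rows
    m2 = (W @ x + b >= 0).astype(float)            # activation pattern I_2
    return m2[:, None] * W, m2 * b                 # W^(2), b^(2) for this polyhedron

x = np.array([0.3, -0.7])
W, b = local_affine(x)
assert np.allclose(W @ x + b, forward(x))  # identical to the full network at x
```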
Thus, this piecewise linear function can be written as:
$$\mathbb{1}(x \in \Delta)\,\mathrm{ReLU}\big(W^{(L)}_{\Delta,i} x + b^{(L)}_{\Delta,i}\big) = \mathbb{1}(x \in \Delta)\,\mathrm{ReLU}\!\left(\frac{W^{(L)}_{\Delta,i} x + b^{(L)}_{\Delta,i}}{\|W^{(L)}_{\Delta,i}\|}\right)\|W^{(L)}_{\Delta,i}\| = \underbrace{\mathbb{1}(x \in \Delta_1)\,\mathrm{dist}(x, H_{\Delta,i,L})}_{\text{attention}}\,\underbrace{\|W^{(L)}_{\Delta,i}\|}_{\text{value on } \Delta_1} + \underbrace{\mathbb{1}(x \in \Delta_2)\,\mathrm{dist}(x, H_{\Delta,i,L})}_{\text{attention}}\,\underbrace{0}_{\text{value on } \Delta_2} \tag{2}$$
where $\mathrm{dist}(x, H_{\Delta,i,L})$ is the distance from $x$ to the hyperplane $H_{\Delta,i,L}$. The values on the two pieces $\Delta_1$ and $\Delta_2$ are the constants $\|W^{(L)}_{\Delta,i}\|$ (where $\|\cdot\|$ is the $\ell_2$ vector norm) and $0$, respectively. The attention is a function of $x$ and depends on how far $x$ is from one of the hyperplanes that define the polyhedron boundary (see Fig. 1 for an illustration).

We observe that a polyhedron is defined using a sequence of hyperplanes corresponding to the affine functions in different layers, but the attention of ReLU-activated DNNs is calculated based only on the hyperplane in the last layer for the polyhedron (piece). Although not all of the hyperplanes in earlier layers form the active boundary of a polyhedron (i.e., the polyhedron can lie in the interior of a half-space), using only one hyperplane in the attention is restrictive. An attention mechanism that allows multiple active boundary hyperplanes of a polyhedron to contribute to the attention calculation may increase the model's expression power. Let $\mathcal{H}_\Delta$ contain all of the active boundary hyperplanes $H$ of $\Delta$. For convenient notation, and with mild relaxation, we rescale $W_H$ to $-W_H$ (and $b_H$ to $-b_H$) for the inactivated affine functions in the DNN so the half-spaces can all be written in the form $W_H x + b_H \geq 0$. Then, $\mathbb{1}(x \in \Delta) = \prod_{H \in \mathcal{H}_\Delta} \mathbb{1}(W_H x + b_H \geq 0)$. To learn feature interaction effects, we multiply the distances from $x$ to each hyperplane in $\mathcal{H}_\Delta$. Given that a hyperplane is linear in terms of $x$, multiplying the distances from $x$ to two ($m$) hyperplanes creates quadratic ($m$-th order) terms. Thus, the number of active boundary hyperplanes of $\Delta$ offers an upper bound on the order of the multiplicative terms if each hyperplane is used only once.

Figure 3: Overview of the proposed method: (A) Split the input space into overlapping polyhedrons with the oblique tree. (B) Calculate attention on each polyhedron based on distances to the polyhedron's boundaries. (C) Calculate the model's output based on the attention and value vector. (D) Extract main and interaction effects from the model.

To allow the order to be adaptive to each $\Delta$ within the upper limit, we further enclose a bounding constraint on the distance (other strategies may exist, which we leave for future investigation). Thus, the new attention formula can be written as follows:
$$a_\Delta(x) = \mathbb{1}(x \in \Delta) \prod_{H \in \mathcal{H}_\Delta} \min\!\left(\mathrm{dist}(x, H), \frac{2U_H}{\|W_H\|}\right) = \prod_{H \in \mathcal{H}_\Delta} \mathbb{1}(W_H x + b_H \geq 0)\, \min\!\left(\mathrm{dist}(x, H), \frac{2U_H}{\|W_H\|}\right) = \prod_{H \in \mathcal{H}_\Delta} \max\!\left(\min\!\left(\frac{W_H x + b_H}{\|W_H\|}, \frac{2U_H}{\|W_H\|}\right), 0\right), \tag{3}$$
where $U_H$ determines an upper bound on the distance from $x$ to $H$ and is a trainable parameter. Fig. 2 demonstrates how adding the upper bounds $U_H$ allows the attention module to learn interaction effects of adaptive orders. For example, the instances in the area marked by 0 in the figure (left) are farther from each boundary hyperplane $H$ than their respective upper bounds, so the min operator returns a constant $\frac{2U_H}{\|W_H\|}$. These $x$ instances thus receive a constant attention that is the multiplication of these upper bounds, one for each $H \in \mathcal{H}_\Delta$, which gives a 0-order interaction.
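A minimal sketch (hypothetical hyperplanes; not the released implementation) of the attention in Eq. 3: each active boundary hyperplane contributes a distance term clipped at $2U_H/\|W_H\|$, and the clipped terms multiply, so the local interaction order is at most the number of unclipped terms.

```python
import numpy as np

def pam_attention(x, Ws, bs, Us):
    """Eq. 3: product over boundary hyperplanes of max(min((Wx+b)/||W||, 2U/||W||), 0)."""
    a = 1.0
    for W, b, U in zip(Ws, bs, Us):
        norm = np.linalg.norm(W)
        term = np.clip((W @ x + b) / norm, 0.0, 2.0 * U / norm)
        a *= term   # term == 0 outside the half-space, so a == 0 outside the polyhedron
    return a

# A triangular polyhedron in R^2 (three half-spaces W x + b >= 0), as in Fig. 2.
Ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
bs = [0.0, 0.0, 1.0]

# Small U: the clip saturates except near the boundary, so the attention is
# constant (0-order) deep inside and k-way interactive only near k boundaries.
print(pam_attention(np.array([0.4, 0.4]), Ws, bs, Us=[0.05, 0.05, 0.05]))
# Larger U: more of the polyhedron sees unclipped distances -> higher-order terms.
print(pam_attention(np.array([0.4, 0.4]), Ws, bs, Us=[0.5, 0.5, 0.5]))
```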
When $x$ is close to two of the hyperplanes (in the areas labeled by 2), two distance terms are used in Eq. 3, defining two-way interactions.

Figure 2: For a triangular polyhedron in this example, the attention $a_\Delta$ can capture constant, additive, 2-way interactive, and 3-way interactive effects in the areas marked by 0, 1, 2, and 3, respectively, by adjusting the upper bound parameter $U$, which is smaller in the left figure than in the right.

Early methods pre-specify a set of polynomial terms and use them as input features to a DNN [23], which limits the search space. In another line of methods [20], each feature is associated with an order parameter $m_i$ and products of the form $\prod_i x_i^{m_i}$ are used as input (rather than the raw features $x_i$) to a ReLU-activated DNN. The DNN is also required to learn the proper values of $m_i$. This is an excellent approach to identify useful interactions, but once these $m_i$'s are determined, the orders of feature interaction are fixed and cannot adapt to different pieces, because the piecewise linear model (linear in terms of the input products) learned by the DNN does not change interaction orders. Unlike these methods, our attention module automatically identifies interactions (polynomial terms) of appropriate orders according to the topology of a piece (particularly, using the number of active boundary hyperplanes of the polyhedron). Using the introduced parameters $U_H$, we can learn the appropriate order of the interactive terms within a polyhedron by adjusting $U_H$ (larger values of $U_H$ lead to higher orders of feature interaction). Our theoretical analysis (Section 5) shows that a DNN incorporating our attention module can approximate any function represented by a ReLU-activated DNN at an arbitrarily accurate level with fewer model parameters.

3 The Proposed Polyhedron Attention Module (PAM)

This section elaborates on PAM with the attention score defined in Eq. 3. With input $x \in \mathbb{R}^p$, a DNN using PAM defines a function $f_{PAM}$:
$$f_{PAM}(x) = V(x; \theta_G) + \sum_{\Delta} a_\Delta(x)\, V(x; \theta_\Delta), \tag{4}$$
where $V(x; \theta_G)$ is a global value function with trainable parameters $\theta_G$, and $V(x; \theta_\Delta)$ is the local value function on a piece $\Delta$ with trainable parameters $\theta_\Delta$. We set both the global and local value functions to be affine functions, so $\theta_\Delta$ contains the normal vector $W_\Delta$ and offset $b_\Delta$. Sec. 3.2 explains why affine functions are our choice. Instead of forming polyhedrons by intersecting hyperplanes as done in ReLU-activated DNNs, we use a tree search to partition the input space into overlapping polyhedrons. Note that the potential set of partitions created by tree search is a superset of that created by hyperplane intersections (see Appendix B). We introduce how we generate polyhedrons and then discuss the attention and value functions.

3.1 Generating polyhedrons via oblique tree

Let $S_\Delta$ contain all the polyhedrons needed to form a partition of the input space. We adopt the oblique tree to generate $S_\Delta$. An oblique tree is a binary tree where each node splits the space by a hyperplane rather than by thresholding a single feature. The tree starts with the root holding the full input space $S$, and by recursively splitting $S$, the tree grows deeper. For a $D$-depth ($D \geq 3$) binary tree, there are $2^{D-1} - 1$ internal nodes and $2^{D-1}$ leaf nodes. As shown in Fig. 3A, each internal and leaf node maintains a sub-space representing a polyhedron in $S_\Delta$, and each layer of the tree corresponds to a partition of the input space into polyhedrons. Denote the polyhedron defined in node $n$ by $\Delta_n$, and the left and right child nodes of $n$ by $n_L$ and $n_R$.
Unlike classic oblique trees that partition $S$ into non-overlapping sub-spaces, we perform a soft partition that splits each $\Delta_n$ into $\Delta_{n_L}$ and $\Delta_{n_R}$ with an overlapping buffer. Let the splitting hyperplane be $\{x \in \mathbb{R}^p : W_n x + b_n = 0\}$. Then the two sub-spaces $\Delta_{n_L}$ and $\Delta_{n_R}$ are defined as follows:
$$\Delta_{n_L} = \{x \in \Delta_n \mid W_n x + b_n \geq -U_n\}, \qquad \Delta_{n_R} = \{x \in \Delta_n \mid -W_n x - b_n \geq -U_n\}, \tag{5}$$
where $U_n$ indicates the width of the overlapping buffer. Eq. 5 shows that the instances satisfying $|W_n x + b_n| < U_n$ belong to the buffer in both $\Delta_{n_L}$ and $\Delta_{n_R}$. This buffer creates a symmetric band around the splitting hyperplane. Let $P_n$ be the node set containing all the ancestor nodes above $n$. We can group the nodes in $P_n$ into two subsets $P^l_n$ and $P^r_n$ according to whether $n$ appears in the left or right subtree of those ancestors. Let $i$ index the nodes in $P^l_n$ and $j$ index the nodes in $P^r_n$. Then for any node $n$ except the root node, $\Delta_n$ can be expressed as the intersection of half-spaces:
$$\Delta_n = \{x \in \mathbb{R}^p : W_i x + b_i \geq -U_i \ \forall i \in P^l_n, \ \text{and} \ -W_j x - b_j \geq -U_j \ \forall j \in P^r_n\}. \tag{6}$$
Based on the polyhedron defined by Eq. 6, the attention $a_{\Delta_n}(x)$ (which we refer to as $a_n(x)$ to simplify notation) can be rewritten as
$$a_n(x) = \prod_{i \in P^l_n} \max\!\left(\min\!\left(\frac{W_i x + b_i + U_i}{\|W_i\|}, \frac{2U_i}{\|W_i\|}\right), 0\right) \prod_{j \in P^r_n} \max\!\left(\min\!\left(\frac{-W_j x - b_j + U_j}{\|W_j\|}, \frac{2U_j}{\|W_j\|}\right), 0\right), \tag{7}$$
where we use the buffer width $U_i$ to bound the corresponding distance term from above by $\frac{2U_i}{\|W_i\|}$. Let $N_d$ consist of all the nodes at the $d$-th layer of the tree, $d \in \{1, 2, ..., D\}$. The nodes in $N_d$ altogether specify a soft partition of the input space that maps an instance $x$ to one sub-space or to an intersection of overlapping sub-spaces among the $|N_d|$ sub-spaces. Rather than merely utilizing the instance partitioning map defined by the last layer of the tree ($2^{D-1}$ polyhedrons), we allow PAM to leverage polyhedrons generated at all layers in $f_{PAM}$ with $S_\Delta = \bigcup_{d=2}^{D} \{\Delta_n \mid n \in N_d\}$, which gives $2^D - 2$ polyhedrons.

3.2 Learning the attention and value functions

Each internal node $n$ in the oblique tree needs to learn two affine functions: a splitting function used to form a hyperplane that splits the sub-space into child nodes, and a value function. Because nested affine functions still produce affine functions, the set of affine functions is closed under linear transformations. Thus, we can use the value function to absorb the denominators $\|W_i\|$ from the attention in Eq. 7. In other words, for any learned $\theta_n$, we can find a $\theta'_n$ (by rescaling $\theta_n$ with those $\frac{1}{\|W_i\|}$) such that $a_n(x) V_n(x, \theta_n) = a'_n(x) V_n(x, \theta'_n)$, where $a'_n(x) = \prod_{i \in P^l_n} \max(\min(W_i x + b_i + U_i, 2U_i), 0) \prod_{j \in P^r_n} \max(\min(-W_j x - b_j + U_j, 2U_j), 0)$ and we use the subscript $n$ to denote the polyhedron represented by node $n$. We thus directly set off to learn $\theta'_n$ and use $a'_n$ as $a_n$. More importantly, with affine value functions, we can derive another property. For any internal node $n$, the attention of its two child nodes contains $a_n$ as a factor, so $a_{n_L}(x) = a_n(x)\max(\min(W_n x + b_n + U_n, 2U_n), 0)$ and $a_{n_R}(x) = a_n(x)\max(\min(-W_n x - b_n + U_n, 2U_n), 0)$. This gives rise to the observation that, no matter where $x$ is located (inside the buffer or outside), $a_{n_L}(x) + a_{n_R}(x) = 2U_n a_n(x)$. Thus, we can substitute $a_{n_L}(x)$ in the model $f_{PAM}$ by $2U_n a_n(x) - a_{n_R}(x)$. Then, $a_n(x) V(x, \theta_n) + a_{n_L}(x) V(x, \theta_{n_L}) + a_{n_R}(x) V(x, \theta_{n_R}) = a_n(x)\big(V(x, \theta_n) + 2U_n V(x, \theta_{n_L})\big) + a_{n_R}(x)\big(V(x, \theta_{n_R}) - V(x, \theta_{n_L})\big)$, which can be written as $a_n(x) V(x, \theta') + a_{n_R}(x) V(x, \theta'')$ for some parameters $\theta'$ and $\theta''$ due to the closure of affine functions under linear transformations. Hence, once again, we directly learn $\theta'$ and $\theta''$ in our model training process.
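The path-product form of the attention (Eq. 7 after absorbing the norms as in Eq. 9) and the identity $a_{n_L}(x) + a_{n_R}(x) = 2U_n a_n(x)$ used above can be sketched as follows (hypothetical splits; a toy check, not the authors' code).

```python
import numpy as np

def gate(z, U):
    """One factor of Eq. 9: max(min(z + U, 2U), 0)."""
    return np.clip(z + U, 0.0, 2.0 * U)

def attention(x, path):
    """a_n(x): product over ancestors (W, b, U, side) on the path from the root to n."""
    a = 1.0
    for W, b, U, side in path:          # side = +1 if n is in the left subtree, -1 if right
        a *= gate(side * (W @ x + b), U)
    return a

rng = np.random.default_rng(1)
W1, b1, U1 = rng.normal(size=2), 0.2, 0.5   # root split
W2, b2, U2 = rng.normal(size=2), -0.1, 0.3  # split at the root's left child n

x = rng.normal(size=2)
a_n  = attention(x, [(W1, b1, U1, +1)])                      # node n (left child of root)
a_nL = attention(x, [(W1, b1, U1, +1), (W2, b2, U2, +1)])    # n's left child
a_nR = attention(x, [(W1, b1, U1, +1), (W2, b2, U2, -1)])    # n's right child
assert np.isclose(a_nL + a_nR, 2 * U2 * a_n)                 # the identity behind Theorem 1
```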
Note that we can recursively apply the above substitution to all internal nodes $n$, so we can reduce the number of polyhedrons in $S_\Delta$ by half. The following theorem characterizes the above discussion (proof in Appendix C).

Theorem 1 If all value functions $V$ belong to a function set that is closed under linear transformations, then the function learned by PAM, $f_{PAM}$, can be equivalently written as
$$f_{PAM}(x) = V(x, \theta_G) + \sum_{n \in S'_\Delta} a_n(x)\, V(x, \theta_n), \tag{8}$$
where the polyhedron set $S'_\Delta$ contains half of the polyhedrons (e.g., those of the right child nodes or of the left child nodes) in $S_\Delta$ and
$$a_n(x) = \prod_{i \in P^l_n} \max(\min(W_i x + b_i + U_i,\, 2U_i),\, 0) \prod_{j \in P^r_n} \max(\min(-W_j x - b_j + U_j,\, 2U_j),\, 0). \tag{9}$$

Remark 1 Although we include polyhedrons identified by the internal nodes in our calculation, $f_{PAM}$ only needs to learn $\frac{1}{2}(2^D - 2) = 2^{D-1} - 1$ value functions, which is on the same scale as using only the leaf nodes in $f_{PAM}$.

Optimization of PAM. The output of $f_{PAM}(x)$ can be treated as a prediction of $x$'s label or as an embedding of $x$. PAM can be used as a constituent component in a DNN to approximate a target $y = f(f_{PAM}(x))$. For a classification task, we can calculate the conditional distribution $\hat{y} = \Pr(y|x) = \mathrm{softmax}(f(f_{PAM}(x)))$ and optimize the cross-entropy $L_{CE} = \mathbb{E}_{(x,y) \sim D}\,[-y \log \hat{y} - (1-y) \log(1-\hat{y})]$ between the observed $y$ and the estimated $\hat{y}$ to determine the parameters in PAM. For a regression task, the mean squared loss $L_{MSE} = \mathbb{E}_{(x,y) \sim D}\,(y - f(f_{PAM}(x)))^2$ can be used.

4 Model Interpretation

We propose a conceptual framework to quantify and interpret the interaction effects learned by PAM. Without loss of generality, we assume that the DNN has a single output. Additional outputs can be interpreted similarly using the derived algorithm. The $f_{PAM}$ can be rewritten as a summation of $k$-way interaction terms for all possible values of $k \leq D$: $f_{PAM}(x) = \sum_{I \subseteq \{1, 2, ..., p\}} \phi_I(x)$, where $\phi_I(x)$ captures the total contribution to the output from the $|I|$-way interactions among the features indexed in $I$. If $I \neq \emptyset$, $\phi_I(x) = \sum_{\sum_i m_i \leq D} w_I \prod_{i \in I} x_i^{m_i}$, where $x_i$ represents the $i$th feature of $x$, $m_i$ is the power of $x_i$ in the interaction, and the $w_I$'s are constant coefficients in front of the corresponding interaction terms. Given the definition of our attention in Eq. 9, the highest polynomial order of the attention is $D - 1$; together with the affine value function, the highest polynomial order of $f_{PAM}$ is $D$, so $\sum_i m_i$ cannot exceed the depth of the tree. If $I = \emptyset$, $\phi_I = w_\emptyset$, which is a constant. We develop a method here to estimate the contribution values $\phi_I$ in $f_{PAM} = \sum_{I \subseteq \{1, 2, ..., p\}} \phi_I$ for a fixed input $x$ without computing the individual polynomial terms.

Algorithm 1: Obtain $\phi_I$ for an input $x$
1: Input: input $x$, $g(x)$, and an $I \subseteq \{1, 2, ..., p\}$
2: Output: $\phi_I$
3: Set $0_{\bar{I}}$ to be a $p$-length vector with 1's in the positions indexed by $I$ and 0's elsewhere
4: Calculate the following function (note that $\phi_{I'}(x)$ is also calculated via the same recursive function):
$$\phi_I(x) = \begin{cases} g(0_{\bar{I}} \odot x) - \sum_{I' \subsetneq I} \phi_{I'}(x), & I \neq \emptyset, \\ g(\mathbf{0}), & I = \emptyset, \end{cases}$$
where $\odot$ is the Hadamard product operator.

For a given $x$, $f_{PAM}$ can be written out explicitly as $g(x)$ according to which polyhedron(s) $x$ belongs to. To determine $g(x)$, we pass $x$ through the model in Eq. 8 and evaluate every term in Eq. 9. For instance, for the first product in Eq. 9, if $x$ makes $W_i x + b_i \geq U_i$, we replace the term $\max(\min(W_i x + b_i + U_i, 2U_i), 0)$ by $2U_i$; if $W_i x + b_i \leq -U_i$, we replace it by 0; otherwise, we use $W_i x + b_i + U_i$. Once the $g$ function is computed, we can use Algorithm 1 to evaluate the contributions $\phi_I$, $I \subseteq \{1, \dots, p\}$. A simple worked example demonstrating Algorithm 1 is given after the following sketch.
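Below is a minimal sketch of Algorithm 1 (not the released implementation) applied to a toy piece-specific polynomial $g$; it also checks that the contributions sum back to $g(x)$, as Theorem 2 below guarantees for $f_{PAM}$.

```python
import numpy as np
from itertools import combinations

def phi(I, x, g, cache=None):
    """Algorithm 1: contribution of the interaction among features indexed by I (0-based)."""
    cache = {} if cache is None else cache
    key = tuple(sorted(I))
    if key in cache:
        return cache[key]
    if not I:
        val = g(np.zeros_like(x))                      # constant term
    else:
        mask = np.zeros_like(x)
        mask[list(I)] = 1.0                            # keep only the features in I
        val = g(mask * x)
        for k in range(len(I)):                        # subtract all strict subsets I' of I
            for sub in combinations(key, k):
                val -= phi(sub, x, g, cache)
    cache[key] = val
    return val

# Toy piece-specific polynomial: g(x) = 2 + x0 + 3*x1*x2 + x0*x1*x2
g = lambda x: 2.0 + x[0] + 3.0 * x[1] * x[2] + x[0] * x[1] * x[2]
x = np.array([1.5, -0.5, 2.0])
contrib = {I: phi(I, x, g) for k in range(4) for I in combinations(range(3), k)}
assert np.isclose(sum(contrib.values()), g(x))         # the phi's sum back to g(x)
print(contrib[(1, 2)])                                 # 3*x1*x2 = -3.0 (a 2-way interaction)
```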
Let $I = \{1, 2\}$, which means we calculate the sum of those cross terms that involve exactly $x_1$ and $x_2$. We set all other elements of $x$ to 0 and calculate $g(0_{\bar{I}} \odot x)$ to obtain a value $v$ that adds up the terms involving only $x_1$ and $x_2$. We then additionally set either $x_1 = 0$ or $x_2 = 0$, or $x_1 = x_2 = 0$ (i.e., make all elements 0), re-compute $g$ to estimate the linear terms of either $x_1$ or $x_2$, or the constant term in $g$, and subtract these values from $v$ to obtain $\phi_I$. The following theorem characterizes our results.

Theorem 2 For any input $x$, by calculating $\phi_I(x)$ for each $I \subseteq \{1, 2, ..., p\}$ via Algorithm 1, we have $\sum_{I \subseteq \{1, 2, ..., p\}} \phi_I(x) = f_{PAM}(x)$.

The instance-level explanation of $f_{PAM}$ can be obtained by examining the magnitude of $\phi_I$, which reflects the impact of the feature interaction among the features in $I$. If $\phi_I(x) > 0$, the interaction increases the predicted value; if $\phi_I(x) < 0$, it reduces the output. The model-level interpretation can be approximated by computing the mean absolute value of $\phi_I$ across all sample instances.

5 Theoretical Justification - Approximation Theorems

We examine whether using PAM can enhance the expression power for universal approximation. We first introduce the Sobolev space, which characterizes a space of functions satisfying specific smoothness properties - Lipschitz continuity up to order $n$ - and is formally defined as:

Definition 2 (Sobolev space) Let $W^{n,\infty}([0, 1]^p)$ be the Sobolev space which comprises functions on $[0, 1]^p$ lying in $L^\infty$ along with their weak derivatives up to order $n$. The norm of a function $f$ in $W^{n,\infty}([0, 1]^p)$ is
$$\|f\|_{W^{n,\infty}([0,1]^p)} = \max_{\mathbf{n}: |\mathbf{n}| \leq n}\ \operatorname*{ess\,sup}_{x \in [0,1]^p} |D^{\mathbf{n}} f(x)|,$$
where $\mathbf{n} = (n_1, n_2, ..., n_p) \in \{0, 1, ..., n\}^p$, $|\mathbf{n}| = n_1 + n_2 + ... + n_p \leq n$, and $D^{\mathbf{n}} f$ is the $\mathbf{n}$-order weak derivative. The essential supremum $\operatorname{ess\,sup}_{x \in E}\, g(x) = \inf\{M \in \mathbb{R} : \mu(\{x \in E : g(x) > M\}) = 0\}$ captures the smallest value that the function $g$ can approach or exceed on a set $E$, except on a negligible subset of points with respect to the measure $\mu$.

Essentially, the space $W^{n,\infty}([0, 1]^p)$ is $C^{n-1}([0, 1]^p)$ whose functions' derivatives up to order $n - 1$ are Lipschitz continuous. The following assumption is commonly used in the discussion of DNNs. Without loss of generality, it narrows our focus to a normalized Sobolev sphere, constraining the functions within the sphere to have Sobolev norm no greater than 1.

Assumption 1 Let $F_{n,p}$ be the set of functions lying in the unit ball of $W^{n,\infty}([0, 1]^p)$: $F_{n,p} = \{f \in W^{n,\infty}([0, 1]^p) : \|f\|_{W^{n,\infty}([0, 1]^p)} \leq 1\}$.

This assumption is sufficient for our analysis, as functions encountered in real-world learning tasks can typically be linearly transformed into $W^{n,\infty}([0, 1]^p)$, as shown in previous studies [29]. This allows us to analyze the error bounds for terms in the polynomial approximation after performing Taylor expansion. Theorem 3 demonstrates that our model can almost surely approximate any ReLU-activated DNN model without error. All proofs can be found in Appendices E and F.

Theorem 3 If $x$ is bounded and sampled from a distribution with an upper-bounded probability density function, then for any ReLU-activated plain DNN model $f_{DNN}(x)$, there exists a PAM with $\Pr\big(f_{PAM}(x) = f_{DNN}(x)\big) = 1$.

Theorem 4 examines parameter efficiency, and demonstrates that networks incorporating the proposed polyhedron attention require fewer parameters than those relying solely on ReLU activation, while maintaining the same approximation error in fitting functions in the Sobolev space.
Theorem 4 For any $p, n > 0$ and $\epsilon \in (0, 1)$, there exists a PAM that approximates any function from $F_{n,p}$ with an error bound $\epsilon$ in the sense of $L^\infty$ using at most $2p^n(N + 1)^p(p + n - 1)$ parameters, where $N = \big\lceil \big(\frac{n!}{2^p p^n}\,\epsilon\big)^{-1/n} \big\rceil$.

Remark 2 For comparison, a ReLU-activated plain DNN needs $p^n(N + 1)^p(p + 1)^n\, O(\log(1/\epsilon))$ parameters under the same setting as Theorem 4 [30, 29]. It is worth noting that extensive research has been conducted on the approximation theory of DNNs with ReLU activation, often concerning common function classes in specific function spaces such as Besov [31, 32] and Sobolev spaces [33, 34]. These analyses reveal a notable influence of the smoothness of the target function on the resulting approximation errors. Since Sobolev spaces characterize the smoothness properties of functions, we can investigate the ability of neural networks with ReLU activation to approximate functions with different degrees of regularity and smoothness. These theorems highlight the expressivity of our model and provide theoretical insights into the parameter efficiency of our proposed attention module in neural network architectures.

6 Empirical Evaluation

We evaluate the effectiveness and efficiency of PAM on three large-scale datasets: the Criteo¹ and Avazu¹ click-through-rate (CTR) datasets, and the UK Biobank² medical database. We conduct an analysis of the hyperparameters of PAM, and perform ablation studies by individually removing each of the three key components of PAM and evaluating the performance variations. Given the lack of known feature meanings in the CTR benchmark datasets, we use the UK Biobank dataset as an example for studying model interpretation. Specifically, we validate the interaction effects captured by our interpretation framework, as detailed in Sec. 4, in the prediction of brain age from the grey matter volumes of distinct brain regions. For implementation details and the computation and space complexity of our model, please refer to Appendix G.

6.1 Experimental Setup

Datasets. Both Criteo and Avazu are massive industry datasets containing feature values and click feedback for display ads, and are processed following the benchmark protocol in BARS [35, 36]. The UK Biobank serves as a comprehensive biomedical database and research resource, offering extensive genetic and health-related information; our objective is to predict participants' age by leveraging the grey matter volumes from 139 distinct brain regions. The summary statistics are listed in Table 1.

Table 1: Statistics of the datasets.

| Dataset | #Train | #Valid | #Test | #Features |
| --- | --- | --- | --- | --- |
| Criteo | 33M | 8.3M | 4.6M | 2.1M |
| Avazu | 28.3M | 4M | 8.1M | 1.5M |
| UK Biobank | 31.8K | 4K | 4K | 139 |

Evaluation metrics. Criteo and Avazu involve binary classification tasks, evaluated using the Area Under the ROC Curve (AUC). For brain-age prediction, a regression problem, we assess performance using the R2 score. Baseline Algorithms. We compare PAM against the top 5 algorithms from a pool of 33 baseline methods, selected based on the overall AUC scores in BARS, including DESTINE [19] (currently the best performer on Avazu), DCN [37], AOANet [4], EDCN [27] (best on Criteo), and DCN-V2 [25]. Given that our model and the selected algorithms are DNN-based, we also include a well-tuned DNN¹ as a strong baseline for our model comparison.
It is worth noting that several state-of-the-art studies on CTR datasets such as Criteo and Avazu, from Google [2, 24, 37], Microsoft [38] and Huawei [24], have recognized that even a marginal AUC improvement at the level of 1‰ is considered a significant performance enhancement.

6.2 Effectiveness and efficiency evaluation

Performance Comparison. We compare the results over five runs. As shown in Fig. 4a, PAM achieves the highest AUC and R2 scores amongst all benchmark algorithms. PAM outperforms the second-best model by 1.4‰ for Criteo, 0.6‰ for Avazu, and 2.2‰ for UK Biobank, which is much larger than the improvement of the second-best model over the next model (0.4‰ for Criteo, 0.16‰ for Avazu and 0.6‰ for UK Biobank). It is also worth noting that, despite being well tuned, none of the benchmark algorithms performs optimally across all three datasets simultaneously. We evaluate the computation time of PAM by reporting the training and inference runtime (seconds) per batch together with the number of trainable parameters (#Par). As shown in Table 2, almost all models have a similar number of free parameters, and their runtime per batch is comparable across methods (PAM is faster than half of the methods and slower than the other half).

¹ https://github.com/openbenchmark/BARS/tree/master; ² https://www.ukbiobank.ac.uk/

Figure 4: Experimental results. Error bars represent the means and standard deviations of AUC for classification and R2 for regression. a) AUC and R2 scores of PAM and comparison methods. b) Hyper-parameter analysis of PAM. U: the initial value of Un. D: the depth of the oblique tree in PAM. c) Ablation studies of PAM.

Table 2: The number of parameters (#Par) and seconds per batch (Sec/Batch) during the training and inference of PAM and baseline models.

| Model | Criteo #Par | Criteo Train s/batch | Criteo Infer s/batch | Avazu #Par | Avazu Train s/batch | Avazu Infer s/batch | UK Biobank #Par | UK Biobank Train s/batch | UK Biobank Infer s/batch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PAM | 22.3M | 0.102 | 0.0330 | 14.1M | 0.090 | 0.0263 | 315K | 0.083 | 0.0018 |
| DNN | 21.8M | 0.046 | 0.0139 | 13.8M | 0.038 | 0.0129 | 380K | 0.056 | 0.0063 |
| DESTINE | 21.5M | 0.130 | 0.0344 | 13.6M | 0.090 | 0.0188 | 384K | 0.072 | 0.0069 |
| DCN | 21.3M | 0.044 | 0.0114 | 13.4M | 0.080 | 0.0099 | 381K | 0.069 | 0.0064 |
| DCN-V2 | 22.0M | 0.103 | 0.0128 | 13.8M | 0.091 | 0.0137 | 458K | 0.059 | 0.0067 |
| AOANet | 21.4M | 0.151 | 0.0314 | 13.4M | 0.338 | 0.0168 | 457K | 0.066 | 0.0067 |
| EDCN | 21.5M | 0.066 | 0.0119 | 13.1M | 0.048 | 0.0113 | 63K | 0.072 | 0.0071 |

Hyper-parameter Study. Although the Ui's are trainable parameters in PAM, their initialization may affect how well PAM works. We therefore study the effects of the initial values of the Ui's by testing choices from 1 to 3 with a step size of 0.5. We also examine the effect of the tree depth D on PAM's performance. Fig. 4b shows that the initial values of Un do not substantially change PAM's performance, which indicates that Un can be trained well during PAM optimization. As for the tree depth D, the performance of PAM initially improves as D increases, but when D becomes large, the performance may deteriorate. An appropriate value of D helps PAM learn more domain knowledge, but if D is too large, highly complex interactions may not necessarily lead to better prediction due to issues such as overfitting.

Ablation Study of PAM. The following specifies the three components for our ablation experiments. 1) PAM without overlapping polyhedrons (PAM w/o OP). As shown in Eqs. 5, 6 and 7, overlapping polyhedrons are generated in PAM to fit the target.
To demonstrate the importance of soft splitting, we instead calculate $a_n(x)$ as $\prod_{i \in P^l_n} \max(\min(W_i x + b_i, 2U_i), 0) \prod_{j \in P^r_n} \max(\min(-W_j x - b_j, 2U_j), 0)$. 2) PAM without adaptively learning interactions (PAM w/o ALI). As shown in Fig. 2, the upper bound $2U_i$ enables PAM to learn interactions of different orders in different polyhedrons. To examine the effectiveness of the upper bound, we remove the bound $2U_i$ from Eq. 9. 3) PAM without removing $\|W_i\|$ (PAM w/o RW). According to Theorem 1, the denominator of Eq. 7 can be removed without reducing the expression capability of PAM. To examine the correctness of this claim, we use PAM without removing $\|W_i\|$ from the denominator.

Fig. 4c clearly demonstrates a notable decline in the performance of PAM when key components are removed. In particular, the standard PAM significantly outperforms PAM w/o OP, confirming the critical role of overlapping polyhedrons in enhancing PAM's performance. The removal of ALI decreases the AUC and R2 of PAM as well, indicating the significance of combining low-way interaction effects with high-way ones to fit the target. Moreover, PAM w/o RW shows that removing $\|W_i\|$ from the denominator in Eq. 7 improves PAM's performance. Although, according to Theorem 1, PAM w/o RW has the same expressive capability as standard PAM, the inclusion of the norm $\|W\|$ in the denominator of Eq. 7 may result in unstable gradient variances, potentially compromising the performance of the optimizer.

6.3 Identified interaction effects among neural markers in predicting brain age

Although the interpretation framework was developed for PAM, it can be applied to those DNN architectures whose functions are piecewise polynomials (see Appendix D). After carefully checking all baseline methods, we found that our interpretation framework could be used to extract interactions from DCN and AOANet.

Figure 5: Top 5 main and 2-way interactions identified by PAM (Figs. a and b), DCN (Figs. c and d) and AOANet (Figs. e and f) by sorting the mean absolute values of each ϕI averaged over all participants. For each effect, points (each corresponding to a participant) are distributed according to the ϕ values calculated by Algorithm 1 in the beeswarm bars. L: left; R: right; CR: crus II cerebellum; FOC: frontal orbital cortex; FG: frontal gyrus; PP: planum polare; OCSD: lateral occipital cortex, superior division; FP: frontal pole; LOC: lateral occipital cortex; PG: precentral gyrus; T: thalamus; SC: subcallosal cortex.

Fig. 5 presents the top five main effects (Figs. a, c and e) and two-way interaction effects (Figs. b, d and f) between brain regions for brain-age prediction, as determined by the trained PAM, DCN and AOANet. According to Figs. 5a and 5b, PAM found that the grey matter volumes (GMV) of both the left and right frontal poles (FP) play significant main effects as individual features, which is well aligned with previous research findings [39, 40, 41]. The other three brain regions, the lateral occipital cortex (LOC), precentral gyrus (PG) and thalamus (T), are also discussed in earlier studies [42, 43, 44]. Existing aging studies primarily focus on main effects of GMV, but the top five two-way GMV interactions of the identified brain regions have been acknowledged in works such as [42]. As shown in Figs. 5c, 5d, 5e and 5f, the main effects of PG and T identified by PAM were also found by DCN and AOANet, respectively, and all three algorithms found the two-way interaction between the left and right FP.
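As a hypothetical illustration of how the model-level ranking in Fig. 5 can be produced (the region names and numbers below are made up, not UK Biobank results): average $|\phi_I|$ from Algorithm 1 over participants and sort.

```python
import numpy as np

# phi_values[I] holds per-participant contributions phi_I(x) from Algorithm 1;
# hypothetical data standing in for the brain-age model outputs.
phi_values = {
    ("FP_L",): np.array([2.1, 1.7, 2.4]),
    ("FP_R",): np.array([1.9, 2.0, 1.5]),
    ("FP_L", "FP_R"): np.array([-0.9, -1.2, -0.7]),
    ("LOC_L", "PG_R"): np.array([0.2, -0.1, 0.3]),
}

# Model-level importance: mean absolute contribution across all sample instances.
importance = {I: np.mean(np.abs(v)) for I, v in phi_values.items()}
top = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:5]
for I, score in top:
    kind = "main effect" if len(I) == 1 else f"{len(I)}-way interaction"
    print(f"{' x '.join(I):<16s} {kind:<18s} mean |phi| = {score:.2f}")
```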
In addition to the top-5 brain regions shared with AOANet and DCN, PAM additionally identified the main effects of the LOC and FP and two-way interaction effects related to the insular and subcallosal cortex (SC) regions.

7 Conclusions and Future Work

We propose a novel feature interaction learning module, the Polyhedron Attention Module (PAM), to fit a target with adaptive-order interaction effects. PAM provides a self-attention mechanism that partitions the input space into overlapping polyhedrons and learns the boundary hyperplanes for each polyhedron automatically. These hyperplanes multiply to identify interaction effects specific to individual polyhedrons. Under this mechanism, PAM automatically captures both simple and complex interaction effects. In our theoretical analysis, we show that PAM can enhance the model's expression power for universal approximation. Experimental results demonstrate that the proposed model achieves better performance on massive datasets than the state of the art. In the future, we will study how to dynamically prune PAM's oblique tree during the training process to regularize the model. It will also be interesting to investigate how to define boundaries with non-linear functions.

Acknowledgments

This research has been conducted using the UK Biobank Resource under Application 51296, "Machine Learning Analysis of Brain Images, Behaviors, and Genotypes to Understand Mental Disorders", and was supported in part by NIH grant K02-DA043063 and NSF award #2122309.

References

[1] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1-9, 2014.

[2] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7-10, 2016.

[3] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191-198, 2016.

[4] Lang Lang, Zhenlong Zhu, Xuanye Liu, Jianxin Zhao, Jixing Xu, and Minghui Shan. Architecture and operation adaptive network for online recommendations. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 3139-3149, 2021.

[5] Bing Wang, Steven M Smith, and Jiayang Li. Genetic regulation of shoot architecture. Annual Review of Plant Biology, 69:437-468, 2018.

[6] Sumanta Basu, Karl Kumbier, James B Brown, and Bin Yu. Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences, 115(8):1943-1948, 2018.

[7] Shan Yu, Hongdian Yang, Hiroyuki Nakahara, Gustavo S Santos, Danko Nikolić, and Dietmar Plenz. Higher-order interactions characterized in cortical activity. Journal of Neuroscience, 31(48):17514-17526, 2011.

[8] Baiying Lei, Shuangzhi Yu, Xin Zhao, Alejandro F Frangi, Ee-Leng Tan, Ahmed Elazab, Tianfu Wang, and Shuqiang Wang. Diagnosis of early Alzheimer's disease based on dynamic high order networks. Brain Imaging and Behavior, 15(1):276-287, 2021.
[9] Michael Tsang, Dehua Cheng, Hanpeng Liu, Xue Feng, Eric Zhou, and Yan Liu. Feature interaction interpretability: A case for explaining ad-recommendation systems via neural interaction detection. arXiv preprint arXiv:2006.10966, 2020.

[10] Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. Annals of Statistics, 41(3):1111, 2013.

[11] Gregory Vaughan, Robert Aseltine, Kun Chen, and Jun Yan. Efficient interaction selection for clustered data via stagewise generalized estimating equations. Statistics in Medicine, 39(22):2855-2868, 2020.

[12] Michael Lim and Trevor Hastie. Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3):627-654, 2015.

[13] Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data Mining, pages 995-1000. IEEE, 2010.

[14] Harshit Pande. Field-embedded factorization machines for click-through rate prediction. arXiv preprint arXiv:2009.09931, 2020.

[15] Yang Sun, Junwei Pan, Alex Zhang, and Aaron Flores. FM2: Field-matrixed factorization machines for recommender systems. In Proceedings of the Web Conference 2021, pages 2828-2837, 2021.

[16] Weinan Zhang, Tianming Du, and Jun Wang. Deep learning over multi-field categorical data. In European Conference on Information Retrieval, pages 45-57. Springer, 2016.

[17] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1149-1154. IEEE, 2016.

[18] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1161-1170, 2019.

[19] Yichen Xu, Yanqiao Zhu, Feng Yu, Qiang Liu, and Shu Wu. Disentangled self-attentive neural networks for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3553-3557, 2021.

[20] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning adaptive-order feature interactions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3609-3616, 2020.

[21] Francesco Croce, Maksym Andriushchenko, and Matthias Hein. Provable robustness of ReLU networks via maximization of linear regions. In the 22nd International Conference on Artificial Intelligence and Statistics, pages 2057-2066. PMLR, 2019.

[22] Zhiqiang Wang, Qingyun She, and Junlin Zhang. MaskNet: Introducing feature-wise multiplication to CTR ranking models by instance-guided mask. arXiv preprint arXiv:2102.07619, 2021.

[23] Feng Yu, Zhaocheng Liu, Qiang Liu, Haoli Zhang, Shu Wu, and Liang Wang. Deep interaction machine: A simple but effective model for high-order feature interactions. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2285-2288, 2020.

[24] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

[25] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021, pages 1785-1797, 2021.
[26] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1754-1763, 2018.

[27] Bo Chen, Yichao Wang, Zhirong Liu, Ruiming Tang, Wei Guo, Hongkun Zheng, Weiwei Yao, Muyu Zhang, and Xiuqiang He. Enhancing explicit and implicit feature interactions via information sharing for parallel deep CTR models. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3757-3766, 2021.

[28] Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, and Zhenhua Dong. FinalMLP: An enhanced two-stream MLP model for CTR prediction. arXiv preprint arXiv:2304.00902, 2023.

[29] Feng-Lei Fan, Hang-Cheng Dong, Zhongming Wu, Lecheng Ruan, Tieyong Zeng, Yiming Cui, and Jing-Xiao Liao. One neuron saved is one neuron earned: On parametric efficiency of quadratic networks. arXiv preprint arXiv:2303.06316, 2023.

[30] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

[31] Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019.

[32] Hao Liu, Minshuo Chen, Tuo Zhao, and Wenjing Liao. Besov function approximation and binary classification on low-dimensional manifolds using convolutional residual networks. In International Conference on Machine Learning, pages 6770-6780. PMLR, 2021.

[33] Hao Liu, Minshuo Chen, Siawpeng Er, Wenjing Liao, Tong Zhang, and Tuo Zhao. Benefits of overparameterized convolutional residual networks: Function approximation under smoothness constraint. In International Conference on Machine Learning, pages 13669-13703. PMLR, 2022.

[34] Sho Okumoto and Taiji Suzuki. Learnability of convolutional neural networks for infinite dimensional input via mixed and anisotropic smoothness. In International Conference on Learning Representations, 2022.

[35] Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui Zhang. BARS: Towards open benchmarking for recommender systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pages 2912-2923, New York, NY, USA, 2022. Association for Computing Machinery.

[36] Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2759-2769, 2021.

[37] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17, pages 1-7, 2017.

[38] Xiaoliang Ling, Weiwei Deng, Chen Gu, Hucheng Zhou, Cui Li, and Feng Sun. Model ensemble for click prediction in Bing search ads. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 689-698, 2017.

[39] Jason Steffener, Christian Habeck, Deirdre O'Shea, Qolamreza Razlighi, Louis Bherer, and Yaakov Stern. Differences between chronological and brain age are related to education and self-reported physical activity. Neurobiology of Aging, 40:138-144, 2016.
[40] Dafna Sussman, Rachel C Leung, M Mallar Chakravarty, Jason P Lerch, and Margot J Taylor. The developing human brain: age-related changes in cortical, subcortical, and cerebellar anatomy. Brain and Behavior, 6(4):e00457, 2016.

[41] Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford R Jack, William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett. Ways toward an early diagnosis in Alzheimer's disease: the Alzheimer's Disease Neuroimaging Initiative (ADNI). Alzheimer's & Dementia, 1(1):55-66, 2005.

[42] Johnny Wang, Maria J Knol, Aleksei Tiulpin, Florian Dubost, Marleen de Bruijne, Meike W Vernooij, Hieab HH Adams, M Arfan Ikram, Wiro J Niessen, and Gennady V Roshchupkin. Gray matter age prediction as a biomarker for risk of dementia. Proceedings of the National Academy of Sciences, 116(42):21213-21218, 2019.

[43] David H Salat, Randy L Buckner, Abraham Z Snyder, Douglas N Greve, Rahul SR Desikan, Evelina Busa, John C Morris, Anders M Dale, and Bruce Fischl. Thinning of the cerebral cortex in aging. Cerebral Cortex, 14(7):721-730, 2004.

[44] Rosemary Fama and Edith V Sullivan. Thalamic structures and associated cognitive functions: Relations with age and aging. Neuroscience & Biobehavioral Reviews, 54:29-37, 2015.