# modelling_interactions_in_highdimensional_data_with_backtracking__3cd66299.pdf

Journal of Machine Learning Research 17 (2016) 1-31 Submitted 11/13; Revised 3/16; Published 12/16

Modelling Interactions in High-dimensional Data with Backtracking

Rajen D. Shah r.shah@statslab.cam.ac.uk Statistical Laboratory University of Cambridge Cambridge, CB3 0WB, UK

Editor: Sara van de Geer

We study the problem of high-dimensional regression when there may be interacting variables. Approaches using sparsity-inducing penalty functions such as the Lasso can be useful for producing interpretable models. However, when the number variables runs into the thousands, and so even two-way interactions number in the millions, these methods may become computationally infeasible. Typically variable screening based on model ﬁts using only main eﬀects must be performed ﬁrst. One problem with screening is that important variables may be missed if they are only useful for prediction when certain interaction terms are also present in the model. To tackle this issue, we introduce a new method we call Backtracking. It can be incorporated into many existing high-dimensional methods based on penalty functions, and works by building increasing sets of candidate interactions iteratively. Models ﬁtted on the main eﬀects and interactions selected early on in this process guide the selection of future interactions. By also making use of previous ﬁts for computation, as well as performing calculations is parallel, the overall run-time of the algorithm can be greatly reduced. The eﬀectiveness of our method when applied to regression and classiﬁcation problems is demonstrated on simulated and real data sets. In the case of using Backtracking with the Lasso, we also give some theoretical support for our procedure. Keywords: high-dimensional data, interactions, Lasso, path algorithm

1. Introduction

In recent years, there has been a lot of progress in the ﬁeld of high-dimensional regression. Much of the development has centred around the Lasso (Tibshirani, 1996), which given a vector of responses Y Rn and design matrix X Rn p, solves

(ˆµ, ˆβ) := arg min (µ,β) R Rp { 1

2n Y µ1 Xβ 2 2 + λ β 1}, (1)

where 1 is an n-vector of ones and the regularisation parameter λ controls the relative contribution of the penalty term to the objective. The many extensions of the Lasso allow most familiar models from classical (low-dimensional) statistics to now be ﬁtted in situations where the number of variables p may be tens of thousands and even greatly exceed the number of observations n (see the monograph B uhlmann and van de Geer (2011b) and references therein).

2016 Rajen D. Shah.

However, despite the advances, ﬁtting models with interactions remains a challenge. Two issues that arise are:

(i) Since there are p(p 1)/2 possible ﬁrst-order interactions, the main eﬀects can be swamped by the vastly more numerous interaction terms and without proper regularisation, stand little chance of being selected in the ﬁnal model (see Figure 1b).

(ii) Monitoring the coeﬃcients of all the interaction terms quickly becomes infeasible as p runs into the thousands.

1.1 Related Work

For situations where p < 1000 or thereabouts and the case of two-way interactions, a lot of work has been done in recent years to address this need. To tackle (i), many of the proposals use penalty functions and constraints designed to enforce that if an interaction term is in the ﬁtted model, one or both main eﬀects are also present (Lin and Zhang, 2006; Zhao et al., 2009; Yuan et al., 2009; Radchenko and James, 2010; Jenatton et al., 2011; Bach et al., 2012a,b; Bien et al., 2013; Lim and Hastie, 2015; Haris et al., 2015). See also Turlach (2004) and Yuan et al. (2007), which consider modiﬁcations of the LAR algorithm Efron et al. (2004) that impose this type of condition. In the moderate-dimensional setting that these methods are designed for, the computational issue (ii) is just about manageable. However, when p is larger the situation of interest in this paper it typically becomes necessary to narrow the search for interactions. Comparatively little work has been done on ﬁtting models with interactions to data of this sort of dimension. An exception is the method of Random Intersection Trees (Shah and Meinshausen, 2014), which does not explicitly restrict the search space of interactions. However this is designed for a classiﬁcation setting with a binary predictor matrix and does not ﬁt a model but rather tries to ﬁnd interactions that are marginally informative. One option is to screen for important variables and only consider interactions involving the selected set. Wu et al. (2010) and others take this approach: the Lasso is ﬁrst used to select main eﬀects; then interactions between the selected main eﬀects are added to the design matrix, and the Lasso is run once more to give the ﬁnal model. The success of this method relies on all main eﬀects involved in interactions being selected in the initial screening stage. However, this may well not happen. Certain interactions may need to be included in the model before some main eﬀects can be selected. To address this issue, Bickel et al. (2010) propose a procedure involving sequential Lasso ﬁts which, for some predeﬁned number K, selects K variables from each ﬁt and then adds all interactions between those variables as candidate variables for the following ﬁt. The process continues until all interactions to be added are already present. However, it is not clear how one should choose K: a large K may result in a large number of spurious interactions being added at each stage, whereas a small K could cause the procedure to terminate before it has had a chance to include important interactions. Rather than adding interactions in one or more distinct stages, when variables are selected in a greedy fashion, the set of candidate interactions can be updated after each selection. This dynamic updating of interactions available for selection is present in the popular MARS procedure of Friedman (1991). One potential problem with this approach

Modelling Interactions

is that particularly in high-dimensional situations, overly greedy selection can sometimes produce unstable ﬁnal models and predictive performance can suﬀer as a consequence. The i FORT method of Hao and Zhang (2014) applies forward selection to a dynamically updated set of candidate interactions and main eﬀects, for the purposes of variable screening. In this work, we propose a new method we call Backtracking, for incorporating a similar model building strategy to that of MARS and i FORT into methods based on sparsityinducing penalty functions. Though greedy forward selection methods often work well, penalty function-based methods such as the Lasso can be more stable (see Efron et al. (2004)) and oﬀer a useful alternative.

1.2 Outline of the Idea

When used with the Lasso, Backtracking begins by computing the Lasso solution path, decreasing λ from . A second solution path, P2, is then produced, where the design matrix contains all main eﬀects, and also the interaction between the ﬁrst two active variables in the initial path. Continuing iteratively, subsequent solution paths P3, . . . , PT are computed where the set of main eﬀects and interactions in the design matrix for the kth path is determined based on the previous path Pk 1. Thus if in the third path, a key interaction was included and so variable selection was then more accurate, the selection of interactions for all future paths would beneﬁt. In this way information is used as soon as it is available, rather than at discrete stages as with the method of Bickel et al. (2010). In addition, if all important interactions have already been included by P3, we have a solution path unhindered by the addition of further spurious interactions. It may seem that a drawback of our proposed approach is that the computational cost of producing all T solution paths will usually be unacceptably large. However, computation of the full collection of solution paths is typically very fast. This is because rather than computing each of the solution paths from scratch, for each new solution path Pk+1, we ﬁrst track along the previous path Pk to ﬁnd where Pk+1 departs from Pk. This is the origin of the name Backtracking. Typically, checking whether a given trial solution is on a solution path requires much less computation than calculating the solution path itself, and so this Backtracking step is rather quick. Furthermore, when the solution paths do separate, the tail portions of the paths can be computed in parallel. An R (R Development Core Team, 2005) package for the method is available on the author s website.

1.3 Organisation of the Paper

The rest of the paper is organised as follows. In Section 2 we describe an example which provides some motivation for our Backtracking method. In Section 3 we develop our method in the context of the Lasso for the linear model. In Section 4, we describe how our method can be extended beyond the case of the Lasso for the linear model. In Section 5 we report the results of some simulation experiments and real data analyses that demonstrate the eﬀectiveness of Backtracking. Finally, in Section 6, we present some theoretical results which aim to give a deeper understanding of the way in which Backtracking works. Proofs are collected in the appendix.

2. Motivation

In this section we introduce a toy example where approaches that select candidate interactions based on selected main eﬀects will tend to perform poorly. We consider a linear model with interactions involving a design matrix X Rn p with n = 200, p = 500 and where

j=1 βj Xij +β7Xi1Xi2 +β8Xi3Xi4 +β9Xi5Xi6 +εi, εi N(0, σ2), i = 1, . . . , n. (2)

We take X with i.i.d. rows having a distribution such that Xi5 is uncorrelated with {Xij : j = 5}. We then choose β1, . . . , β9 in such a way that Xi5 is also uncorrelated with the response yet β5 = 0. The precise construction is detailed in the appendix. In order to select variable 5 using that Lasso, we would need to have already selected some important interactions. Thus if we ﬁrst select important main eﬀects using the Lasso, for example, it is very unlikely that variable 5 will be selected. Then if we add all two-way interactions between the selected variables and ﬁt the Lasso once more, the interaction between variables 5 and 6 will not be included. Of course, one can again add interactions between selected variables and compute another Lasso ﬁt, and then there is a chance the interaction will be selected. Thus it is very likely that at least three Lasso ﬁts will be needed in order to select the right variables. Figure 1a shows the result of applying the Lasso to data generated according to (2), σ chosen to give a signal-to-noise ratio (SNR) of 4, and

β = ( 1.25, 0.75, 0.75, 0.5, 2, 1.5, 2, 2, 1)T .

As expected, we see variable 5 is nowhere to be seen and instead many unwanted variables are selected as λ is decreased. Figure 1b illustrates the eﬀect of including all p(p 1)/2 possible interactions in the design matrix. Even in our rather moderate-dimensional situation, we are not able to recover the true signal. Though all the true interaction terms are selected, now neither variable 4 nor variable 5 are present in the solution paths and many false interactions are selected. Although this example is rather contrived, it illustrates how sometimes the right interactions need to be augmented to the design matrix in order for certain variables to be selected. Even when interactions are only present if the corresponding main eﬀects are too, main eﬀects can be missed by a procedure that does not consider interactions. In fact, we can see the same phenomenon occurring when the design matrix has i.i.d. Gaussian entries (see Section 5.1). Thus multiple Lasso ﬁts might be needed to have any chance of selecting the right model. This raises the question of which tuning parameters to use in the multiple Lasso ﬁts. One option, which we shall refer to as the iterated Lasso, is to select tuning parameters by cross-validation each time. A drawback of this approach, though, is that the number of interactions to add can be quite large if cross-validation chooses a large active set. This is often the case when the presence of interactions makes some important main eﬀects hard to distinguish from noise variables in the initial Lasso ﬁt. Then cross-validation may choose a low λ in order to try to select those variables, but this would result in many noise variables also being included in the active set.

Modelling Interactions

We take an alternative approach here and include suspected interactions in the design matrix as soon as possible. That is, if we progress along the solution path from λ = , and two variables enter the model, we immediately add their interaction to the design matrix and start computing the Lasso again. We could now disregard the original path, but there is little to lose, and possibly much to gain, in continuing the original path in parallel with the new one. We can then repeat this process, adding new interactions when necessary, and restarting the Lasso, whilst still continuing all previous paths in parallel. We show in the next section how computation can be made very fast since many of these solution paths will share the same initial portions.

3. Backtracking with the Lasso

In this section we introduce a version of the Backtracking algorithm applied to the Lasso (1). First, we present a naive version of the algorithm, which is easy to understand. Later in Section 3.2, we show that this algorithm performs a large number of unnecessary calculations, and we give a far more eﬃcient version.

3.1 A Naive Algorithm

As well as a base regression procedure, the other key ingredient that Backtracking requires is a way of suggesting candidate interactions based on selected main eﬀects, or more generally a way of suggesting higher-order interactions based on lower-order interactions. In order to discuss this and present our algorithm, we ﬁrst introduce some notation concerning interactions. Let X be the original n p design matrix, with no interactions. In order to consider interactions in our models, rather than indexing variables by a single number j, we use subsets of {1, . . . , p}. Thus by variable {1, 2}, we mean the interaction between variables 1 and 2, or in our new notation, variables {1} and {2}. When referring to main eﬀects {j} however, we will often omit the braces. As we are using the Lasso as the base regression procedure here, interaction {1, 2} will be the componentwise product of the ﬁrst two columns of X. We will write Xv Rn for variable v. The choice of whether and how to scale and centre interactions and main eﬀects can be a rather delicate one, where domain knowledge may play a key role. In this work, we will centre all main eﬀects, and scale them to have ℓ2-norm n. The interactions will be created using these centred and scaled main eﬀects, and they themselves will also be centred and scaled to have ℓ2-norm n. For C a set of subsets of {1, . . . , p} we can form a modiﬁed design matrix XC, where the columns of XC are given by the variables in C, centred and scaled as described above. Thus C is the set of candidate variables available for selection when design matrix XC is used. This subsetting operation will always be taken to have been performed before any further operations on the matrix, so in particular XT C means (XC)T . We will consider all associated vectors and matrices as indexed by variables, so we may speak of component {1, 2} of β, denoted β{1,2}, if β were multiplying a design matrix which included {1, 2}. Further, for any collection of variables A, we will write βA for the subvector whose components are those indexed by A. To represent an arbitrary variable which may be an interaction, we shall often use v or u and reserve j to index main eﬀects.

(a) Main eﬀects only

(b) All interactions added

(c) Step 3: {1, 2}, {2, 6}. {1, 6} added in step 2.

(d) Step 4: {1, 3}, {2, 3}, {3, 6}.

(e) Step 5: {1, 4}, {2, 4}, {3, 4}, {4, 6}.

(f) Step 6: {1, 5}, {2, 5}, {3, 5}, {4, 5}, {5, 6}.

Figure 1: For data generated as described in Section 2, the coeﬃcient paths against λ of the Lasso with main eﬀects only, (a); the Lasso with all interactions added, (b); and Backtracking with k = 3, . . . , 6, ((c) (d)); when applied to the example in Section 2. Below the Backtracking solution paths we give Ck \ Ck 1: the interactions which have been added in the current step. The solid red, green, yellow, blue, cyan and magenta lines trace the coeﬃcients of variables 1, . . . , 6 respectively, with the alternately coloured lines representing the corresponding interactions. The dotted blue and red coeﬃcient paths indicate noise main eﬀect ( NM ) and interaction ( NI ) terms respectively. Vertical dotted black and dashed grey lines give the values of λstart k and λadd k respectively.

Modelling Interactions

We will often need to express the dependence of the Lasso solution ˆβ (1) on the tuning parameter λ and the design matrix used. We shall write ˆβ(λ, C) when XC is the design matrix. We will denote the set of active components of a solution ˆβ by A(ˆβ) = {v : ˆβv = 0}. We now introduce a function I that given a set of variables A, suggests a set of interactions to add to the design matrix. The choice of I we use here is as follows:

I(A) = {v {1, . . . , p} : for all u v, u = , u A}.

In other words, I(A) is the set of variables not in A, all of whose corresponding lower order interactions are present in A. To ease notation, when A contains only main effects j1, . . . , js, we will write I(j1, . . . , js) = I(A). For example, I(1, 2) = {{1, 2}}, and I(1, 2, 3) = {{1, 2}, {2, 3}, {1, 3}}. Note {1, 2, 3} / I(1, 2, 3) as the lower order interaction {1, 2} of {1, 2, 3} is not in {{1}, {2}, {3}}, for example. Other choices for I can be made, and we discuss some further possibilities in Section 4. Backtracking relies on a path algorithm for computing the Lasso on a grid of λ values λ1 > λL. Several algorithms are available and coordinate descent methods (Friedman et al., 2010) appears to work well in practice. We are now in a position to introduce a naive version of our Backtracking algorithm applied to the Lasso (Algorithm 1). We will assume that the response Y is centred in addition to the design matrix, so no intercept term is necessary.

Algorithm 1 A naive version of Backtracking with the Lasso Set T to be the (given) maximum number of candidate interaction sets to generate. Let the initial candidate set consist of just main eﬀects: C1 = {{1}, . . . , {p}}. Set the index for the candidate sets k = 1. Let λstart 1 = λ1, the largest λ value on the grid. In the steps which follow, we maintain a record of the set of variables which have been non-zero at any point in the algorithm up to the current point (an ever active set , A).

1. Compute the solution path of the Lasso with candidate set Ck from λstart k onwards until the ever active set A has I(A) Ck (if the smallest λ value on the grid is reached then go to 5). Let the λ value where this occurs be λadd k . We will refer to this solution path as Pk.

2. Set Ck+1 = Ck I(A) so the next candidate set contains all interactions between variables in the ever active set.

3. Set λstart k+1 = λ1.

4. Increment k. If k > T go to 5, otherwise go back to 1.

5. For each k complete the solution path Pk by continuing it until λ = λL. Computing these ﬁnal pieces of the solution paths can be done in parallel.

The algorithm computes Lasso solution paths whose corresponding design matrices include interactions chosen based on previous paths. The quantity λadd k records the value of λ at which interaction terms were added to the set of candidates Ck. Here λstart k is a redundant quantity and can be replaced everywhere with λ1 to give the same algorithm.

We include it at this stage though to aid with the presentation of an improved version of the algorithm where λstart k in general takes values other than λ1. We note that the ﬁnal step of completing the solution paths can be carried out as the initial paths are being created, rather than once all initial paths have been created. Though here the algorithm can include three-way or even higher order interactions, it is straightforward to restrict the possible interactions to be added to ﬁrst-order interactions, for example.

3.2 An Improved Algorithm

The process of performing multiple Lasso ﬁts is computationally cumbersome, and an immediate gain in eﬃciency can be realised by noticing that the ﬁnal collection of solution paths is in fact a tree of solutions: many of the solution paths computed will share the same initial portions. To discuss this, we ﬁrst recall the KKT conditions for the Lasso dictate that ˆβ is a solution to (1) when the design matrix is XC if and only if

1 n XT v (Y XC ˆβ) = λsgn(ˆβv) for ˆβv = 0 (3)

1 n|XT v (Y XC ˆβ)| λ for ˆβv = 0. (4)

Note the ˆµXT v 1 term vanishes as the columns of XC are centred. We see that if for some λ

1 n XT Ck+1\Ck(Y XCk ˆβ(λ, Ck)) λ, (5)

ˆβCk+1\Ck(λ, Ck+1) = 0, ˆβCk(λ, Ck+1) = ˆβ(λ, Ck).

Thus given solution path Pk, we can attempt to ﬁnd the smallest λl such that (5) holds. Up to that point then, path Pk+1 will coincide with Pk and so those Lasso solutions need not be re-computed. Note that verifying (5) is a computationally simple task requiring only O(|Ck+1 \ Ck|n) operations. Our ﬁnal Backtracking algorithm therefore replaces step 3 of Algorithm 1 with the following:

3a. Find the smallest λ1 λl λadd k such that (5) holds with λ = λl and set this to be λstart k+1 . If no such λl exists, set λstart k+1 to be λ1.

Figures 1c 1f show steps 3 6 (i.e. k = 3, . . . , 6) of Backtracking applied to the example described in Section 2. Note that Figure 1a is in fact step 1. Step 2 is not shown as the plot looks identical to that in Figure 1a. We see that when k = 6, we have a solution path where all the true variable and interaction terms are active before any noise variables enter the coeﬃcient plots. We can further speed up the algorithm by ﬁrst checking if Pk coincides with Pk+1 at λadd k . If not, we can perform a bisection search to ﬁnd any point where Pk and Pk+1 agree, but after which they disagree. This avoids checking (5) for every λl up to λadd k . We will work with the simpler version of Backtracking here using step 3a, but use this faster version in our implementation.

Modelling Interactions

4. Further Applications of Backtracking

Our Backtracking algorithm has been presented in the context of the Lasso for the linear model. However, the real power of the idea is that it can be incorporated into any method that produces a path of increasingly complex sparse solutions by solving a family of convex optimisation problems parametrised by a tuning parameter. For the Backtracking step, the KKT conditions for these optimisation problems provide a way of checking whether a given trial solution is an optimum. As in the case of the Lasso, checking whether the KKT conditions are satisﬁed typically requires much less computational eﬀort than computing a solution from scratch. Below we brieﬂy sketch some applications of Backtracking to a few of the many possible methods with which it can be used.

4.1 Multinomial Regression

An example, which we apply to real data in Section 5.2, is multinomial regression with a group Lasso (Yuan and Lin, 2006) penalty. Consider n observations of a categorical response that takes J levels, and p associated covariates. Let Y be the indicator response matrix, with ijth entry equal to 1 if the ith observation takes the jth level, and 0 otherwise. We model

P(Yij = 1) := Πij(µ , β ; XS ) := exp µ j + XS β j

PJ j =1 exp µ j + XS β j

Here µ is a vector of intercept terms and β is a |S | J matrix of coeﬃcients; β j denotes the jth column of β . This model is over-parametrised, but regularisation still allows us produce estimates of µ and β and hence also of Π (see Friedman et al. (2010)). When our design matrix is XC, these estimates are given by (ˆµ, ˆβ) := arg min µ,β Q(µ, β; λ) where

Q(µ, β; λ) := 1

j=1 YT j (µj1+XCβj) 1

j=1 exp(µj1 + XCβj

v C (βT )v 2.

The functions log and exp are to be understood as applied componentwise and the rows of β are indexed by elements of C. To derive the Backtracking step for this situation, we turn to the KKT conditions which characterise the minima of Q:

1 n{YT ΠT (ˆµ, ˆβ; XC)}1 = 0,

1 n{YT ΠT (ˆµ, ˆβ; XC)}Xv = λ (ˆβ T )v

(ˆβ T )v 2 for (ˆβ T )v = 0,

1 n {YT ΠT (ˆµ, ˆβ; XC)}Xv 2 λ for (ˆβ T )v = 0.

Thus, analogously to (5), for D C, (ˆβ T (λ, D))D\C = 0 and (ˆβ T (λ, D))C = ˆβ T (λ, C) if and only if max v D\C

1 n {YT ΠT (ˆµ(λ, C), ˆβ(λ, C); XC)}Xv 2 λ.

4.2 Structural Sparsity

Although in our Backtracking algorithm, interaction terms are only added as candidates for selection when all their lower order interactions and main eﬀects are active, this hierarchy in the selection of candidates does not necessarily follow through to the ﬁnal model: one can have ﬁrst-order interactions present in the ﬁnal model without one or more of their main eﬀects, for example. One way to enforce the hierarchy constraint in the ﬁnal model is to use a base procedure which obeys the constraint itself. Examples of such base procedures are provided by the Composite Absolute Penalties (CAP) family (Zhao et al., 2009). Consider the linear regression setup with interactions. For simplicity we only describe Backtracking with ﬁrst-order interactions. Let C be the candidate set and let I = C \ C1 be the (ﬁrst-order) interaction terms in C. In order to present the penalty, we borrow some notation from Combinatorics. Let C(r) 1 denote the set of r-subsets of C1. For A C(r) 1 and r 1, deﬁne

l(A) = {v C(r 1) 1 : v u for some u A}

u(A) = {v C(r+1) 1 : v u for some u A}

These are known as the lower shadow and upper shadow respectively (Bollob as, 1986). Our objective function Q is given by

Q(µ, β) = 1 2n Y µ1 XCβ 2 2 + λ βC1\ l(I) 1 + λ X

v l(I) β{v} ( u({v}) I) γ + λ βI 1,

where γ > 1. For example, if C = {{1}, . . . , {4}, {1, 2}, {2, 3}}, then omitting the factor of λ, the penalty terms in Q are

|β4| + (β1, β{1,2})T γ + (β2, β{1,2}, β{2,3})T γ + (β3, β{2,3})T γ + |β{1,2}| + |β{2,3}|.

The form of this penalty forces interactions to enter the active set only after or with their corresponding main eﬀects. The KKT conditions for this optimisation take a more complicated form than those for the Lasso. Nevertheless, checking they hold for a trial solution is an easier task than computing a solution.

4.3 Nonlinear Models

If a high-dimensional additive modelling method (Ravikumar et al., 2009; Meier et al., 2009) is used as the base procedure, it is possible to ﬁt nonlinear models with interactions. Here each variable is a collection of basis functions, and to add an interaction between variables, one adds the tensor product of the two collections of basis functions, penalizing the new interaction basis functions appropriately. Structural sparsity approaches can also be used here. The VANISH method of Radchenko and James (2010) uses a CAP-type penalty in nonlinear regression, and this can be used as a base procedure in a similar way to that sketched above.

Modelling Interactions

4.4 Introducing more Candidates

In our description of the Backtracking algorithm, we only introduce an interaction term when all of its lower order interactions and main eﬀects are active. Another possibility, in the spirit of MARS (Friedman, 1991), is to add interaction terms when any of their lower order interactions or main eﬀects are active. As at the kth step of Backtracking, there will be roughly kp extra candidates, an approach that can enforce the hierarchical constraint may be necessary to allow main eﬀects to be selected from amongst the more numerous interaction candidates. The key point to note is that if the algorithm is terminated after T steps, we are having to deal with roughly at most Tp variables rather than O(p2), the latter coming from including all ﬁrst-order interactions. Another option proposed by a referee is to augment the initial set of candidates with interactions selected through a simple marginal screening step. If only pairwise interactions are considered here, then this would require O(p2n) operations. Though this would be infeasible for very large p, for moderate p this would allow important interactions whose corresponding main eﬀects are not strong to be selected.

5. Numerical Results

In this section we evaluate the performance of Backtracking on both simulated and real data sets.

5.1 Simulations

Here we consider ﬁve numerical studies designed to demonstrate the eﬀectiveness of Backtracking with the Lasso and also highlight some of the drawbacks of using the Lasso with main eﬀects only, when interactions are present. In each of the ﬁve scenarios, we generated 200 design matrices with n = 250 observations and p = 1000 covariates. The rows of the design matrices were sampled independently from Np(0, Σ) distributions. The covariance matrix Σ was chosen to be the identity in all scenarios except scenario 2, where

Σij = 0.75 ||i j| p/2|+p/2.

Thus in this case, the correlation between the components decays exponentially with the distance between them in Z/p Z. We created the responses according to the linear model with interactions and set the intercept to 0:

Y = XS β S + ε, εi i.i.d. N(0, σ2). (6)

The error variance σ2 was chosen to achieve a signal-to-noise ratio (SNR) of either 2 or 3. The set of main eﬀects in S , S 1, was 1, . . . , 10. The subset of variables involved in interactions was 1, . . . , 6. The set of ﬁrst-order interactions in S chosen in the diﬀerent scenarios, S 2, is displayed in Table 1, and we took S = S 1 S 2 so S contained no higher order interactions. In each simulation run, β S 1 was ﬁxed and given by

(2, 1.5, 1.25, 1, 1, 1, 1, 1, 1, 1)T .

Scenario S 2 1 2 3 {{1, 2}, {3, 4}, {5, 6}} 4 {{1, 2}, {1, 3}, . . . , {1, 6}} 5 I(1, 2, 3) I(4, 5, 6)

Table 1: Simulation settings.

Each component of β S 2 was chosen to be q

β S 1 2 2/ |S 1|. Thus the squared magnitude of the interactions was equal to average of the squared magnitudes of the main eﬀects.

In all of the scenarios, we applied four methods: the Lasso using only the main eﬀects; iterated Lasso ﬁts; marginal screening for interactions followed by the Lasso; and the Lasso with Backtracking. Note that due to the size of p in these examples, most of the methods for ﬁnding interactions in lower-dimensional data discussed in Section 1, are computationally impractical here.

For the iterated Lasso ﬁts, we repeated the following process. Given a design matrix, ﬁrst ﬁt the Lasso. Then apply 5-fold cross-validation to give a λ value and associated active set. Finally add all interactions between variables in this active set to the design matrix, ready for the next iteration. For computational feasibility, the procedure was terminated when the number of variables in the design matrix exceeded p + 250 249/2.

With the marginal screening approach, we selected the 2p interactions with the largest marginal correlation with the response and added them to the design matrix. Then a regular Lasso was performed on the augmented matrix of predictors.

Additionally, in scenarios 3 5, we applied the Lasso with all main eﬀects and only the true interactions. This theoretical Oracle approach provided a gold standard against which to test the performance of Backtracking.

We used the procedures mentioned to yield active sets on which we applied OLS to give a ﬁnal estimator. To select the tuning parameters of the methods we used crossvalidation randomly selection 5 folds but repeating this a total of 5 times to reduce the variance of the cross-validation scores. Thus for each λ value we obtained an estimate of the expected prediction error that was an average over the observed prediction errors on 25 (overlapping) validation sets of size n/5 = 50. Note that for both Backtracking and the iterated Lasso, this form of cross-validation chose not just a λ value but also a path rank. When using Backtracking, the size of the active set was restricted to 50 and the size of Ck to p + 50 49/2 = 1225, so T was at most 50.

In scenarios 1 and 2, the results of the methods were almost indistinguishable except that the screening approach performed far worse in scenario 1 where it tended to select several false interactions which in turn hampered the selection of main eﬀects and resulted in a much larger prediction error.

The results of scenarios 3 5, where the signal contains interactions, are more interesting and given in Table 2. For each scenario, method and SNR level, we report 5 statistics. L2sq is the expected squared distance of the signal f and our prediction functions ˆf based

Modelling Interactions

on training data (Ytrain, Xtrain), evaluated at a random independent test observation xnew:

Exnew,Ytrain,Xtrain(f {xnew) ˆf(xnew; Ytrain, Xtrain)}2.

FP Main and FP Inter are the numbers of noise main eﬀects and noise interaction terms respectively, incorrectly included in the ﬁnal active set. FN Main and FN Inter are the numbers of true main eﬀects and interaction terms respectively, incorrectly excluded from the ﬁnal active set. For all the statistics presented, lower numbers are to be preferred. However, the higher number of false selections incurred by both Backtracking and the Oracle procedure compared to using the main eﬀects only or iterated Lasso ﬁts, is due to the model selection criterion being the expected prediction error. It should not be taken as an indication that the latter procedures are performing better in these cases. Backtracking performs best out of the four methods compared here. Note that under all of the settings, iterated Lasso ﬁts incorrectly selects more interaction terms than Backtracking. We see that the more careful way in which Backtracking adds candidate interactions, helps here. Unsurprisingly, ﬁtting the Lasso on just the main eﬀects performs rather poorly in terms of predictive performance. However, it also fails to select important main eﬀects; Backtracking and Iterates have much lower main eﬀect false negatives. The screening approach appears to perform worst here. This is partly because it is not making use of the fact that in all of the examples considered, the main eﬀects involved in interactions are also informative. However, its poor performance is also due the fact that too many false interactions are added to the design matrix after the screening stage. Reducing the number added may help to improve results, but choosing the number of interactions to include via cross-validation, for example, would be computationally costly, unless a Backtracking-type strategy of the sort introduced in this paper were used. We also note that for very large p, marginal screening of interactions would be infeasible due to the quadratic scaling in complexity with p.

5.2 Data Analyses

In this section, we look at the performance of Backtracking using two base procedures, the Lasso for the linear model and the Lasso for multinomial regression, on a regression and a classiﬁcation data set. As competing methods, we consider simply using the base procedures ( Main ), iterated Lasso ﬁts ( Iterated ), Lasso following marginal screening for interactions ( Screening ), Random Forests (Breiman, 2001), hier Net (Bien et al., 2013) and MARS (Friedman, 1991) (implemented using Hastie et al. (2013)). Note that we do not view the latter two methods as competitors of Backtracking, as they are designed for use on lower dimensional data sets than Backtracking is capable of handling. However, it is still interesting to see how the methods perform on data of dimension that is perhaps approaching the upper end of what is easily manageable for methods such as hier Net and MARS, but at the lower end of what one might use Backtracking on. Below we describe the data sets used which are both from the UCI machine learning repository (Asuncion and Newman, 2007).

SNR = 2 SNR = 3

Scenario Statistic Main Iterate Screening

Backtracking Oracle Main Iterate Screening

Backtracking Oracle

L2-sq 6.95 1.40 12.87 1.21 0.82 5.67 0.27 9.24 0.27 0.18 FP Main 3.18 2.43 0.01 2.89 3.19 1.91 0.65 0.00 0.73 0.79 FN Main 1.26 0.38 7.24 0.24 0.14 0.52 0.05 5.14 0.04 0.01 FP Inter 0.00 0.93 11.05 0.45 0.00 0.00 0.27 13.57 0.12 0.00 FN Inter 3.00 0.18 2.06 0.14 0.01 3.00 0.03 1.39 0.04 0.00

L2-sq 12.05 3.25 17.68 2.72 1.68 10.44 0.63 15.19 0.41 0.31 FP Main 2.22 3.88 0.02 5.34 7.05 2.58 1.80 0.04 2.08 2.21 FN Main 3.12 0.90 8.13 0.61 0.26 1.77 0.11 6.94 0.04 0.00 FP Inter 0.00 2.50 12.33 0.77 0.00 0.00 1.77 17.90 0.28 0.00 FN Inter 5.00 0.66 4.07 0.51 0.08 5.00 0.08 3.39 0.03 0.00

L2-sq 14.12 5.08 19.96 4.52 2.14 12.84 1.56 16.99 1.17 0.44 FP Main 3.07 4.75 0.02 5.87 8.57 3.43 3.01 0.05 3.23 3.77 FN Main 3.20 1.26 8.26 0.98 0.33 2.35 0.25 7.00 0.19 0.02 FP Inter 0.00 3.28 17.97 0.87 0.00 0.00 3.05 21.92 0.55 0.00 FN Inter 6.00 1.34 5.00 1.23 0.14 6.00 0.39 4.14 0.30 0.00

Table 2: Simulation results.

5.2.1 Communities and Crime

This data set available at http://archive.ics.uci.edu/ml/datasets/Communities+and+ Crime+Unnormalized contains crime statistics for the year 1995 obtained from FBI data, and national census data from 1990, for various towns and communities around the USA. We took violent crimes per capita as our response: violent crime being deﬁned as murder, rape, robbery, or assault. The data set contains two diﬀerent estimates of the populations of the communities: those from the 1990 census and those from the FBI database in 1995. The latter was used to calculate our desired response using the number of cases of violent crimes. However, in several cases, the FBI population data seemed suspect and we discarded all observations where the maximum of the ratios of the two available population estimates diﬀered by more than 1.25. In addition, we removed all observations that were missing a response and several variables for which the majority of values were missing. This resulted in a data set with n = 1903 observations and p = 101 covariates. The response was scaled to have empirical variance 1.

5.2.2 ISOLET

This data set consists of p = 617 features based on the speech waveforms generated from utterances of each letter of the English alphabet. The task is to learn a classiﬁer which can determine the letter spoken based on these features. The data set is available from http://archive.ics.uci.edu/ml/datasets/ISOLET; see Fanty and Cole (1991) for more background on the data. We consider classiﬁcation on the notoriously challenging E-set consisting of the letters B , C , D , E , G , P , T , V and Z (pronounced zee ). As there were 150 subjects and each spoke each letter twice, we have n = 2700 observations spread equally among 9 classes. The dimension of this data is such that MARS and hier Net could not be applied.

Modelling Interactions

5.3 Methods and Results

For the Communities and crime data set, we used the Lasso for the linear model as the base regression procedure for Backtracking and Iterates. Since the per capita violent crime response was always non-negative, the positive part of the ﬁtted values was taken. For Main, Backtracking, Iterates, Screening and hier Net, we employed 5-fold cross-validation with squared error loss to select tuning parameters. For MARS we used the default settings for pruning the ﬁnal ﬁts using generalised cross-validation. With Random Forests, we used the default settings on both data sets. For the classiﬁcation example, penalised multinomial regression was used (see Section 4.1) as the base procedure for Backtracking and Iterates, and the deviance was used as the loss function for 5-fold cross-validation. In all of the methods except Random Forests, we only included ﬁrst-order interactions. When using Backtracking, we also restricted the size of Ck to p + 50 49/2 = p + 1225. To evaluate the procedures, we randomly selected 2/3 for training and the remaining 1/3 was used for testing. This was repeated 200 times for each of the data sets. Note that we have speciﬁcally chosen data sets with n large as well as p large. This is to ensure that comparisons between the performances of the methods can be made with more accuracy. For the regression example, out-of-sample squared prediction error was used as a measure of error; for the classiﬁcation example, we used out-of-sample misclassiﬁcation error with 0 1 loss. The results are given in Table 3. Random Forests has the lowest prediction error on the regression data set, with Backtracking not far behind, whilst Backtracking wins in the classiﬁcation task, and in fact achieves strictly lower misclassiﬁcation error than all the other methods on 90% of all test samples. Note that a direct comparison with Random Forests is perhaps unfair, as the latter is a black-box procedure whereas Backtracking is aiming for a more interpretable model. MARS performs very poorly indeed on the regression data set. The enormous prediction error is caused by the fact that whenever observations corresponding to either New York or Los Angeles were in the test set, MARS predicted their responses to be far larger than they were. However, even with these observations removed, the instability of MARS meant that it was unable to give much better predictions than an intercept-only model. Hier Net performs well on this data set, though it is worth noting that we had to scale the interactions to have the same ℓ2-norm as the main eﬀects to get such good results (the default scaling produced error rates worse than that of an intercept-only model). Backtracking does better here. One reason for this is that the because the main eﬀects are reasonably strong in this case, a low amount of penalisation works well. However, because with hier Net, the penalty on the interactions is coupled with the penalty on the main eﬀects, the ﬁnal model tended to include close to two hundred interaction terms. The Screening approach similarly suﬀers from including too many interactions and performs only a little better than a main eﬀects only ﬁt. The way that Backtracking creates several solution paths with varying numbers of interaction terms means that it is possible to ﬁt main eﬀects and a few interactions using a low penalty without this low penalisation opening the door to many other interaction terms. The iterated Lasso approach also has this advantage, but as the number of interactions are increased in discrete stages, it can miss a candidate set with the right number of interactions that may be picked up by the more continuous model building process used

Error Method Communities and crime ISOLET Main 000000.414 (6.5 10 3) 00000.0641 (4.7 10 4) Iterate 000000.384 (5.9 10 3 00000.0641 (4.7 10 4) Screening 000000.390 (7.8 10 3) - Backtracking 000000.365 (3.7 10 3) 00000.0563 (4.5 10 4) Random Forest 000000.356 (2.4 10 3) 00000.0837 (6.0 10 4) hier Net 000000.373 (4.7 10 3) - MARS 005580.586 (3.1 103) -

Table 3: Real data analyses results. Average error rates over 200 training testing splits are given, with standard deviations of the results divided by

200 in parentheses.

by Backtracking. This occurs in a rather extreme way with the ISOLET data set where, since in the ﬁrst stage of the iterated Lasso, cross-validation selected far too many variables (> 250), the second and subsequent steps could not be performed. This is why the results are identical to using the main eﬀects alone.

6. Theoretical Properties

Our goal in this section is to understand under what circumstances Backtracking with the Lasso can arrive at a set of candidates, C , that contains all of the true interactions, and only a few false interactions. On the event on which this occurs, we can then apply many of the existing results on the Lasso, to show that the solution path ˆβ(λ, C ) has certain properties. As an example, in Section 6.2 we give suﬃcient conditions for the existence of a λ such that {v : ˆβv(λ , C ) = 0} equals the true set of variables. We work with the normal linear model with interactions,

Y = µ 1 + XS β S + ε, (7)

where εi i.i.d. N(0, σ2), and to ensure identiﬁability, XS has full column rank. We will assume that S = S 1 S 2, where S 1 and S 2 are main eﬀects and two-way interactions respectively. Let the interacting main eﬀects be I ; formally, I is the smallest set of main eﬀects such that I(I ) S 2. Assume I S 1 so interactions only involve important main eﬀects. Let sl = |S l |, l = 1, 2 and set s = s1 + s2. Deﬁne C = C1 I(S 1). Note that C

contains S but not additional interactions from any variables from C1 \ S 1. Although the Backtracking algorithm was presented for a base path algorithm that computed solutions at only discrete values, for the following results, we need to imagine an idealised algorithm which computes the entire path of solutions. In addition, we will assume that we only allow ﬁrst-order interactions in the Backtracking algorithm, and that T s1. We ﬁrst consider the special case where the design matrix is derived from a random matrix with i.i.d. multivariate normal rows, before describing a result for ﬁxed design.

Modelling Interactions

6.1 Random Normal Design

Let the random matrix Z have independent rows distributed as Np(0, Σ). Suppose that XC1, the matrix of main eﬀects, is formed by scaling and centring Z. We consider an asymptotic regime where X, f , S , σ2 and p can all change as n , though we will suppress their dependence on n in the notation. Furthermore, for sets of indices S, M {1, . . . , p}, let ΣS,M R|S| |M| denote the submatrix of Σ formed from those rows and columns of Σ indexed by S and M respectively. For any positive semi-deﬁnite matrix A, we will let cmin(A) and cmax(A) denote its minimal and maximal eigenvalues respectively. For sequences an, bn, by an bn we mean bn = o(an). We make the following assumptions.

A1. cmin(ΣS 1,S 1) c > 0.

A2. supτ Rs1: τ 1 ΣN,S 1Σ 1 S 1,S 1τ δ < 1.

A3. s4 1 log(p)/n 0 and s8 1 log(s1)2/n 0.

min j I |β j | s1(σ log p + s1 + log p) n +

s3 1 log(s1)

A5. β S 2 2 is bounded as n and cmax(ΣS 1,S 1) c < .

A1 is a standard assumption in high-dimensional regression and is, for example, implied by the compatibility constant of B uhlmann and van de Geer (2011a) being bounded away from zero. A2 is closely related to irrepresentable conditions (see Meinshausen and B uhlmann (2006), Zhao and Yu (2006), Zou (2006), B uhlmann and van de Geer (2011a), Wainwright (2009)), which are used for proving variable selection consistency of the Lasso. Note that although here the signal may contain interactions, our irrepresentable-type condition only involves main eﬀects. A3 places restrictions on the rates at which s1 and p can increase with n. The ﬁrst condition involving log(p) is somewhat natural as s2 1 log(p)/n 0 would typically be required in order to show ℓ1 estimation consistency of β where only s1 main eﬀects are present; here our eﬀective number of variables is s1 s s2 1. The second condition restricts the size of s1 more stringently but is nevertheless weaker than equivalent conditions in Hao and Zhang (2014). A4 is a minimal signal strength condition. The term involving σ is the usual bound on the signal strength required in results on variable selection consistency with the Lasso when there are s2 1 non-zero variables. Due to the presence of interactions, the terms not involving σ place additional restrictions on the sizes of non-zero components of β even when σ = 0. A5 ensures that the model is not too heavily misspeciﬁed in the initial stages of the algorithm, where we are regressing on only main eﬀects. The following theorem states that given the assumptions above, with probability tending to 1 we are guaranteed a candidate set will be produced by our algorithm which contains all true interactions and no interactions involving a noise variable.

Theorem 1 Assuming A1 A5, the probability that there exists a k such that C Ck S tends to 1 as n .

6.2 Fixed Design

The result for a random normal design above is based on a corresponding result for ﬁxed design which we present here. In order for Backtracking not to add any interactions involving noise variables, to begin with, one pair of interacting signal variables must enter the solution path before any noise variables. Other interacting signal variables need only become active after the interaction between this ﬁrst pair has become active. Thus we need that there is some ordering of the interacting variables where each variable only requires interactions between those variables earlier in the order to be present before it can become active. Variables early on in the order must have the ability to be selected when there is serious model misspeciﬁcation as few interaction terms will be available for selection. Variables later in the order only need to have the ability to be selected when the model is approximately correct. Note that a signal variable having a coeﬃcient large in absolute value does not necessarily ensure that it becomes active before any noise variable. Indeed, in our example in Section 2, variable 5 did not enter the solution path at all when only main eﬀects were present, but had the largest coeﬃcient. Write f for XS βS , and for a set S such that XS has full column rank, deﬁne βS := (XT SXS) 1XT Sf .

Intuitively what should matter are the sizes of the appropriate coeﬃcients of βS for suitable choices of S. In the next section, we give a suﬃcient condition based on βS for a variable v S to enter the solution path before any variable outside S.

6.2.1 The Entry Condition

Let PS = XS(XT SXS) 1XT S denote orthogonal projection on to the space spanned by the columns of XS. Further, for any two candidate sets S, M that are sets of subsets of {1, . . . , p}, deﬁne

Now given a set of candidates, C, let v S C and write M = C \ S. For η > 0, we shall say that the Ent(v, S, C; η) condition holds if, XS has full column rank, and the following holds,

sup τ S R|S|: τ S 1 ˆΣM,S ˆΣ 1 S,Sτ S < 1, (8)

|βS v | > max u M

n XT u (I PS)f + 2η

1 ˆΣ 1 S,S ˆΣS,{u} 1 + η

( ˆΣ 1 S,S)v 1. (9)

In Lemma 4 given in the appendix, we show that this condition is suﬃcient for variable v to enter the active set before any variable in M, when the set of candidates is C and XT Cε η. In addition, we show that v will remain in the active set at least until some variable from M enters the active set. The second part of the entry condition (9) asserts that coeﬃcient v of the regression of f on XS must exceed a certain quantity that we now examine in more detail. The

Modelling Interactions

1 n XT u (I PS)f term is the sample covariance between Xu, which is one of the columns of XM, and the residual from regressing f on XS. Note that the more of S that S contains, the closer this will be to 0. To understand the ( ˆΣ 1 S,S)v 1 term, without loss of generality take v as {1} and write b = ˆΣS\{v},{v} and D = ˆΣS\{v},S\{v}. For any square matrix ˆΣ, let cmin( ˆΣ) denote its minimal eigenvalue. Using the formula for the inverse of a block matrix and writing s for |S|, we have

( ˆΣ 1 S,S)v 1 =

1 + b T (D bb T ) 1b (D bb T ) 1b

1 + b 2 2 + s 1 b 2

cmin( ˆΣS,S) .

In the ﬁnal line we have used the Cauchy Schwarz inequality and the fact that if w is a unit eigenvector of D bb T with minimal eigenvalue, then

cmin(D bb T ) = ˆΣS,S

2 cmin( ˆΣS,S) q

1 + |b T w |2 cmin( ˆΣS,S).

Thus when variable v is not too correlated with the other variables in S, and so b 2 is small, ( ˆΣ 1 S,S)v 1 will not be too large. Even when this is not the case, we still have the bound

( ˆΣ 1 S,S)v 1

cmin( ˆΣS,S) .

Turning now to the denominator, ˆΣ 1 S,S ˆΣS,{u} 1 is the ℓ1-norm of the coeﬃcient of regression of Xu on XS, and the maximum of this quantity over u M gives the left-hand side of (8). Thus when u is highly correlated with many of the variables in S, ˆΣ 1 S,S ˆΣS,{u} 1 will be large. On the other hand, in this case one would expect (I PS)Xu 2 to be small, and so to some extent the numerator and denominator compensate for each other.

6.2.2 Statement of Results

Without loss of generality assume I = {1, . . . , |I |}. Also let J = {I(A) : A S 1}. Our formal assumption corresponding to the discussion at the beginning of Section 6 is the following.

The entry order condition. There is some ordering of the variables in I , which without loss of generality we take to simply be 1, . . . , |I |, such that for each j I , we have,

For all A J with I(1, . . . , j 1) A I(S 1)

Ent(j, S 1 B, C1 A; η) holds for some A S 2 B A.

η = η(t; n, p, s1, σ) = σ

t2 + 2 log(p + s2 1) n .

First we discuss the implications for variable 1. The condition ensures that whenever the candidate set is enlarged from C1 to also include any set of interactions built from S 1, variable 1 enters the active set before any variable outside I(S 1), and moreover, it remains in the active set at least until a variable outside I(S 1) enters. For j > 2, we see that the enlarged candidate sets for which we require the entry conditions to hold, are fewer in number. Variable |I | only requires the entry condition to hold for candidate sets that at least include I(1, . . . , |I | 1) and thus include almost all of S . What this means is that we require some strong interacting variables, for which when f is regressed onto a variety of sets of variables containing them (some of which contain only a few of the true interaction variables), always have large coeﬃcients. Given the existence of such strong variables, other interacting variables need only have large coeﬃcients when f is regressed onto sets containing them that also include many true interaction terms. Note that the equivalent result for the success of the strategy that simply adds interactions between selected main eﬀects would essentially require all main eﬀect involved in interactions to satisfy the conditions imposed on the variables 1 and 2 here. Going back to the example in Section 2, variable 5 has |βS 5 | 0 for all S {1, . . . , 6}, but |βS 5 | > 0 once {1, 2} S or {3, 4} S.

Theorem 2 Assume the entry order condition holds. With probability at least 1 exp( t2/2), there exists a k such that C Ck S .

The following corollary establishes variable selection consistency under some additional conditions.

Corollary 3 Assume the entry order condition holds. Writing N = C \S , further assume

ˆΣN,S ˆΣ 1 S ,S sgn(β S ) < 1;

and that for all v S ,

|β v| > η sgn(β S )T ( ˆΣ 1 S ,S )v

1 ˆΣN,S ˆΣ 1 S ,S sgn(β S ) + ξ,

ξ = ξ(t; n, s, σ, cmin( ˆΣS ,S )) = σ

t2 + 2 log(s)

ncmin( ˆΣS ,S ) .

Then with probability at least 1 3 exp( t2/2), there exist k and λ such that

A( β(λ , Ck )) = S .

Note that if we were to simply apply the Lasso to the set of candidates Call := C1 I(C1) (i.e. all possible main eﬀects and their ﬁrst-order interactions), we would require an irrepresentable condition of the form

ˆΣNall,S ˆΣ 1 S ,S sgn(β S ) < 1,

where Nall = Call \ S . Thus we would need O(p2) inequalities to hold, rather than our O(p). Of course, we had to introduce many additional assumptions to reach this stage and no set of assumptions is uniformly stronger or weaker than the other. However, our proposed method is computationally feasible.

Modelling Interactions

7. Discussion

While several methods now exist for ﬁtting interactions in moderate-dimensional situations where p is in the order of hundreds, the problem of ﬁtting interactions when the data is of truly high dimension has received less attention.

Typically, the search for interactions must be restricted by ﬁrst ﬁtting a model using only main eﬀects, and then including interactions between those selected main eﬀects, as well as the original main eﬀects, as candidates in a ﬁnal ﬁt. This approach has the drawbacks that important main eﬀects may not be selected in the initial stage as they require certain interactions to be present in order for them to be useful for prediction. In addition, the initial model may contain too many main eﬀects when, without the relevant interactions, the model selection procedure cannot ﬁnd a good sparse approximation to the true model.

The Backtracking method proposed in this paper allows interactions to be added in a more natural gradual fashion, so there is a better chance of having a model which contains the right interactions. The method is computationally eﬃcient, and our numerical results demonstrate its eﬀectiveness for both variable selection and prediction.

From a theoretical point of view we have shown that when used with the Lasso, rather than requiring all main eﬀects involved in interactions to be highly correlated with the signal, Backtracking only needs there to exist some ordering of these variables where those early on in the order are important for predicting the response by themselves. Variables later in the order only need to be helpful for predicting the response when interactions between variables early on in the order are present.

Though in this paper, we have largely focussed on Backtracking used with the Lasso, the method is very general and can be used with many procedures that involve sparsityinducing penalty functions. These methods tend to be some of the most useful for dealing with high-dimensional data, as they can produce stable, interpretable models. Combined with Backtracking, the methods become much more ﬂexible, and it would be very interesting to explore to what extent using non-linear base procedures could yield interpretable models with predictive power comparable to black-box procedures such as Random Forests (Breiman, 2001). In addition, we believe integrating Backtracking with some of the penaltybased methods for ﬁtting interactions to moderate-dimensional data, will prove to be a fruitful direction for future research.

Acknowledgments

I am very grateful to Richard Samworth, for many helpful comments and suggestions.

Appendix A. Construction of X in Section 2

First, consider (Zi1, Zi2, Zi3) generated from a mean zero multivariate normal distribution with Var(Zij) = 1, j = 1, 2, 3, Cov(Zi1, Zi2) = 0 and Cov(Zi1, Zi3) = Cov(Zi2, Zi3) = 1/2. Independently generate Ri1 and Ri2 each of which takes only the values { 1, 1}, each with

probability 1/2. We form the ith row of the design matrix as follows:

Xi1 =Ri1 sgn(Zi1)|Zi1|1/4,

Xi2 =Ri1|Zi1|3/4,

Xi3 =Ri2 sgn(Zi2)|Zi2|1/4,

Xi4 =Ri2|Zi2|3/4,

The remaining Xij, j = 6, . . . , p are independently generated from a standard normal distribution. Note that the random signs Ri1 and Ri2 ensure that Xi5 is uncorrelated with each of Xi1, . . . , Xi4. Furthermore, the fact that Xi1Xi2 = Zi1 and Xi3Xi4 = Zi2, means that when β5 = 1

2(β7 + β8), Xi5 is uncorrelated with the response.

Appendix B. Proofs of Theorem 2 and Corollory 3

In this subsection we use many ideas from Section B of Wainwright (2009) and Section 6 of B uhlmann and van de Geer (2011a).

Lemma 4 Let S C be such that XS has full column rank and let M = C \ S. On the event ΩC,η := { 1

n XT Cε η},

the following hold:

λ > max u M

n|XT u (I PS)f | + 2η

1 ˆΣ 1 S,S ˆΣS,{u} 1

then the Lasso solution is unique and ˆβM(λ, C) = 0.

(ii) If λ is such that for some Lasso solution ˆβM(λ, C) = 0, and for v S,

|βS v | > ( ˆΣ 1 S,S)v 1(λ + η),

then for all Lasso solutions, ˆβv(λ, C) = 0.

(iii) Let λent = sup{λ : λ 0 and for some Lasso solution ˆβM(λ, C) = 0},

where we take sup = 0. If for v S,

|βS v | > max u M

n|XT u (I PS)f | + 2η

1 ˆΣ 1 S,S ˆΣS,{u} 1 + η

( ˆΣ 1 S,S)v 1,

there exists a λ > λent such that the solution ˆβ(λ, C) is unique, and for all λ (λent, λ] and all Lasso solutions ˆβ(λ , C), we have ˆβv(λ , C) = 0.

Modelling Interactions

Proof We begin by proving (i). Suppressing the dependence of ˆβ on λ and C, we can write the KKT conditions ((3), (4)) as

1 n XT C(Y XC ˆβ) = λˆτ,

where ˆτ is an element of the subdiﬀerential ˆβ 1 and thus satisﬁes

ˆτ 1, (11) ˆβv = 0 ˆτv = sgn(ˆβv). (12)

By decomposing Y as PSf +(I PS)f +ε, XC as (XS XM), and noting that XT S(I PS) = 0, we can rewrite the KKT conditions in the following way:

1 n XT S(PSf XS ˆβS) + 1

n XT Sε ˆΣS,M ˆβJ = λˆτ S, (13)

1 n XT M(PSf XS ˆβS) + 1

n XT M{(I PS)f + ε} ˆΣM,M ˆβM = λˆτ M. (14)

Now let βS be a solution to the restricted Lasso problem,

(ˆµ, βS) = arg min µ,βS

2n Y µ1 XSβS 2 + λ βS 1 .

The KKT conditions give that βS satisﬁes

1 n XT S(Y XS βS) = λ τ S, (15)

where τ S βS 1. We now claim that

(ˆβS, ˆβM) = ( βS, 0) (16)

(ˆτ S, ˆτ M) = τ S, ˆΣM,S ˆΣ 1 S,S( τ S 1

nλ 1XT Sε) + 1

nλ 1XT M{(I PS)f + ε} (17)

is the unique solution to (13), (14), (11) and (12). Indeed, as βS solves the reduced Lasso problem, we must have that (13) and (12) are satisﬁed. Multiplying (13) by XS ˆΣ 1 S,S, setting ˆβM = 0 and rearranging gives us that

PSf XS ˆβS = XS ˆΣ 1 S,S(λˆτ S 1

n XT Sε), (18)

and substituting this into (14) shows that our choice of ˆτ M satisﬁes (14). It remains to check that we have ˆτ M 1. In fact, we shall show that ˆτ M < 1. Since we are on ΩC,η and τ S 1, for u M we have

λ|ˆτ u| ˆΣ 1 S,S ˆΣS,{u} 1 λ τ S + 1

n XT Sε + 1

n XT u (I PS)f + 1

< λ ˆΣ 1 S,S ˆΣS,{u} 1 + 1

n XT u (I PS)f + 2η

where the ﬁnal inequality follows from (10). We have shown that there exists a solution, ˆβ, to the Lasso optimisation problem with ˆβM = 0. The uniqueness of this solution follows from noting that ˆτ M < 1, XS has full column rank and appealing to Lemma 1 of Wainwright (2009). For (ii), note that from (13), provided ˆβM = 0, we have that

ˆβS = βS ˆΣ 1 S,S(λˆτ S 1

But by assumption

|βS v | > ( ˆΣ 1 S,S)v 1(λ + η) ( ˆΣ 1 S,S) T

whence ˆβv = 0. (iii) follows easily from (i) and (ii).

Proof of Theorem 2. In all that follows, we work on the event ΩC ,η deﬁned in Lemma 4. Using standard bounds for the tails of Gaussian random variables and the union bound, it is easy to show that P(Ω1 ΩC ,η) 1 exp( t2/2). Let N = {1, . . . , p} \ S 1. Let T be the number of steps taken by the algorithm: this would typically be T, but may be smaller if a perfect ﬁt is reached or if p < T for example. Let Ck be the largest member of {C1, . . . , C T } satisfying Ck C . Such a Ck exists since C1 C . Now suppose for a contradiction that Ck S . Let j be such that

I(1, . . . , j 1) Ck,

with j maximal. Since I(1) = , such a j exists. Let A = Ck \ C1. Note that A J and

I(1, . . . , j 1) A C \ C1 = I(S 1).

By the entry order condition, we know that j will enter the active set before any variable in N, and before a perfect ﬁt is reached. Thus k +1 T and Ck+1 contains only additional interactions not involving any variables from N, so Ck+1 C .

Proof of Corollary 3. Let ΩC ,η be deﬁned as in Lemma 4. Also deﬁne the events

n XT N(I PS )ε η},

n ˆΣ 1 S ,S XT S ε ξ}

In all that follows, we work on the event Ω1 Ω2 ΩC ,η. As I PS is a projection,

n|Xv T (I PS )ε| η) P( 1

n|Xv T ε| η).

n ˆΣ 1 S ,S XT S ε N|S |(0, 1

nσ2 ˆΣ 1 S ,S ). Thus

P(Ω3) |S |P(|Z| ξ)

Modelling Interactions

where Z N(0, σ2/(ncmin( ˆΣS ,S ))). Note that

P(Ω1 Ω2 ΩC ,η) 1 P(Ωc C ,η) P(Ωc 1) P(Ωc 2).

Using this, it is straightforward to show that P(Ω1 Ω2 ΩC ,η) 1 3 exp( t2/2). Since we are on ΩC ,η, we can assume the existence of a k from Theorem 2. We now follow the proof of Lemma 4 taking S = S and M = Ck \ S N. The KKT conditions become

ˆΣS ,S (β S ˆβS ) + 1

n XT S ε ˆΣS ,M ˆβM = λˆτ S , (19) ˆΣM,S (β S ˆβS ) + 1

n XT Mε ˆΣM,M ˆβM = λˆτ M, (20)

with ˆτ also satisfying (11) and (12) as before. Now let λ be such that

1 ˆΣM,S ˆΣ 1 S ,S sgn(β S ) < λ < min v S

sgn(β S )T ( ˆΣ 1 S ,S )v 1 (|β v| ξ) .

It is straightforward to check that

(ˆβS , ˆβM) = (β S λ ˆΣ 1 S ,S sgn(β S ) + 1

n ˆΣ 1 S ,S XT S ε, 0)

(ˆτ S , ˆτ M) = sgn(β S ), ˆΣM,S ˆΣ 1 S ,S sgn(β S ) + 1

nλ 1XT M(I PS )ε

is the unique solution to (19), (20), (11) and (12).

Appendix C. Proof of Theorem 1

In the following, we make use of notation deﬁned in Section 6.2. In addition, for convenience we write S = S 1, M = S J . Also, we will write main eﬀects variables {j} as simply j. For any matrix M, M will denote maxjk |Mjk|. First we collect together various results concerning ˆΣC ,C .

Lemma 5 Consider the setup of Theorem 1. Let En and Varn denote empirical expectation and variance with respect to Z so that, for example Enzj = Pn i=1 Zij/n.

(i) Let D be the diagonal matrix indexed by C used to scale transformations of Z in order to create XC i.e. with entries such that D2 jj = Varn(zj) and D2 vv = Varn(zj Enzj)(zk Enzk) when v = {j, k}. Then

max j C1 |D2 jj 1| = OP ( p

log(p)/n) (21)

max {j,k} M |D2 {j,k},{j,k} 1 Σ2 jk| = OP ( p

log(s1)n 1/4) (22)

1 n XT J XS = OP ( p

log(s1)n 1/3) (23)

cmin( ˆΣS,S) c s1OP ( p

log(s1)/n) (24)

cmin( ˆΣM,M) c2 + s2 1OP ( p

log(s1)n 1/4) (25)

cmax( ˆΣJ ,J ) 2c 2 + s2 1OP ( p

log(s1)n 1/4). (26)

Proof We use bounds on the tails of products of normal random variables from Hao and Zhang (2014) (equation B.9). We have

max j,k |Covn(zj, zk) Σjk| = max j,k |En(zjzk) Enzj Enzk Σjk|

max j,k,l,m S |Covn (zj Enzj)(zk Enzk), (zl Enzl)(zm Enzm) ΣjlΣkm ΣjmΣkl|

= max j,k,l,m S |En(zjzkzlzm) En(zjzk)En(zlzk) ΣjlΣkm ΣjmΣkl| + OP ( p

log(s1)n 1/4).

Now we consider (ii). We have

1 n XT J XS max v J D 1 vv max k S D 1 kk max j,k,l S |Covn (zj Enzj)(zk Enzk), zl |

log(s1)n 1/3),

the rate being driven by the size of En(zjzkzl). Also

cmin( ˆΣS,S) = min τ Rs1: τ 2=1 τ{ΣS,S (ΣS,S ˆΣS,S)}τ

cmin(ΣS,S) max τ Rs1: τ 2=1 τ 2 1 ΣS,S ˆΣS,S

= c s1OP ( p

log(s1)/n).

Now let Σ be a matrix with entries indexed by M with

Σuv = ΣjlΣkm + ΣjmΣkl

when u = {j, k} and v = {l, m}. Lemma A.4 of Hao and Zhang (2014) shows that cmin( Σ) 2cmin(ΣS,S)2 and cmax( Σ) 2cmax(ΣS,S)2. Thus we have

cmin( ˆΣM,M) = min τ R|M|: DM,Mτ 2=1 τDM,M ˆΣM,MDM,Mτ

DM,M 1 cmin(DM,M ˆΣM,MDM,M)

{1 + OP ( p

log(s1)n 1/4)}[c2 s2 1{ Σ DM,M ˆΣM,MDM,M) + OP ( p

log(s1)n 1/3)}]

c2 + s2 1OP ( p

log(s1)n 1/4).

cmax( ˆΣJ ,J ) = max τ R|J |: DJ ,J τ 2=1 τDJ ,J ˆΣJ ,J DJ ,J τ

log(s1)n 1/4)}cmax(DJ ,J ˆΣJ ,J DJ ,J )

log(s1)n 1/4)}{2c 2 + s2 1 Σ DJ ,J ˆΣJ ,J DJ ,J ) }

2c 2 + s2 1OP ( p

log(s1)n 1/4).

Modelling Interactions

Lemma 6 Working with the assumptions of Theorem 1, we have

max A J βS A S β S OP ( q

s3 1 log(s1)n 1/3).

Proof For A J let A R|S A| with A S = βS A S β S and A A = βS A A . Deﬁne g = XS 2β S 2. Note that f = XSβ S + g ,

so A = (XT S AXS A) 1XS Ag .

First we bound A A 2 2 in terms of g 2 2. We have that

XS A A 2 2 = XS A S 2 2 + 2 A S T XT SXA A A + XA A A 2 2 g 2 2.

n XT SXS) A S 2 2 2 p

n XT SXA A A 2 A S 2+cmin( 1

n XT AXA) A A 2 2 1

Thinking of this as a quadratic in A S 2 and considering the discriminant yields

n XT SXS) g 2 2 cmin( 1

n XT SXS)cmin( 1

n XT AXA) 1

n XT SXA 2 |A||S|.

Thus by Lemma 5 (ii) and condition A2, max A J A A 2 = 1 n g 2OP (1). But 1 n g 2 q

cmax( ˆΣJ ,J ) β S 2 2 = OP (1)

by Lemma 5 (ii) and A5, so max A J A A 2 = OP (1). Next observe that XS A A g 2 2 XA A A g 2 2,

A S 2 2cmin( 1

n XT SXS) 1

n XS A S 2 2

n A S T XT S(XA A A g )|

|A||S| A S 2 1

n XT SXA A A 2 + 2 A S 2 1

A S 2{cmin( 1

n XT SXS)} 1( p

n XT SXA A A 2 + 1

n XT Sg 2),

max A J A S 2{cmin( 1

n XT SXS)} 1( p

n XT SXJ OP (1) + 1

n XT Sg 2).

n XT Sg 2 s1 1

n XT SXS 2 β S 2 1

log(s1)n 1/3).

Thus max A J A S OP ( q

s3 1 log(s1)n 1/3).

Proof of Theorem 1. In view of Theorem 2 and its proof, it is enough to show that with probability tending to 1, we have

max A J sup τ Rs1 ˆΣN,S A ˆΣ 1 S A,S Aτ < 1, (27)

min j I min A J |βS A j | > max A J max j N

n|XT j (I PS A)f | + 2 1

1 ˆΣ 1 S A,S A ˆΣS A,j 1 + 1

cmin( ˆΣM,M) .

First note that for j N, Zj = ZSΣ 1 S,SΣS,j + Ej where Ej is independent of ZS and Ej Nn(0, (1 Σj,SΣ 1 S,SΣS,j)I). Thus

Xj Djj = XSDS,SΣ 1 S,SΣS,j + Ej 1 Ej,

max A J (XT S AXS A) 1XT S AXj 1 D 1 kk DS,SΣ 1 S,SΣS,j 1 + max A J ˆΣ 1 S A,S A 1 n XT S AEj 1.

Now the second term above is at most

max A J max τ R|S A|: τ 2 1 ˆΣ 1 S A,S Aτ 1 1

n XT MEj 2.

max A J max τ R|S A|: τ 1 ˆΣ 1 S A,S Aτ 1

cmin( ˆΣM,M)

c2 + s2 1OP ( p

log(s1)n 1/4) .

Also since for v M and j N, XT v Ej/n N(0, 1) we have

n XT MEj 2 2 |M|OP (log(p)/n).

max A J sup τ Rs1 ˆΣN,S A ˆΣ 1 S A,S Aτ (1 + o P (1))δ + s2 1o P (1) c2 + o P (1).

Modelling Interactions

This shows that (27) is satisﬁed with probability tending to 1. Next

max j N max A J 1 n|XT j (I PS A)f | = max j N max A J D 1 jj n |ET j (I PS A)XAβ A|.

Since ET j (I PS A)XAβ A/n N(0, (I PS A)XAβ A 2 2/n2) we have

max j N max A J 1 n|XT j (I PS A)f |

n 1 n XS 2β S 2 2OP (1).

By (26) we have

1 n XS 2β S 2 2 {2c 2 + s2 1 p

log(s1)n 1/4OP (1)} βS 2 2.

Now using Lemma 6 we see that the diﬀerence between the LHS and RHS of (28) is at least

min j I |β j | OP q

s3 1 log(s1)n 1/3

( s1 + log p + σ log p)/ n

1 δ + o P (1) + σ

Thus A4 ensures that (28) holds with probability tending to 1.

A. Asuncion and D. J. Newman. UCI Machine Learning Repository, 2007. URL http: //archive.ics.uci.edu/ml.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. Statistical Science, 27:450 468, 2012a.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4:1 106, 2012b.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Hierarchical selection of variables in sparse high-dimensional regression. IMS Collections, 6:56 69, 2010.

J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. Annals of Statistics, 41(3):1111 1141, 2013.

B. Bollob as. Combinatorics. Cambridge University Press, 1986.

L. Breiman. Random Forests. Machine Learning, 45:5 32, 2001.

P. B uhlmann and S. van de Geer. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360 1392, 2011a.

P. B uhlmann and S. van de Geer. Statistics for high-dimensional data. Springer, 2011b.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least Angle Regression. Annals of Statistics, 32:407 451, 2004.

M. Fanty and R. Cole. Spoken letter recognition. In R.P. Lippman, J. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 13, San Mateo, CA, 1991. Morgan Kaufmann.

J. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19:1 67, 1991.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1 22, 2010.

N. Hao and H. H. Zhang. Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 109(507):1285 1301, 2014.

A. Haris, D. Witten, and N. Simon. Convex modeling of interactions with strong heredity. Journal of Computational and Graphical Statistics, (just-accepted):1 35, 2015.

T. Hastie, R. Tibshirani, F. Leisch, K. Hornik, and B. D. Ripley. mda: Mixture and ﬂexible discriminant analysis, 2013. URL http://CRAN.R-project.org/package=mda. R package version 0.4-4.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal Methods for Hierarchical Sparse Coding. Journal of Machine Learning Research, 12:2297 2334, 2011.

M. Lim and T. Hastie. Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3):627 654, 2015.

Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 35:2272 2297, 2006.

L. Meier, S. van de Geer, and P. B uhlmann. High-dimensional additive modelling. Annals of Statistics, 37:3779 3821, 2009.

N. Meinshausen and P. B uhlmann. High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436 1462, 2006.

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2005. URL http: //www.R-project.org. ISBN 3-900051-07-0.

P. Radchenko and G. James. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association, 105:1541 1553, 2010.

P. Ravikumar, J. Laﬀerty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B, 71:1009 1030, 2009.

R. D. Shah and N. Meinshausen. Random intersection trees. The Journal of Machine Learning Research, 15(1):629 654, 2014.

Modelling Interactions

R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267 288, 1996.

B. Turlach. Discussion of Least angle regression . Annals of Statistics, 32:481 490, 2004.

M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. IEEE Transactions on Information Theory, 55:2183 2202, 2009.

J. Wu, B. Devlin, S. Ringquist, M. Trucco, and K. Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34: 275 285, 2010.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49 67, 2006.

M. Yuan, V. R. Joseph, and Y. Lin. An eﬃcient variable selection approach for analyzing designed experiments. Technometrics, 49:430 439, 2007.

M. Yuan, R. Joseph, and H. Zou. Structured variable selection and estimation. Annals of Applied Statistics, 3:1738 1757, 2009.

P. Zhao and B. Yu. On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7:2541 2563, 2006.

P. Zhao, G. Rocha, and B. Yu. The composite absolute families penalty for grouped and hierarchical variable selection. Annals of Statistics, 37:3648 3497, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418 1429, 2006.