# model_selection_via_the_vc_dimension__cb42ee15.pdf

Journal of Machine Learning Research 20 (2019) 1-26 Submitted 11/17; Revised 3/19; Published 5/19

Model Selection via the VC Dimension

Merlin Mpoudeu merlin.mpoudeu@huskers.unl.edu Bank of America Atlanta, GA, USA

Bertrand Clarke bclarke3@unl.edu Department of Statistics University of Nebraska-Lincoln Lincoln, NE 68503, USA

Editor: John Shawe-Taylor

We derive an objective function that can be optimized to give an estimator for the Vapnik Chervonenkis dimension for use in model selection in regression problems. We verify our estimator is consistent. Then, we verify it performs well compared to seven other model selection techniques. We do this for a variety of types of data sets. Keywords: Vapnik-Chervonenkis dimension, model selection, Bayesian information criterion, sparsity methods, empirical risk minimization, multi-type data.

1. Complexity and Model Selection

Model selection is often the ﬁrst problem that must be addressed when analyzing data. In M-closed problems, see Bernardo and Smith (2000), the analyst posits a list of models and assumes one of them is true. In such cases, model selection is any procedure that uses data to identify one of the models on the model list. There is a vast literature on model selection in this context including information based methods such as the Aikaikie Information Criterion (AIC), the Bayes information criterion (BIC), residual based methods such as Mallows Cp or branch and bound, and code length methods such as the two-stage coding proposed by Barron and Cover (1991). We also have computational search methods such as simulated annealing and genetic algorithms. In addition, cross-validation (CV) is often used with non-parametric methods such as recursive partitioning, neural networks (aka deep learning) and kernel methods. A less well developed approach to model selection is via complexity as assessed by the Vapnik-Chervonenkis (VC) dimension, here denoted by d V C. Its earliest usage seems to be in Vapnik and Chervonenkis (1968). A translation into English was published as Vapnik and Chervonenkis (1971). Although, the VC dimension goes back to 1968, it wasn t until Vapnik et al. (1994) that a method for estimating d V C was proposed in the classiﬁcation context. Speciﬁcally, given a collection C of classiﬁers, Vapnik et al. (1994) tried to estimate the VC dimension of C by deriving an objective function based on the expected value of the maximum diﬀerence between two empirical evaluations of a single loss function, here denoted by . The two empirical values come from dividing a given data set into a ﬁrst and second part. The

c 2019 Merlin Mpoudeu and Bertrand Clarke.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-669.html.

MPOUDEU and CLARKE

objective function proposed by Vapnik et al. (1994) depends on d V C, the sample size n, and several constants that had to be determined. Using their objective function, they derived an estimator ˆd V C for d V C given a class C of classiﬁers. This algorithm treated possible sample sizes as design points n1, n2, , n L and requires one level of bootstrapping. Despite the remarkable contribution of Vapnik et al. (1994), the objective function was over-complex and the algorithm did not give a tight enough bound on . Later, Vapnik and his collaborators suggested a ﬁx to tighten the bound on . We do not use this here; it is unclear if this ﬁx will work in classiﬁcation, let alone regression.

Choosing the design points is a nontrivial source of variability in the estimate of d V C. So, Shao et al. (2000) proposed an algorithm, based on extensive simulations, to generate optimal values of n1, n2, , n L, given L. They argued that non-uniform values of the nl s gave better results than the uniform nl s used in Vapnik et al. (1994).

More recently, in a pioneering paper that deserves more recognition that it has received, Mc Donald et al. (2011) established the consistency of the Vapnik (1998) estimator ˆd V C for d V C in the classiﬁcation context.

The main reason the estimator for d V C of Vapnik et al. (1994) did not become more widely used, despite the result in Mc Donald et al. (2011), is, we suggest, that it was too unstable because the objective function did not bound tightly enough in terms of d V C. In addition, the form of the objective function in Vapnik et al. (1994) is more complicated and less well-motivated than our result Theorem 2. The reason is that the derivation in Vapnik et al. (1994) uses conditional probabilities, one of which goes to zero quite quickly (with n). So, it contributes negligibly to the upper bound. Our derivation ignores the conditioning and bounds a CV form of that is typically larger than that used in Vapnik et al. (1994).

Our consistency proof is a simpliﬁcation of the proof of the main result Mc Donald et al. (2011). Accordingly, we obtain a slower rate of consistency, but the probability of correct model selection still goes to one.

Our overall strategy is to derive an objective function for estimating d V C in the regression setting that provides, we think, a tighter bound on a modiﬁed form of . To convert from classiﬁcation to regression, we discretize the loss used for regression into m intervals (the case m = 1 would then apply to classiﬁcation). To get a tighter bound, we change the form of from what Vapnik et al. (1994) used and we optimize over the leading factor in our upper bound. To use our estimator, we use an extra layer of bootstrapping so the quantity we empirically optimize represents the quantity we derive theoretically more accurately. The extra layer of bootstrapping stabilizes our estimator of d V C and appears to reduce its dependency on the nl s. If the models are nested in order of increasing VC dimension, it is straightforward to choose the model with VC dimension closest to our estimate ˆd V C. Otherwise, we can convert a non-nested problem to the nested case by ordering the inclusion of the covariates using a shrinkage method such as the smoothly clipped absolute deviation (SCAD, Fan and Li (2001)), or correlation (see Fan and Lv (2008)), and use our ˆd V C as before. Even when we force a model list to be nested, our model selection method performs well compared to a range of competitors including Vapnik et al. (1994) s original method, two forms of penalized empirical risk minimization (denoted \ PERM 1 and \ PERM 2), AIC, BIC, CV (10-fold), SCAD, and adaptive LASSO (ALASSO, Zou (2006)). Our general ﬁndings indicate that in realistic settings, model selection via estimated VC dimension,

MODEL SELECTION VIA THE VC DIMENSION

when properly done, is fully competitive with existing methods and, unlike them, rarely gives abberant results. This manuscript is structured as follows. In Sec. 2 we present the main theory justifying our estimator. In Subsec. 2.1 we discretize bounded loss functions so that upper bounds for the distinct regions involved in the deﬁnition of can be derived and in Subsec. 2.2 we deﬁne our estimator of the VC dimension and give an algorithm for how to compute it. In Sec. 3 we use Mc Donald et al. (2011) s consistency theorem to motivate our consistency theorem for ˆd V C. In Sec. 4 we present our studies using simulated, benchmark, and real data. We compare our method for model selection to AIC, BIC, CV, \ PERM 1, and \ PERM 2. In this context, we suggest criteria to guide the selection of design points. Our comparisons also include simplifying non-nested model lists by using correlation, SCAD, and ALASSO. In Sec. 5 we discuss our overall ﬁndings.

2. Deriving an optimality criterion for estimating VC dimension

This section concerns , the expected supremal diﬀerence between two evaluations of a bounded loss function, formally deﬁned in (18) and (19). These bounds will enable us to derive an estimator for the VC dimension. In Sec. 2.1, we present our alternative version of the Vapnik et al. (1994) bounds and in Sec. 2.2, we present our estimator of d V C.

2.1. Extension of the Vapnik et al. (1994) bounds to regression

Let Z = (X, Y ) be a random variable with outcomes z = (x, y) assuming values on Z = X Y. The ﬁrst entry, X = x, is regarded as an explanatory variable leading to Y = y. Let P M(Z) be the distribution of Z, where M(Z) is the collection of probability measures on Z, and let Z1:2n = (Z1, . . . , Zn, Zn+1, . . . , . . . , Z2n) be a data set of size 2m of independently and identically distributed (IID) copies of Z. Write D1 = {Z1, . . . , Zn} for the ﬁrst half and D2 = {Zn+1, . . . , Z2n} for the second half. Writing Zi = (Xi, Yi) for i = 1 . . . 2n, let

Q (Zi, α) = L (Yi, f (Xi, α)) ,

for a bounded real valued loss function L and α Λ. We assume that Λ is a compact set in a ﬁnite dimensional real space, that the interior of Λ, Int(Λ), is non-void and convex, and that Λ = Int(Λ). Also, we assume the continuous functions f( | α) are parametrized by α continuously and one-to-one. Thus, in our examples, Λ will be the parameter space for a class of regression functions f( | α). For ease of exposition we assume L, and hence Q, are also continuous. For a ﬁxed α Λ, discretize Q(z, α) using m disjoint intervals (with union [0, B)):

Qm (z, α) =

2m I Q(Z, α) Im j . (1)

The discretization is based on the uniform left-closed, right-open partition of [0, B) into m subintervals, here denoted Im j and the numbers ((2j +1)B)/(2m) are the midpoints. In (1), I[ ] is an indicator function taking value 1 when its argument is true and value 0 when it is false. We use losses of the form (1) to deﬁne a cross-validation form for .

MPOUDEU and CLARKE

Start by letting α1, α2 Λ, with α1 = α2, and let

ν(D2, α1) = 1

i=1 Q (Zn+i, α1) and ν(D1, α2) = 1

i=1 Q (Zi, α2) . (2)

These are the empirical risks of model α1 on the second half of the data and of model α2 on the ﬁrst half of the sample, respectively. Observe that the empirical counts of the data points whose losses land in Im j are

Nm j (D2, α1) =

i=1 I [Q(Zn+i, α1) Im j ] and Nm j (D1, α2) =

i=1 I [Q(Zi, α2) Im j ]. (3)

This means we are counting the errors of the α1 model on the second half of the data and the erors of the α2 model on the ﬁrst half of the data. This begins the set up of the cross-validation form of the error that we use and leads to the following expressions for the empirical losses of the discretized loss functions:

νm D2, α1 = 1

j=0 Nm j (D2, α1)(2j + 1)B

and νm D1, α2 = 1

j=0 Nm j (D1, α2)(2j + 1)B

It is seen that the expressions in (4) are formed from the counts within each of the intervals. Let these be denoted by

νm j (D2, α1) = 1

n Nm j (D2, α1)(2j + 1)B

2m and νm j (D1, α2) = 1

n Nm j (D1, α2)(2j + 1)B

(5) The ﬁrst step in bounding is to bound the probability of the bad set where νm(D2, α1) and νm(D1, α2) are not close. Let ϵ > 0 and, using the discretization into m intervals, deﬁne the set Aϵ by the union:

m=0 Aϵ,m, (6)

Z1:2n sup α1,α2 Λ [νm(Zn+1:2n, α1) νm(Z1:n, α2)] ϵ

The only way a Z1:2n = z1:2n can be in Aϵ,m is that at least one value of j satisﬁes

sup α1, α2 Λ (νm j (zn+1:2n, α1) νm j (z1:n, α2)) ϵ

Since Aϵ is deﬁned on the entire range of our loss function, and we want to partition the range into m disjoint intervals, write

Aϵ,m {Z1:2n j sup α1, α2 Λ (νm j (Zn+1:2n, α1) νm j (Z1:n, α2)) ϵ

j=0 Aϵ,m,j,

MODEL SELECTION VIA THE VC DIMENSION

where Aϵ,m,j = {Z1:2n sup α1, α2 Λ (νm j (Zn+1:2n, α1) νm j (Z1:n, α2)) ϵ

Next, ﬁx any value j {0, 1, . . . , m 1}. For any ﬁxed z1:2n, and any given α1, α2 Λ, deﬁne the vector of length 2n

(Qm j (zn+1, α1), . . . , Qm j (z2n, α1), Qm j (z1, α2), . . . , Qm j (zn, α2)), (8)

where Qm j (z, α) = I(Q(z, α) Im j ). Now deﬁne (α1, α2) (α 1, α 2) when the corresponding 2n-tuples are equal (for the given z1:2n). It is seen that is an equivalence relation on Λ Λ and therefore partitions Λ Λ into disjoint equivalence classes. Let Kj = Kj(z1:2n) be the number of equivalence classes for given j and z1:2n and for given (α1, α2) Λ Λ, write [(α1, α2)] for the equivalence class that contains it. Now, for k = 1 . . . , Kj let

α 1jk, α 2jk = arg sup α1,α2 (Λ Λ)k

νm j (zn+1:2n, α1) νm j (z1:n, α2) , (9)

where (Λ Λ)k is the k-th equivalence class. Now, SKj k=1 h (α 1jk, α 2jk) i = Λ Λ and h (α 1jk, α 2jk) i h (α 1jk, α 2jk ) i = φ unless k = k .

Any permutation π of {1, . . . , 2n} induces a permutation map Tπ : Z2n Z2n which acts by shuﬄing coordinates according to the indices permuted by π. There are (2n)! such maps that can be denoted Ti for i = 1, . . . , 2n. The IID assumption implies that the distribution of any Ti(Z1:2n) is the same as the distribution of Z1:2n. So, if any function f : Z2n R satisﬁes the symmetry condition f(Ti(z1:2n)) = f(z1:2n) and is integrable, its integral satisﬁes Z

Z2n f (Z1:2n) d P 2n (z1:2n) = Z

Z2n f (Ti Z1:2n) d P 2n (z1:2n) , (10)

in which d P 2n(z1:2n) = d P(z1) d P(z2n) and P M(Z). One of the quantities that will be essential to getting a tight enough bound on P 2n(Aϵ) is the annealed entropy HΛ ann(2n). Given a sample, say z1:2n, let NΛ(z1:2n) be the number of diﬀerent separations of z1:2n by a given set of functions. In the proof we will choose all the functions in (8) for a given j. Since NΛ(z1:2n) 22n and the NΛ(z1:2n) s are measurable, ENΛ(Z1:2n) exists. The annealed entropy is the natural logarithm (base e) of this, HΛ ann(2n) = log ENΛ(Z1:2n). As is customary, E means expectation in the true distribution, P M(Z). Our ﬁrst main result is similar to the corrresponding result in Vapnik et al. (1994). However, there are numerous diﬀerences are in the details. For instance, our equivalence class is deﬁned on Λ Λ, we use a cross-validation form of the error, we discretized the loss function, and our result leads to Theorem 2 that only has one term, whereas the corresponding result in Vapnik et al. (1994) has three terms.

Theorem 1 : Let ϵ 0 and m N. If d V C = V C ({Q ( , α) : α Λ}) is ﬁnite, then

P 2n (Aϵ) 2m 2ne

d V C exp nϵ2

MPOUDEU and CLARKE

Remark: The technique used to prove (11) is similar to the proof of Theorem 4.1 in Vapnik (1998) giving bounds for the uniform convergence of the empirical risk. The hypotheses of Theorem 4.1 in Vapnik (1998) require only the existence of the key quantities e.g the annealed entropy, and the growth function. Our only extra condition is that d V C be ﬁnite. Proof : Let m N and j {0, 1, . . . , m 1}. For any given z1:2n, α1, and α2 write m j (z1:2n, α1, α2) = νm j (zn+1:2n, α1) νm j (z1:n, α2). Also, denote

α 1j, α 2j = arg sup (α1,α2) Λ Λ m j (z1:2n, α1, α2)

= arg sup (α1,α2) Λ Λ

νm j (zn+1:2n, α1) νm j (z1:n, α2) (12)

It is seen that (α 1j, α 2j) are estimated using D2 and D1 respectively; this reversal of the estimators with respect to the data is the essence of the cross-validation form of the error that we use. Using some manipulations, we have by dropping the superscript 2n on P:

j=0 P (Aϵ,m,j)

z1:2n : sup α1, α2 Λ

νm j (zn+1:2n, α1) νm j (z1:n, α2) ϵ

z1:2n : sup α1, α2 Λ m j (z1:2n, α1, α2) ϵ

j=0 P n z1:2n : m j (z1:2n, α 1j, α 2j) ϵ

Now, for each Ti write Ti(z1:2n) for the permuted sample and correspondingly write D1 Ti(z1:2n) and D2 Ti(z1:2n) for the ﬁrst and second halves of the permuted sample. This implies

m j (Ti(z1:2n), α1, α2) = νm j (D2 Ti(z1:2n), α1) νm j (D1 Ti(z1:2n), α2).

By symmetry of this function and (10) we can write

P n z1:2n : m j (z1:2n, α 1j, α 2j) ϵ

o = 1 (2n)!

i=1 P n z1:2n : m j (Ti(z1:2n), α 1j, α 2j) ϵ

and therefore P(Aϵ) is bounded from above by

i=1 P n z1:2n : m j (Ti(z1:2n), α 1j, α 2j) ϵ

MODEL SELECTION VIA THE VC DIMENSION

Using the properties of the equivalence relation and letting Z denote a dummy variable with the same distribution as Z, we have that for each ﬁxed i, j and z1:2n

I{Z 1:2n: m j (Ti Z 1:2n,α 1j,α 2j) ϵ

m}( ) I{Z 1:2n: m j (Ti Z 1:2n,α 1j1,α 2j1) ϵ

+ + I Z 2n: m j

Ti Z 1:2n,α 1j Kj(z1:2n)α 2j Kj(z1:2n)

Kj(z1:2n) X

k=1 I{Z 1:2n: m j (Ti Z 1:2n,α 1jkα 2jk) ϵ

m} ( ) . (14)

The inequality in (14) follows because each z 1:2n making the indicator function on the left side 1, must make at least one of the indicators on the right 1. This follows from the fact that (α 1j, α 2j) is a global maximum and each (α 1jk, α 2jk) is a local maximum for n equivalence class, see (12) and (9). Note that in (14) a Ti appears. Formally, this necessitates choosing (α 1j, α 2j) and each (α 1jk, α 2jk) for given k to be dependent on the i in Ti also; this extra step is suppressed in the notation since i has been dropped for ease of exposition. Now, using (14), (13) is bounded by

P (Aϵ) 1 (2n)!

Z Kj(z1:2n) X

k=1 I{Z 1:2n: m j (Ti Z 1:2n,α 1jkα 2jk) ϵ

m} (z1:2n) d P (z1:2n)

Kj(z1:2n) X

i=1 I{Z 1:2n: m j (Ti Z 1:2n,α 1jk,α 2jk) ϵ

d P (z1:2n) .

To bound the summation in square brackets, we follow Vapnik (1998), Chap. 4. Let

Aϵ,m,j,k = n Z1:2n : m j Ti Z1:2n, α 1jkα 2jk ϵ

for ﬁxed j and each k, where h (α 1jk, α 2jk) i = (Λ Λ)k. Now, the summation in square brackets is the fraction of the number of the (2n)! permutations Ti of Z1:2n for which Aϵ,m,j,k is closed under Ti for any ﬁxed equivalence class (Λ Λ)k. As proved in Vapnik (1998) Sec. 4.13, it equals

b ℓ 2n b n ℓ

where b = b(z1:2n) is the number of zi s in z1:2n that satisfy Q(zi, α 1jk) = 1 (for i = 1, . . . , n) or Q(zi, α 2jk) = 1 (for i = n+1, . . . , 2n), see Vapnik (1998), p. 136 or 143. The summation is over ℓ s in the set ℓ: ℓ n b ℓ

From Sec. 4.13 in Vapnik (1998), we have Γk 2 exp nϵ2

m2 uniformly in k.

MPOUDEU and CLARKE

So, using this in the last bound on P(Aϵ) gives that P (Aϵ,m) is bounded from above by

Kj(z1:2n) X

k=1 2 exp nϵ2

d P(z1:2n) = 2 exp nϵ2

Kj(z1:2n) X

k=1 d P(z1:2n)

= 2 exp nϵ2

Z Kj(z1:2n) X

k=1 d P(z1:2n) = 2 exp nϵ2

Z Kj(z1:2n)d P(z1:2n)

= 2 exp nϵ2

j=0 E (Kj(Z1:2n)) . (15)

Since Kj(z1:2n) is the number of equivalence classes given α1, α2, j, and z1:2n and NΛ

is the number of separations of z1:2n given by the functions in (8) i.e., over all α1, α2 Λ, we have that Kj(z1:2n) NΛ(z1:2n).

The reasoning is as follows and simply makes the reasoning behind the statement at the top of p. 136 in Vapnik (1998) explicit. Recall Kj(z1:2n) is the number of equivalence classes in Λ Λ for ﬁxed j and z1:2n. If (α1, α2) and (α 1, α 2) are in diﬀerent equivalence classes then

u Qm j (zu, α1) = Qm j (zu, α 1) or Qm j (zu, α2) = Qm j (zu, α 2).

Without loss of generality, suppose the ﬁrst inequality holds for some u. Then the two functions Qm j (zu, α1), Qm j (zu, α 1) must assume values (0, 1) or (1, 0). Again, without loss of generality suppose the ﬁrst holds. Then, these two functions can separate z1:2n into two disjoint subsets {zv Qm j (zv, α1) = 0} and {zv Qm j (zv, α1) = 0} and this is one of the separations counted by NΛ(z1:2n). Taking into account all such separations we have

EKj(z1:2n) ENΛ(z1:2n). (16)

The growth function is deﬁned to be

GΛ(2n) = log sup z1,z2,...,z2n NΛ (z1, . . . , z2n) ENΛ(Z1:2n) = Hann(2n).

So it is easy to see that HΛ ann(2n) GΛ(2n). Now, Theorem 4.3 from Vapnik (1998) p.145 gives that

G(2n) d V C log 2ne

E NΛ(z1:2n) 2ne

d V C . (17)

Using (17) and (16) m times in (15) gives the theorem.

Next, we use use Theorem 1 to identify an objective function that can be minimized to give an estimator for d V C. Formally, let

sup α1,α2 Λ

νm(D2, α1) νm(D1, α2) !

MODEL SELECTION VIA THE VC DIMENSION

sup α1,α2 Λ

ν D2, α1 ν D1, α2 !

Obviously, m provided that m, n, and d V C at appropriate rates and the argument of m satisﬁes appropriate uniform integrability conditions. In fact, we do not use m . For our purpose, the following bounds are suﬃcient. They are important to our methodology because they bound the expected maximum diﬀerence between two values of the empirical losses by an expression that can be used to estimate the VC dimension.

Theorem 2 :

1. If d V C < , we have

n log 2m3 2ne d V C

2. If d V C < , and

P {Q (z, α) c}dc

where 1 < p 2 is some ﬁxed parameter, we have

Dp(α )22.5+ 1

d V C log ne d V C

p + 16Dp(α )22.5+ 1

d V C log ne d V C

3. Assume that d V C , n d V C , m , log (m) = o(n), and

P {Q(z, α) c}dc

where p = 2. Then we have that

min (1, 8Dp(α ))

Proof : Proofs of the three clause of Theorem 2 can be found in Mpoudeu (2017) in Appendices A1 A3. They rest on using the integral of probabilities identity and then bounding the probabilities as in Theorem 1.

We can also use Theorem 1 to obtain an upper bound on the unknown true risk via the following propositions. Let Q (αk) be the true unknown risk at αk and Qemp (αk) be the empirical risk at αk. Assume that K values of αk have been ﬁxed. These correspond to a set of points in the parameter space; in our examples below we use estimates. In eﬀect, we are assuming that given an estimate, there is an αk so close to it that the approximation error is negligible.

MPOUDEU and CLARKE

Proposition 3 : For any η (0, 1), with probability at least 1 η, the inequality

Q (αk) Qemp (αk) + m

holds simultaneously for all functions Q (z, αk), k = 1, 2, , K.

Remark: This inequality follows from the additive Chernoﬀbounds (see, e.g., Vapnik (1998), formulae (4.4) and (4.5)) and suggests that the best model will be the one that minimizes the RHS of (23). The use of (23) in model selection as a form of risk minimization because as d V C increases the second term on the right increases. This limits the size of d V C; we denoted this technique by PERM1 since a penalized empirical risk is being minimized. Proof : To obtain inequality (23), we equate the RHS of Theorem 1 to a positive number 0 η 1. Thus:

d V C exp nϵ2

Solving for ϵ gives

Proposition 3 can be obtained from the additive Chernoﬀbounds, expression 4.4 in Vapnik (1998) as follows

Q (αk) Qemp (αk) + ϵ. (25)

Using (24) in inequality (25), completes the proof.

Parallel to Prop. 3, we have the following for the multiplicative case.

Proposition 4 : For any η (0, 1), with probability 1 η, the inequality

Q (αk) Qemp (αk) + m2

v u u u t1 + 4n Qemp (αk)

η 2ne d V C

(26) holds simultaneously for all K functions in the set Q (z, αk), k = 1, 2, . . . , K.

Remark: This follows from the multiplicative Chernoﬀbounds (see e.g., Vapnik (1998) formulae (4.17) and (4.18)) and suggests that the best model will be the one that minimizes the right hand side (RHS) of (26). Analogous to (23) we refer to the use of (26) in model selection as PERM2.

MODEL SELECTION VIA THE VC DIMENSION

Proof : Let ϵ, η > 0. Then, inequality (4.18) in Vapnik (1998) gives, with probability at least 1 η, that

Q (αk) Qemp (αk) p

Routine algebraic manipulations and completing the square give

Q (αk) 0.5 ϵ2 + 2Qemp (αk) 2 0.25 ϵ2 + 2Qemp (αk) 2 Q2 emp (αk) .

Taking the square root on both sides and re-arranging gives

Q (αk) Qemp (αk) + 0.5ϵ2

1 + 4Qemp (αk)

Using (24) in the last inequality completes the proof of the Proposition.

More details on the use of Propositions 3 and 4 can be found in Vapnik (1998) and Mpoudeu (2017).

2.2. An Estimator of the VC Dimension

The upper bound from Theorem 2 can be written as

Φd V C(n) = min (1, 8Dp(α ))

This expression is meaningfully diﬀerent from the form derived in Vapnik et al. (1994) and studied in Mc Donald et al. (2011). Moreover, although min (1, 8Dp(α )) does not aﬀect the optimization, it might not be the best constant for the inequality in (22). So, we replace it with an arbitrary constant c over which we optimize to make our upper bound as tight as possible. In our computations, we let c vary from 0.01 to 100 in steps of size 0.01. However, we have observed in practice that the best value of ˆc is usually between 1 and 8. The technique that we use to estimate ˆd V C is also diﬀerent from that in Vapnik et al. (1994). Indeed, our Algorithm #1 below accurately encapsulates the way the LHS of (22) is formed unlike the algorithm in Vapnik et al. (1994). In particular, we use two bootstrapping procedures, one as a proxy for calculating expectations and the second as a proxy for calculating a maximum. Moreover, we split the data set into two subsets. Using the ﬁrst data set, we ﬁt model I and using the second we ﬁt model II. To explain how we ﬁnd our estimate of the RHS of (22) from Theorem 2, we start by replacing the sample size n in (27) with a speciﬁed value of design point, so that the only unknown is d V C. Thus, formally, we replace (27) by

Φ d V C(nl) = bc

nl log 2nle

MPOUDEU and CLARKE

where ˆc is the optimal data driven constant. If we knew the left hand side (LHS) of (22), even computationally, we could use it to estimate d V C. However, in general we don t know the LHS of (22). Instead, we generate one observation of the form

ξ (nl) = Φ d V C(nl) + ϵ(nl) (28)

for each design point nl by bootstrapping and denoted the realized values by ˆξ (nl). In (28), we assume ϵ(nl) has a mean zero, but an otherwise unknown, distribution. We can therefore obtain a list of values of bξ(nl) for the elements of NL. In eﬀect, we are assuming that Φ d V C(nl) provides a tight bound on , and hence m as suggested by Theorem 2. Our algorithm is as follows.

Algorithm #1

A collection of regression models G = {gβ},

A data set,

Two integers b1 and b2 for the number of bootstrap samples,

An integer m for the number of subintervals to discretize the losses,

A set of design points NL = {n1, n2, . . . , n L}.

For each l = 1, 2, . . . , L do:

1. Take a bootstrap sample of size 2nl (with replacement) form the data set;

2. Randomly subdivide the bootstrap data into two groups G1 and G2 of size nl each;

3. Fit two models, one for G1 and one for G2;

4. Compute the squared error for each model on the covariates and responses that the other model was trained on. Thus:

SE1 = (predict(Model1, x2) y2)2 and SE2 = (predict(Model2, x1) y1)2

where (x1, y1) ranges over G1 and (x2, y2) ranges over G2. So, there are nl values of SE1 and nl values of SE2.

5. Discretize the loss function, i.e. put each SE1 and SE2 in one of the m disjoint intervals;

6. Estimate νm j (G2, α1) and νm j (G1, α2) using the SE1 s and SE2 s respectively in the intervals Im j for j = 0, 1, . . . , m 1;

7. Compute the diﬀerences |νm j (G2, α1) νm j (G1, α2)| for j = 0, 1, . . . , m 1;

8. Repeat Steps 1-7 b1 times, take the mean interval-wise and sum it across all intervals, so we have:

j=0 mean |νm j (G2, α1) νm j (G1, α2)| ;

MODEL SELECTION VIA THE VC DIMENSION

9. Repeat Steps 1-8 b2 times to get rb1,i for i = 1, 2, . . . , b2 and form

i=1 rb1,i(nl) .

It is seen that Step 9 uses a mean even though the deﬁnition of m and (see (18) and (19)) has a supremum inside the expectation. This is intentional because using a supremum within each interval gave a worse estimator. We suggest that summing the mean over the intervals performs well because it is not too far from the supermum and is more stable. Note that this algorithm is parallelizable because diﬀerent nl can be sent to diﬀerent nodes to speed the process of estimating ˆξ ( ) for all nl. After obtaining ˆξ(nl) for each value of nl, we estimate d V C by minimizing the squared distance between ˆξ(nl) and Φ d V C(nl). Our objective function is

fnl(d V C) =

nl log 2nle

where L is the number of design points. Optimizing (29) usually only leads to numerical solutions and in our work below, we set b1 = b2 = W for convenience.

3. Proof of Consistency

In this section, we provide a proof of consistency for the estimator ˆd V C for d V C that we presented in Subsec. 2.2. In many respects, the structure of this proof should be credited to Mc Donald et al. (2011). Our contribution is to adapt Mc Donald et al. (2011) to our stable estimator for the regression context. We begin with some notation and deﬁnitions. Let Φ = {φd V C,c} be a collection of real valued functions parametrized by d V C H = [1, M] and c I = [a, b] R with M N large enough and 0 < a < b < so that b a > 0 is also large enough. Elements of this collection are of the form

φd V C,c (nl) = c

nl log 2nle

as derived in Subsec. 2.2 (see expression (27)). In expression (30), we assume L values n1, . . . , n L have been pre-speciﬁed. Fix a value of c and let Φc Φ be the section of elements corresponding to the ﬁxed c. The proof holds for each ﬁxed c and if we optimize over c to obtain ˆc as explained in Subsec. 2.2, the convergence of ˆd V C to the true value d V C will only be faster. The collection of functions Φ is the continuous image of a compact set and hence is compact. Now, without loss of generality, we can choose R > supd V C φd V C L where the norm L is derived from the inner product

l=1 f (nl) g (nl)

MPOUDEU and CLARKE

for real valued functions of a real variable. Thus φd V C = (φd V C (n1) , . . . , φd V C (n L)) (where the subscript c on the φd V C,c (n L) s in expression (30) have been dropped for ease of notation). Fix a value of c and consider the compact subclass of Φ given by

Φc(R) = φ Φc : φ φd V C L < R , (31)

where φd V C is the element of Φc corresponding to the correct value of d V C. For a given nl, we have

i=1 rb1,i(nl) (32)

where rb1,i(nl) is the i bootstrapped value of the integrand of m for each nl, i = 1, . . . , W

and l = 1, . . . , L. In vector form, write ˆξ = ˆξ (n1) , . . . , ˆξ (n L) . Using (28), each ˆξ (nl) can be represented as

ˆξ (nl) = φd V C (nl) + ϵ (nl) . (33)

We have the following result.

Theorem 5 : Suppose the true d V C [1, M] and that i = 1, . . . , W, l = 1, . . . , L, rb1,i (nl) N φd V C(nl), σ2 and independent, E (ϵ(nl)) = 0, V ar(ϵ(nl)) = σ2. Then, on Φc(R), as n , m and W = W(n) at suitable rates we have that

P φ ˆd V C φd V C L δ = O 1

Remark: In fact, the rb1,i(nl) s are only approximately independent N φd V C(nl), σ2 . However, as n increases they become closer and closer to being independent N φd V C(nl), σ2 , assuming φd V C(nl) is a tight enough upper bound, as n, m at appropriate rates. Also, it is seen that if L = L(n) is increasing then L averages the evaluations of more and more components of, say, φ ˆd V C. In the limit, this can be exihibited as an integral, i.e. as a quadratic norm. So, L can be regarded as an approximation of a L2-space norm that strengthens as a norm (or inner product) as n . In Theorem 5, if we controlled the distance between L and its limit, we could get a stronger mode of consistency. Proof : By deﬁnition of φd V C, we have

nl log 2nle

nl log 2nle

or more compactly ˆξ φ ˆd V C

L ˆξ φd V C 2

L. Expanding both sides of (35) gives

φ2 ˆd V C (nl) φ2 d V C (nl) 2

l=1 ˆξ (nl) φ ˆd V C φd V C (nl)

MODEL SELECTION VIA THE VC DIMENSION

L φd V C 2 L 2 L

l=1 (φd V C (nl) + ϵ (nl)) φ ˆd V C (nl) φd V C (nl)

φd V C (nl) φ ˆd V C (nl) φ2 d V C (nl)

+ ϵ (nl) φ ˆd V C (nl) φd V C (nl) .

Rearranging gives φ ˆd V C

L 2 ϵ, φ ˆd V C + φd V C 2 L 2 ϵ, φd V C φ ˆd V C L,

where ϵ = (ϵ (n1) , . . . , ϵ (n L)), i.e. φ ˆd V C φd V C 2

L 2 ϵ, φ ˆd V C φd V C L. (36)

It is seen that the LHS is the main quantity we want to control. We have

P φ ˆd V C φd V C L > δ P ϵ, φd V C φ ˆd V C δ2

P ϵ L φd V C φ ˆd V C

δ2 E ϵ 2 L , (37)

using the Cauchy-Schwarz inequality, the bound in (31), and Markov s inequality. By construction, we have that

E ϵ 2 L = 1 L

l=1 E ϵ2 (nl)

i=1 rb1,i(nl)

i=1 (rb1,i(nl) φ(nl))

i=1 E (rb1,1(nl) φ(nl))2

i=1 V ar (rb1,1(nl)) = σ2

Using (38) in (37) gives

P φ ˆd V C φd V C L δ 2R2σ2

MPOUDEU and CLARKE

in which the upper bound decreases as n increases because W(n) is increasing, thereby giving (34).

A notable diﬀerence between (34) and the corresponding theorem in Mc Donald et al. (2011) is that our simpliﬁed result eﬀectively only gives

P φ ˆd V C φd V C L δ = O 1

rather than O e γW for some γ > 0, a much faster rate. We conjecture that the more sophisticated techniques used in Mc Donald et al. (2011) could be adapted to our setting and thereby give an exponentially fast rate of convergence of ˆd V C to d V C in probability. However, as yet, we have not been able to show this. Also, although it is suppressed in the notation, our result implicitly requires m to justify the use of φd V C. Using Theorem 5, we can show that our ˆd V C is consistent. Suppose that φd V C( ) is κ-expansive, or simply expansive when κ is understood, i.e. nl, κ = κ (nl) so that κ(nl) d V C d V C φd V C (nl) φd V C (nl) , where κ(n), the expansion factor, is bounded on compact sets. Since the form of φd V C (nl) is known from (27), it is clear that the uniform expansivity condition we have assumed below actually holds, at least for appropriately chosen compact sets. We also observe that for c I there exists a neighborhood B (c, ϵl) , η > 0, on which (34) is true. Cover I H by sets of the form B (c, η) {d V C}; ﬁnitely many will be enough since I H is compact.

Theorem 6 : Given that the assumptions of Theorem 5 hold and that φd V C( ) is expansive, we have, as n , that

P ˆ d V C d V C δ 2R2σ2

, as Wn , (41)

where κ = q

1 L PL l=1 κ (nl) is the overall expansion factor.

Proof : Since all L of the φd V C (nl) s are at least locally expansive, their local expansivity inequalities can be summarized by an inequality of the form

d V C d V C

φd V C (nl) φd V C (nl) 2

= φd V C (nl) φd V C (nl) L,

where d V C is the true value and d V C is any other value in H, and any extra constant from the local expansion factors are assumed to have been absorbed into the κ (nl) s as needed.

1 L PL l=1 κ (nl). Using Theorem 5, and (42) we have

P ˆd V C d V C δ P h φ ˆd V C (nl) φd V C (nl) L δκ i

κδ2W , (43)

MODEL SELECTION VIA THE VC DIMENSION

where the last upper bound decreases as W = Wn as n , giving (41).

4. Numerical Comparisons

For any model, we can estimate the LHS of (22) from Theorem 2 by Algorithm #1 in Sec. 2.2. Then, we can use nonlinear regression in (29) to ﬁnd ˆd V C. So, it is seen that ˆd V C is a function of the conjectured model. In principle, for any given model class, the VC dimension can be found, so our method can be applied. Since our goal is to estimate the true VC dimension, when a conjectured model P ( | β) is linear and correct, we expect V C (P ( | β)) = ˆd V C. By the same logic, if P ( | β) is far from the true model, we expect V C (P ( | β)) ˆd V C or V C (P ( | β)) ˆd V C. This suggests we estimate d V C by seeking

ˆd V C = arg min k

V C (Pk ( | β)) ˆd V C,k , (44)

where {Pk ( | β) |k = 1, 2, , K} is some set of models and ˆd V C,k is calculated using model k, t is a positive and usually small number that such that t 2. In the case of linear models, with q = 1, 2, , Q explanatory variables, we get

ˆd V C = arg min q

q ˆd V C,q , (45)

where ˆd V C,q is the estimated VC dimension for model of size q. Note that (44) can identify a good model even when consistency fails. The reason is that (44) only requires a minimum at the VC dimension not convergence to the true VC dimension which may be any model under consideration. Here, to achieve uniqueness we use (45) and choose the smalest q achieving the smallest value of |q ˆd V C,q|, provided this makes sense in context. Our numerical work uses linear models, since for these we know the VC dimension equals the number of explanatory variables, see Anthony and Bartlett (2009). To establish notation, we write the regression function as a linear combination of the ﬁxed eﬀect covariates xj, j = 0, 1, , p,

y = f(x, β) = β0 + β1x1 + β2x2 + + βpxp =

Given a data set, {(xi, yi) , i = 1, 2, . . . , n}, the matrix representation is Y = Xβ + ϵ where Y is the n 1 vector of response values, X is the n (1 + p) matrix with rows (1, x1,1, x1,2, . . . , x1,p), β = (β0, β1, . . . , βp)T is the vector of model parameters, and ϵ = (ϵ1, ϵ2, . . . , ϵn)T is a n 1 mean zero Gaussian random vector. The least squares estimator bβ, assuming it exists, is given by bβ = X X 1 X Y.

4.1. Simulated data

We simulate data from

Y = β0x0 + β1x1 + β2x2 + + βpxp + ϵ,

MPOUDEU and CLARKE

where ϵ N(0, σϵ = 0.4) and

x0 = 1, βj N(µ = 5, σβ = 3), xj N(µ = 5, σx = 2), for j = 1, 2, p,

in which all the β s, x s and ϵ s are independently generated. We center and scale all our variables, including the response. For convenience, we use a nested sequence of model lists. If our covariates were highly correlated, before applying our method we could de-correlate them by sphering, i.e. transforming the covariates using their covariance matrix so they become approximately uncorrelated with variance one, see Murphy (2012) p. 144. In Subsec. 4.1.1, we present a typical simulation result to verify our estimator for VC dimension is consistent for the VC dimension of the true model. In Subsec. 4.1.2, we discuss simulations we have done where results do not initially appear to be consistent with the theory. First, large values of n are needed to get good performance with large values of p. Second as p increases, we must choose nl s that are properly spread out over [0, n].

4.1.1. A first example

We implemented simulations for a range of model sizes p = 15, 30, 40, 50, 60 and 70 and applied six model selection techniques AIC, BIC, CV, \ PERM 1, \ PERM 2, and VC dimension (VCD), see Mpoudeu and Clarke (2018) for details. We tended to use larger sample sizes with larger values of p and spaced the design points uniformly over [0, n], even though this may be suboptimal. We arbitrarily set m = 10 and W = 50. Our models were nested, including models that were too small and some that were too large, so that the estimate of d V C would uniquely specify a model. As a typical example, Fig. 1 shows the results for n = 700 and p = 70 with L = 7. When the size of the conjectured model is strictly less than the size of the true model, ˆd V C is equal to the smallest design point. However, when the conjectured model exactly matches the true model, ˆd V C 61, underestimating d V C. Interestingly, if we simply look at the minimal VCD value it occurs at the conjectured model of size 70, the true values of p. When the conjectured model is more complex than the true model, the VCD value is visibly higher than the VCD value for the true model. Thus, using VCD favors parsimony more than the other methods do. In results not shown here, we increased n to 2000 and used good design points (as discussed in Subsec. 4.1.2) and found ˆd V C 70. Our observation for the other ﬁve model selection methods is that they are less aﬀected by the small sample size, but decrease and ﬂatline in the sense that the AIC, BIC and CV values decrease very slowly (making it unclear which model to choose) while \ PERM 1 and \ PERM 2 routinely give models that are too small if one follows the usual rule of choosing the smallest local minimum. Overall, using VCD penalizes models that are too large more than other methods do, thereby giving a clearer statement about which model is true, at least in the limit with intelligently chosen design points. Other choices of n, p, the design points, and other inputs, gave results compatible with this interpretation. Although the diagrams are not shown here (but see Mpoudeu and Clarke (2018), Sec. 4) the smallest discrepancy between the size d V C of the model and ˆd V C usually occurs at the true model. This indicates that ˆd V C is consistent. In addition, even though the VCD values generally increase as the size of the conjectured model exceeds the size of the true model, in some cases, past a certain value d V C, the VCD value may ﬂatline as well. The

MODEL SELECTION VIA THE VC DIMENSION

60 65 70 75 80

PERM1, PERM2

Conjectured Model Size

VC Dimension

PERM1 PERM2 VCD

Complexity Methods

60 65 70 75 80

AIC and BIC

Conjectured Model Size

0.00 0.10 0.20

Cross Validation

Traditional Methods

p = 70 , n = 700 , n_l = 100 to 700 by 100

Figure 1: Upper: Values of \ PERM 1, \ PERM 2, and VC dimension. Lower: Values of AIC, BIC and CV for p = 70, σϵ = 0.4, σβ = 3, σx = 2.

MPOUDEU and CLARKE

problem with ˆd V C ﬂatlining (or decreasing) past a certain value of d V C occurs mostly due to instability, e.g., when n is not large enough relative to p. We argue that estimating VC dimension directly is better than using \ PERM 1 or \ PERM 2. There are several reasons. First, the computation of \ PERM 1 and \ PERM 2 requires ˆd V C. It also requires a threshold η be chosen (see Propositions 3 and 4) and is more dependent on m than ˆd V C is. Being more complicated than ˆd V C, \ PERM 1, \ PERM 2 will break down faster than ˆd V C. This is seen, for instance in tables of Mpoudeu (2017) Chap. 3 and the discussion there. More generally, we argue that \ PERM 1, and \ PERM 2 break down faster than ˆd V C with increasing p, if the sample size is held constant. That is, \ PERM 1 and \ PERM 2 are less eﬃcient than ˆd V C. Finally for this subsection, we reiterate our observation that in practice, when our VC dimension technique is used properly i.e., n is large enough relative to p, the σ used in Theorem 6 and the design points are adequately chosen, ˆd V C d V C has a well deﬁned minimum at the true value of d V C. Also, in contrast with other methods (including using shrinkage methods to nest models) our technique is generally more sensitive to over ﬁt, thereby giving better parsimony.

4.1.2. Dependence of ˆd CV on n and NL

It is no surprise that the higher n/p is the better the discrimination of ˆd V C over models is. However, our ﬁndings are more complicated because of the design points. The informal rule is that one wants n 10p for good parametric inference. However, this does not take into account model selection that often requires n > 10p. Indeed, for good model selection with d V C, we ﬁnd that n 15 is usually suﬃcient, provided the design points are not badly chosen. In our examples choosing the nl s uniformly over [0, n] generally gives decent but not optimal performance and for typical ranges of model sizes L 5 will suﬃce even though larger values of L are usually better, say 7 L 10. Although we are unable at this point to chracterize the tradeoﬀbetween design points and sample size, we have noted that in some cases, good choice of design points can compensate for insuﬃcient sample size. Indeed, relatively small changes in the nl s can have a large numerical eﬀect when n is small; possible due to instability in the nonlinear regression step, (29). We leave the question of optimally choosing L and the nl s as future work even though we make the following recommendations: 1) Good choices of nl s are spread over [0, n]. 2. More nl s should be in [n/2, n] than in [0, n

2 ], but neither should be empty. 3. Good choices for nl s tend to remain good while poor choices of nl s may not be as damaging to inference, as n . 4. It is better to use fewer design points over a larger range than more design points over a smaller range. 5. As n increases the nl s should shift upward, i.e., more in [n/2, n]. 6. If enough nl s are large relative to p, ˆd V C may be accurate for smaller n, perhaps n 12p will suﬃce.

4.2. Analysis of a Benchmark Data Set

The goal of this section is to evaluate our method on a benchmark data set Tour obtained from the Tour de France1. We start by giving some information about Tour, then in Sec.

1. Tour de France Data was compiled by B. Clarke. More information can be found at http://www.letour.fr/.

MODEL SELECTION VIA THE VC DIMENSION

4.2.1 we analyze it using a model list based on two of the explanatory variables. The list is a sequence of models nested by SCAD. We evaluate our method by comparing ˆd V C to AIC, BIC, CV, \ PERM 1, and \ PERM 2. In Subsec. 4.2.2, we look at the eﬀect of outliers in the estimates ˆd V C, \ PERM 1, and \ PERM 2. The full data set Tour has n = 103 data points. The data points are dependent (associated) because many cyclists competed in the Tour de France for more than one year. Here we ignore the dependence structure because the dependencies are small enough that the complication of accounting for them is not worthwhile. Each data point has a value of the response variable (Speed), the average speed in kilometers per hour (km/h) of the winner of the Tour. The explanatory variables are the Year (Y) of the Tour and its distance (D) in kilometers. Our data is from 1903 to 2016. However, during World Wars 1 and 2 there was no Tour, so we do not have data points for those years. The eﬀect of World War I on the speed of the winner of the tour can be seen in a scatterplot of Speed vs. Y the lowest winning speeds were in the years just after World War I, probably due to casualties. After World War II, there was also a decrease in average winning speed, but the decrease was less than after World War I. There is a curvilinear relationship between Speed and Year and a roughly linear relationship between Speed and D; the variability of Speed increases with D.

4.2.1. Analysis of Tour Using a Nested Collection of Models

We identify a nested model list using Y , D, Y 2, D2 and the interaction between Year and Distance denoted Y : D as covariates. Because the size of the data set is not large, we can only use a small model list. We order the variables using SCAD because as a shrinkage method it perturbs parameter estimates the least and satisﬁes an oracle property. Under SCAD, the order of inclusion of variables is Y , D, D2, Y 2, and Y : D. We therefore ﬁt ﬁve diﬀerent models. We use the six model selection techniques from Sec. 4.1 and include ˆd V C the original estimator in Vapnik et al. (1994) for the sake of comparison. It is seen that Vapnik et al. (1994) s original

Model ˆd V ˆd V C \ PERM 1 \ PERM 2 AIC BIC CV

Y 20 4 16.42 44.95 84 79.67 0.1294 Y, D 20 4 15.10 42.83 77 71.66 0.1209 Y, D, D2 20 4 11.21 36.37 24 26.21 0.0727 Y, D, D2, Y 2 20 4 11.09 36.16 17 28.77 0.0681 Y, D, D2, Y 2, Y : D 20 4 11.06 36.11 19 32.96 0.0691

Table 1: Model selection for the Tour de France data using seven methods. The design points for ˆd V and ˆd V C are 20, 30, 40, 50, 60, 70, 80, 90, and 100 and W = 50. So ˆd V equals the smallest design point in all cases. This problem frequently occurs for ˆd V .

method is helpful only if it is reasonable to surmise that there are 15 missing variables whereas our method uniquely identiﬁes one of the models on the list. Even though there is likely no model for Tour de France data set that is accurate to inﬁnite precision, our method is giving a useful result. Indeed, our method is choosing the fourth model list the same as

MPOUDEU and CLARKE

indicated by AIC and CV. The BIC drops the Y 2 term which is not unreasonable because the curvilinearity is less than quadratic. \ PERM 1 and \ PERM 2 include the interaction term (which can be seen to be zero by a simple t-test). It may also be the case that the derivation of the BIC rests heavily on the independent of data which is not the case here.

4.2.2. Analysis of The Tour de France data set with outliers removed

The observations just after World War I may be outliers. So, we consider the data set formed by deleting the points from 1919 to 1926. Let us see how the six model selection techniques now behave. The process of analyzing this reduced data set is the same: We identify the nested model lists by SCAD and then ﬁnd the models corresponding to ˆd V C, AIC, BIC, CV, \ PERM 1, and \ PERM 2. The results are given in Table 2.

Model Size bd V C \ PERM 1 \ PERM 2 AIC BIC CV

Y 4 12.87 40.72 67 75 0.1181 Y , D2 4 12.01 39.26 69 79 0.1336 Y , D2, D 4 11.66 38.36 46 59 0.0919 Y , D2, D, Y 2 4 11.48 38.34 28 43 0.0742 Y , D2, D, Y 2, Y : D 4 11.35 38.13 29 46 0.0735

Table 2: Model selection for the Tour de France data set with outliers removed.

Under SCAD, the order of inclusion of our covariates is: Y , D2, D, Y 2 and Y : D. This order is diﬀerent from when we used all data points. The outliers suggest Y : D was more important than it probably is. Note also that when we used all the data points, D was included before D2 and D2 was included after Y 2. Again, we ﬁt 5 nested models. From Table 2, if we choose a model using ˆd V C, we get the same answer as in Sec. 4.2.1, the model with four variables: Y , D2, D, Y 2. AIC and BIC choose the same model probably because (Y : D) has low correlation with Speed (-0.08). \ PERM 1, \ PERM 2 and CV choose the model of size 5, which we discount as before because Y : D is only slightly correlated with Speed. That is, the reasoning in Subsec. 4.2.1 for why we think that the model chosen by ˆd V C is best continues to hold.

4.3. Application to a Real Data Set

To demonstrate the use of our technique, we re-analyze the Wheat data set presented and studied in Campbell et al. (2003), Dilbirligi et al. (2006), and Dhungana et al. (2007) from a non-complexity based standpoint. The Wheat data set has 2912 observations. More information concerning the data set and the design structure can be found in Campbell et al. (2003). The response variable is YIELD (MG/ha), the covariates that we used are 1000 kernel weight (TKWT), kernels per spike (KPS), Spikes per square meter (SPSM), height of the plant (HT), test weight (TSTWT(KG/hl)), and kernels per square meter (KPSM). Often, in agronomic data sets, there are several classes of explanatory variables, here we have phenotypic, single nucleotide polymorphisms (SNP s) and the variables deﬁning the design. We compare VC dimension based model selection to the methods used in the last subsection and verify our method gives good results.

MODEL SELECTION VIA THE VC DIMENSION

Our collection of explanatory variables can be grouped into three categories: phenotype, SNP, and design variables. For brevity, we ﬁt only the phenotype variables in Subsec. 4.3.1 and phenotype plus design variables in Subsec. 4.3.2. Comparing these will show that the model selection is unaﬀected by the design. Also, for brevity, our only analysis is multilocation because we pooleded the data over location-year pairs. A full analysis is in Mpoudeu and Clarke (2018).

4.3.1. Estimation of VC Dimension Using Phenotypic Covariates Only

Intuition suggests Y IELD = β0 + β1 TKWT KPSM + ϵ (46)

will be a good model because YIELD is essentially the product of the number of kernels and their average weight. Likewise,

Y IELD = β0 + β1 TKWT KPS SPSM + ϵ (47)

should also be a good model. So, using only phenotpic variables does not lead to a unique good model. Both are over simpliﬁcations and we can be conﬁdent that other inﬂuences on YIELD must be considered. Indeed, a 3-dimensional plot of the vectors (YIELD, TKWT, KPSM) looks like a triangle that is bowed out to one side. The bowing means that (46) is only an approximation; other terms are required to explain YIELD. Henceforth, we focus on (46) rather than (47) because we have limited ourselves to second order models. To implement our multilocation analysis, we ﬁrst ﬁnd the order of inclusion of the phenotypic variables in the model, using correlation with YIELD. Then, we ﬁnd values for ˆd V C, AIC, BIC, CV, \ PERM 1, \ PERM 2, and the models given by SCAD and ALASSO. Under absolute value of correlation with YIELD, the order of inclusion of the explanatory variables is: TKWT KPSM, TSTWT KPSM, KPSM, SPSM KPS, KPSM2, KPSM HT, TKWT SPSM, TSTWT2, TSTWT, SPSM KPSM, KPS KPSM, TSTWT SPSM, SPSM SPSM HT, SPSM2, TKWT TSTWT, TKWT, TSTWT HT, TKWT2 TKWT KPS, TSTWT KPS, TKWT HT, KPS HT, HT, HT2 KPS, KPS2. So, we consider 27 nested models. This leads to Table 3. First, we note there is no variability in \ PERM 1 and \ PERM 2 so following standard usage, they select a model with TKWT KPSM as the only explanatory variable. Likewise, AIC, BIC, and CV suggest the one term model. However, ˆd V C = 14 means the VC dimension chooses the model with the ﬁrst 14 terms from the ordered list. SCAD and ALASSO both give the one term model

\ Y IELD = 3.43 + 1.12 TKWT KPSM, (48)

the same as \ PERM 1, \ PERM 2, AIC, BIC and CV. Thus the only reasonable model is the one chosen by ˆd V C.

4.3.2. Analysis of Wheat Using Phenotypic Data and the Design Structure

Our objective in this subsection is to take the design structure into account and see its impact on the values of the VC dimension and hence on the chosen model. As before, we implement a multilocation analysis. Including design variables forces us to use a more

MPOUDEU and CLARKE

Size ˆd V C \ PERM 1 \ PERM 2 AIC BIC CV

1 13 7 11 -9160 -9142 0.001801809 2 14 7 11 -9159 -9136 0.001802107 3 13 7 11 -9159 -9130 0.001802470 4 13 7 11 -9159 -9124 0.001803912 5 13 7 11 -9159 -9118 0.001803933 6 14 7 11 -9157 -9110 0.001805230 7 14 7 11 -9156 -9103 0.001805349 8 14 7 11 -9158 -9099 0.001806966 9 14 7 11 -9156 -9091 0.001807176 10 14 7 11 -9154 -9084 0.001808455 11 13 7 11 -9153 -9076 0.001808248 12 13 7 11 -9151 -9069 0.001808966 13 14 7 11 -9150 -9061 0.001810670 14 14 7 11 -9148 -9054 0.001812884 15 13 7 11 -9147 -9047 0.001813791

Table 3: The column labeled size gives the number of coeﬃcients for each linear model. The 2nd through 7th columns give the corresponding estimates for ˆd V C, \ ERM 1, \ ERM 2, AIC, BIC, and CV for Wheat.

complicated bootstrap procedure that would otherwise be suﬃcient. Thus, to implement our method here, we perform a restricted bootstrap. Speciﬁcally, we bootstrap in each level of the design variable (incomplete block) so that each half data set has all levels of the design structure. We do this to maintain the design structure and its eﬀects. To include phenotypic variables in the models, we use the same order of inclusion as in Subsec. 4.3.1. The natural comparison is between Tables 3 and 4. Apart from random variation, they are identical. Moreover, the sparsity methods give exactly the same results in both settings. Thus the conclusions here are the same as in Subsec. 4.3.1: The design variables have no impact on model selection and ˆd V C gives the only plausible model.

5. Conclusions

A concise summary of the contributions in this paper is as follows. Sec. 2.1 presents the derivation of the objective function we used to estimate the VC dimension. It is essentially an upper bound on the expected diﬀerence between two losses that we have deﬁned as m or , where the m indicates the discretization of the loss function for a regression problem. In Subsec. 2.2 we give an estimator for d V C that uses our upper bound, nonlinear regression treating sample sizes as design points, a data driven estimator of m, and an optimization over an arbitrary constant. While this sounds complex, in practice the computations can usually be done in minutes on a regular laptop. Even though we only have an upper bound on m, in Sec. 3 we are able to give conditions under which our estimator is consistent. This is circumstantial evidence that our upper bound is tight.

MODEL SELECTION VIA THE VC DIMENSION

Size ˆd V C \ PERM 1 \ PERM 2 AIC BIC CV

1 13 7 11 -9160 -9142 0.001801809 2 13 7 11 -9159 -9136 0.001802107 3 13 7 11 -9159 -9130 0.001802470 4 13 7 11 -9159 -9124 0.001803912 5 13 7 11 -9159 -9118 0.001803933 6 13 7 11 -9157 -9110 0.001805230 7 13 7 11 -9156 -9103 0.001805349 8 13 7 11 -9158 -9099 0.001806966 9 13 7 11 -9156 -9091 0.001807176 10 13 7 11 -9154 -9084 0.001808455 11 13 7 11 -9153 -9076 0.001808248 12 13 7 11 -9151 -9069 0.001808966 13 13 7 11 -9150 -9061 0.001810670 14 13 7 11 -9148 -9054 0.001812884

Table 4: The column labeled size gives the number of coeﬃcients for each linear model. The 2nd through 7th columns give the corresponding estimates for ˆd V C, \ PERM 1, \ PERM 2, AIC, BIC, and CV for multi-location analysis with design structure.

We have done an extensive comparison of our estimator of VC dimension as a model selection method with seven established model selection methods, namely two forms of empirical risk minimization, AIC, BIC, CV, and two sparsity criteria. Other examples can be found in Mpoudeu and Clarke (2018). We did this for the special case of linear models but the same reasoning can be used for any class of nonlinear models e.g., trees, for which the VC dimension can be identiﬁed. We also gave one example of how our estimator for VC dimension performs better than the original estimator in Vapnik et al. (1994). As a generality, our method equals or outperforms these other methods.

Acknowledgments

The authors gratefully acknowledge support from NSF grant # DMS-1419754 and invaluable computational support from the Holland Computing Center. Data, code, and results for the full versions of the analyses presented here can be found at https://github.com/ poudas1981/Wheat_data_set. We also express our gratitude to a sophisticated and hardworking referee and Editor who helped us improve our work immensely.

M. Anthony and P. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

A. Barron and T. Cover. Minimum complexity density estimation. IEEE Transactions on Information theory, 37(4):1034 1054, 1991.

MPOUDEU and CLARKE

J. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley Series in Probability and Statistics, 2000.

B. Campbell, P. Baenziger, P. Stephen, K. Gill, K. Eskridge, H. Budak, M. Erayman, I. Dweikat, and Y. Yen. Identiﬁcation of qtls and environmental interactions associated with agronomic traits on chromosome 3a of wheat. Crop Science, 43:1493 1505, 2003.

P. Dhungana, K. Eskridge, P. Baenziger, B. Campbell, K. Gill, and I. Dweikat. Analysis of genotype-by-environment interaction in wheat using a structural equation model and chromosome substitution lines. Crop Science, 47:477 484, 2007.

M. Dilbirligi, M. Erayman, B. Campbell, H. Randhawa, P. Baenziger, I. Dweikat, and K. Gill. High-density mapping and comparative analysis of agronomically important traits on wheat chromosome 3a. Genomics, 88:74 87, 2006. ISSN 0888-7543. doi: https:// doi.org/10.1016/j.ygeno.2006.02.001. URL http://www.sciencedirect.com/science/ article/pii/S0888754306000437.

J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348 1360, 2001. doi: 10. 1198/016214501753382273. URL http://dx.doi.org/10.1198/016214501753382273.

J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849 911, 2008.

D. Mc Donald, C. Shalizi, and M. Schervish. Estimated VC dimension for risk bounds. preprint ar Xiv:1111.3404, 2011.

M. Mpoudeu. Use of Vapnik-Chervonenkis Dimension in Model Selection. Ph D thesis, University of Nebraska-Lincoln, 2017. See: https://arxiv.org/pdf/1808.06684.pdf.

M. Mpoudeu and B. Clarke. Model selection via the VC dimension. See: https: // arxiv. org/ abs/ 1808. 05296 , 2018.

K. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

X. Shao, V. Cherkassky, and W. Li. Measuring the VC dimension using optimized experimental design. Neural computation, 12(8):1969 1986, 2000.

V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Soviet Math. Dokl, volume 9, pages 915 918, 1968.

V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies to their probabilities. In Soviet Math. Dokl, volume 9, pages 915 918, 1971.

V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC dimension of a learning machine. Neural Computation, 6(5):851 876, 1994.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418 1429, 2006.