Published in Transactions on Machine Learning Research (02/2024)

Bias/Variance is not the same as Approximation/Estimation

Gavin Brown gavin.brown@manchester.ac.uk
Department of Computer Science, The University of Manchester

Riccardo Ali rma55@cam.ac.uk
Department of Computer Science & Technology, The University of Cambridge

Reviewed on OpenReview: https://openreview.net/forum?id=4TnFbv16hK

Abstract

We study the relation between two classical results: the bias-variance decomposition, and the approximation-estimation decomposition. Both are important conceptual tools in Machine Learning, helping us describe the nature of model fitting. It is commonly stated that they are "closely related", or "similar in spirit". However, sometimes it is said they are equivalent. In fact they are different, but have subtle connections cutting across learning theory, classical statistics, and information geometry, that (very surprisingly) have not been previously observed. We present several results for losses expressible as a Bregman divergence: a broad family with a known bias-variance decomposition. Discussion and future directions are presented for more general losses, including the 0/1 classification loss.

1 Introduction

Geman et al. (1992) introduced the bias-variance decomposition to the Machine Learning community, and Vapnik & Chervonenkis (1974) introduced the approximation-estimation decomposition, founding the field of statistical learning theory. Both decompositions help us understand model fitting: referring to model size, and some kind of trade-off. The terms are often used interchangeably. And yet, they are different things.

The approximation-estimation decomposition refers to models drawn from some function class $\mathcal{F}$, and considers an excess risk (that is, the risk above that of the Bayes model), breaking it into two components:

$$\text{excess risk} = \text{approximation error} + \text{estimation error}. \quad (1)$$
We might choose to increase the size of our function class, perhaps by adding more parameters to our model. In this situation it is commonly understood that the approximation error will decrease, and the estimation error will increase (Von Luxburg & Schölkopf, 2011), beyond a certain point resulting in over-fitting of the model.

In contrast to the abstract notion of a function class, the bias-variance decomposition considers the risk of real, trained models, in expectation over possible training sets. Assuming there is a unique correct response for each given input x (i.e., no noise), it breaks the expected risk into two components:

$$\text{expected risk} = \text{bias} + \text{variance}. \quad (2)$$

As we increase model size: the bias tends to decrease, and the variance tends to increase, again determining the degree of over-fitting. Recently, it has become apparent that this trade-off is not always simple, e.g. with over-parameterised models; but the decomposition still holds even if a simple trade-off does not. We note that this decomposition, as used in the Machine Learning literature, concerns inference of the response/target variable, and not of the parameters, as is more common in classical statistics.

It is easy, and common, to conflate these decompositions. With a literature review (see Appendix C) one can observe innocent (but imprecise) statements such as "they are similar in spirit", but also the more extreme (and incorrect/misleading) "the trade-off between estimation error and approximation error is often called the bias/variance trade-off". In contrast, we identify the precise relationships. We present detailed analysis for Bregman divergences in Section 3, and offer discussion for more general losses in Section 4.
[Figure 1 shows two similar diagrams: on the left, excess risk against size of hypothesis space, split into approximation and estimation error; on the right, expected risk against number of parameters, split into bias and variance. Both show an underfitting and an overfitting regime.]

Figure 1: Two diagrams (same on left/right is intentional) illustrating how the approximation/estimation and bias/variance trade-offs are commonly described, and easily confused.

2 Background

We introduce notation and review the two decompositions. We introduce these ideas in an intentionally didactic/comprehensive manner, to avoid any possibility of confusion in terminology.

2.1 Preliminaries

Consider a standard supervised learning setup, where the task is to map from an input $x \in \mathcal{X} \subseteq \mathbb{R}^d$ to an output $y \in \mathcal{Y} \subseteq \mathbb{R}^k$, and assume there exists an unknown distribution $P(x, y)$. This is achieved by learning a model $f$, which can also be seen as selecting a function $f$ from a restricted function class $\mathcal{F} \subset \mathcal{F}_{\mathrm{all}}$, where $\mathcal{F}_{\mathrm{all}}$ is the space of all measurable functions $\mathcal{X} \to \mathcal{Y}$. The discrepancy of $f(x)$ from the true $y$ is quantified with a loss $\ell(y, f(x))$, which may or may not be symmetric. We define the risk of a model $f$ as

$$R(f) := \mathbb{E}_{xy}[\ell(y, f(x))] = \int \ell(y, f(x)) \, dP(x, y). \quad (3)$$

A Bayes model $y^*$ is a (not necessarily unique) function which minimizes this quantity at each $x$, i.e.

$$y^* \in \operatorname*{arg\,inf}_{f \in \mathcal{F}_{\mathrm{all}}} R(f), \quad (4)$$

where we follow Bach (2023, Prop 2.1) using the convention that the arg inf returns a set if non-unique. We acknowledge a slight abuse of notation, using $y^*$ as a function in $\mathcal{F}_{\mathrm{all}}$ or a vector in $\mathbb{R}^k$ as needed; the intention will always be made clear from context. Given that we picked a restricted family $\mathcal{F} \subset \mathcal{F}_{\mathrm{all}}$, we have no guarantee that it contains $y^*$. A best-in-family model $f^*$ is defined similarly,

$$f^* \in \operatorname*{arg\,inf}_{f \in \mathcal{F}} R(f). \quad (5)$$

These are defined in terms of the true distribution $P(x, y)$. In practice, we only have a finite sample: $n$ points $(x_1, y_1), \ldots, (x_n, y_n)$, each drawn i.i.d. from $P(x, y)$, as a realisation of the random variable $D \sim P(x, y)^n$.
The empirical risk of a model $f \in \mathcal{F}$ is then defined:

$$R_{\mathrm{emp}}(f) := \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)), \quad (6)$$

and a model in $\mathcal{F}$ that minimizes this, known as an empirical risk minimizer (ERM), is defined:

$$\hat{f}_{\mathrm{erm}} \in \operatorname*{arg\,inf}_{f \in \mathcal{F}} R_{\mathrm{emp}}(f). \quad (7)$$

Note that $\hat{f}_{\mathrm{erm}}$ is a random variable, as it is dependent on $D$. The empirical risks are the same for any two ERMs, but their population risks may be different. We can now cover the specifics for the two decompositions.

2.2 The Approximation-Estimation decomposition

The approximation-estimation decomposition is a seminal observation from the 1970s work of Vapnik and Chervonenkis, reviewed in Vapnik (1999). An excellent historical account can be found in Bottou (2013). The result deals with the excess risk $R(\hat{f}_{\mathrm{erm}}) - R(y^*)$, i.e. the risk of $\hat{f}_{\mathrm{erm}}$ above that of the Bayes model, $y^*$. The approximation-estimation decomposition, applicable for any loss $\ell$, breaks this into two terms:

$$\underbrace{R(\hat{f}_{\mathrm{erm}}) - R(y^*)}_{\text{excess risk}} = \underbrace{R(\hat{f}_{\mathrm{erm}}) - R(f^*)}_{\text{estimation error}} + \underbrace{R(f^*) - R(y^*)}_{\text{approximation error}}. \quad (8)$$

The approximation error is the additional risk due to using a restricted family $\mathcal{F}$, rather than the space of all functions $\mathcal{F}_{\mathrm{all}}$. This is a systematic quantity, not dependent on any particular data sample. The estimation error is the additional risk due to our finite training data, when trying to find $f^* \in \mathcal{F}$. This is a random variable, dependent on the particular data sample.

There is a natural trade-off (see Figure 1, left) as we change the size of $\mathcal{F}$, keeping data size fixed. As we increase $|\mathcal{F}|$, approximation error will likely decrease (potentially to zero, if $y^* \in \mathcal{F}$), but estimation error will increase, as it becomes harder to find $f^*$ in the larger space. The reason behind this is, in effect, the classical multiple hypothesis testing problem: we cannot reliably distinguish many hypotheses when our dataset is small.
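Since Equation 8 is a telescoping identity, it can be checked numerically in even the simplest setting. The following sketch (our own illustration, not from the paper) uses a hypothetical setup: constant predictors under squared loss, with $x \sim \mathrm{Uniform}(0,1)$ and $y = x$ plus Gaussian noise, so all population risks have closed forms.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5            # noise standard deviation
n = 20                 # training-set size

# Population risk of a constant predictor c under squared loss,
# for x ~ Uniform(0,1), y = x + noise:  R(c) = sigma^2 + Var(x) + (E[x] - c)^2
def risk(c):
    return sigma**2 + 1/12 + (0.5 - c)**2   # Var(x) = 1/12, E[x] = 0.5

bayes_risk = sigma**2          # R(y*) for the Bayes model y*(x) = x
f_star = 0.5                   # best constant predictor = E[x]

# One training sample -> the ERM over constants is the sample mean of y
x = rng.uniform(0, 1, n)
y = x + rng.normal(0, sigma, n)
f_erm = y.mean()

excess        = risk(f_erm) - bayes_risk
estimation    = risk(f_erm) - risk(f_star)
approximation = risk(f_star) - bayes_risk
assert np.isclose(excess, estimation + approximation)   # Equation 8
assert np.isclose(approximation, 1/12)  # Var(x): the cost of using constants
```

The approximation error here is fixed at Var(x), while the estimation error varies with the drawn sample, mirroring the systematic/random distinction made in the text.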
Bottou & Bousquet (2007) extended Equation 8, recognising that it is often intractable to find a global minimum of the training risk, and we may only have a sub-optimal model $\hat{f}$. An additional risk component then emerges, and the excess risk of $\hat{f}$ now decomposes into a sum of optimisation error, estimation error, and approximation error:

$$\underbrace{R(\hat{f}) - R(y^*)}_{\text{excess risk of } \hat{f}} = \underbrace{R(\hat{f}) - R(\hat{f}_{\mathrm{erm}})}_{\text{optimisation error}} + \underbrace{R(\hat{f}_{\mathrm{erm}}) - R(f^*)}_{\text{estimation error}} + \underbrace{R(f^*) - R(y^*)}_{\text{approximation error}}. \quad (9)$$

These three terms describe the learning process in abstract form: accounting respectively for the choice of learning algorithm, the quality/amount of data, and the capacity of the model family.

2.3 The Bias-Variance decomposition

A bias-variance decomposition involves the expected risk of a trained model $\hat{f}$, where the expectation $\mathbb{E}_D$ is over the random variable $D \sim P(x, y)^n$, i.e., all possible training sets of a fixed size $n$. Focusing on a squared loss, and $y \in \mathbb{R}$, Geman et al. (1992) showed:

$$\underbrace{\mathbb{E}_D\big[\mathbb{E}_{xy}[(y - \hat{f}(x))^2]\big]}_{\text{expected risk}} = \underbrace{\mathbb{E}_{xy}\big[(y - y^*)^2\big]}_{\text{noise}} + \underbrace{\mathbb{E}_x\big[\big(y^* - \mathbb{E}_D[\hat{f}(x)]\big)^2\big]}_{\text{bias}} + \underbrace{\mathbb{E}_x\big[\mathbb{E}_D[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2]\big]}_{\text{variance}} \quad (10)$$

where $y^* = \mathbb{E}_{y|x}[y]$ is the Bayes-optimal prediction at each point $x$. The bias is a systematic component, independent of any particular training sample, and commonly regarded as measuring the strength of a model. The variance measures the sensitivity of $\hat{f}$ to changes in the training sample, independent of the true label $y$. The noise is a constant, independent of any model parameters.

There is again a perceived trade-off with these terms (see Figure 1, right). As the size of the (un-regularised) model increases: bias tends to decrease, and variance tends to increase. However, the trade-off can be more complex (e.g. with over-parameterized models) and the exact dynamics are an open research issue.

Bias-Variance decompositions hold for more than just squared loss.
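Equation 10 can likewise be made concrete with a small simulation (our own sketch, not from the paper), using the same hypothetical setup as before: a learner that predicts the training-sample mean of $y$, for $x \sim \mathrm{Uniform}(0,1)$ and $y = x$ plus Gaussian noise. The identity holds exactly over the empirical distribution of training sets.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, M = 0.5, 20, 2000    # noise std, sample size, number of training sets

# x ~ Uniform(0,1), y = x + noise; the learner predicts the constant ybar.
# Population risk of a constant c:  R(c) = sigma^2 + 1/12 + (0.5 - c)^2
models = np.array([(rng.uniform(0, 1, n) + rng.normal(0, sigma, n)).mean()
                   for _ in range(M)])

expected_risk = np.mean(sigma**2 + 1/12 + (0.5 - models)**2)
noise    = sigma**2                            # E_xy[(y - y*)^2]
centroid = models.mean()                       # E_D[fhat], the squared-loss centroid
bias     = 1/12 + (0.5 - centroid)**2          # E_x[(y*(x) - E_D[fhat])^2]
variance = np.mean((models - centroid)**2)     # E_D[(fhat - E_D[fhat])^2]

assert np.isclose(expected_risk, noise + bias + variance)   # Equation 10
```

Here the bias never falls below Var(x) = 1/12 (the constant predictor cannot track $y^*(x) = x$), while the variance shrinks at rate roughly 1/n.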
In fact the same form holds for several other losses, including the broad family of Bregman divergences (Bregman, 1967).

Definition 1 (Bregman divergence) For a convex set $\mathcal{Y} \subseteq \mathbb{R}^k$, let $\phi : \mathcal{Y} \to \mathbb{R}$ be a strictly convex function, continuously differentiable on the interior of $\mathcal{Y}$, which is assumed non-empty. The Bregman divergence $B_\phi : \mathcal{Y} \times \mathrm{ri}(\mathcal{Y}) \to [0, \infty)$ is defined, for points $p \in \mathcal{Y}$ and $q \in \mathrm{ri}(\mathcal{Y})$, as

$$B_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla\phi(q),\, p - q \rangle. \quad (11)$$

The function $\phi$ is conventionally referred to as a generator function, the choice of which leads to different well-known losses, e.g. $\phi(y) = y^2$ gives $B_\phi(y, f) = (y - f)^2$. We refer the reader to Banerjee et al. (2005b) for an excellent tutorial on Bregman divergences.

Key to understanding generalised bias-variance decompositions is the notion of a centroid. Nielsen & Nock (2009) provide a thorough characterisation for the centroids of Bregman divergences.

Definition 2 (Left-sided Bregman centroid) Assume a Bregman divergence $B_\phi : \mathcal{Y} \times \mathrm{ri}(\mathcal{Y}) \to [0, \infty)$, with generator $\phi : \mathcal{Y} \to \mathbb{R}$. Define a set of models of the form $\hat{f} : \mathcal{X} \to \mathrm{ri}(\mathcal{Y})$, induced by a random variable $D$; then the left-sided Bregman centroid of the model distribution is:

$$\tilde{f}_\phi(x) := \operatorname*{arg\,min}_{z \in \mathcal{Y}} \mathbb{E}_D\big[B_\phi(z, \hat{f}(x))\big] = [\nabla\phi]^{-1}\Big(\mathbb{E}_D\big[\nabla\phi(\hat{f}(x))\big]\Big). \quad (12)$$

If we choose $\phi(y) = y^2$, then $B_\phi(y, f) = (y - f)^2$, and $\tilde{f}_\phi(x) = \mathbb{E}_D[\hat{f}(x)]$, but this is not always the case. The left¹ centroid is in general a quasi-arithmetic mean, is unique (Nielsen & Nock, 2009, Theorem 3.2), and guaranteed to exist (Nielsen & Nock, 2020, Theorem 1). Examples of Bregman centroids are below.

Table 1: Examples of Bregman divergences, with corresponding left centroids.

| Name | Domain | $B_\phi(y, \hat{f}(x))$ | Centroid |
|---|---|---|---|
| Squared | $y \in \mathbb{R}$ | $(y - \hat{f}(x))^2$ | $\mathbb{E}_D[\hat{f}(x)]$ |
| KL-divergence | $y \in \mathbb{R}^k$, s.t. $\sum_c y_c = 1$ | $D_{\mathrm{KL}}(y \,\|\, \hat{f}(x))$ | $Z^{-1} \exp\big(\mathbb{E}_D[\ln \hat{f}(x)]\big)$ |
| Itakura-Saito | $y \in [0, \infty)$ | $\frac{y}{\hat{f}(x)} - \ln\frac{y}{\hat{f}(x)} - 1$ | $1 \,/\, \mathbb{E}_D[\hat{f}(x)^{-1}]$ |

A bias-variance decomposition for Bregman divergences was shown by Pfau (2013), taking the form:

$$\underbrace{\mathbb{E}_D\big[\mathbb{E}_{xy}[B_\phi(y, \hat{f}(x))]\big]}_{\text{expected risk}} = \underbrace{\mathbb{E}_{xy}\big[B_\phi(y, y^*)\big]}_{\text{noise}} + \underbrace{\mathbb{E}_x\big[B_\phi(y^*, \tilde{f}_\phi(x))\big]}_{\text{bias}} + \underbrace{\mathbb{E}_x\big[\mathbb{E}_D[B_\phi(\tilde{f}_\phi(x), \hat{f}(x))]\big]}_{\text{variance}} \quad (13)$$

where $\tilde{f}_\phi(x)$ is the left centroid, and $y^* = \mathbb{E}_{y|x}[y]$ is the Bayes model (Banerjee et al., 2005b, Prop. 1). We note that given $\hat{f} : \mathcal{X} \to \mathrm{ri}(\mathcal{Y})$, and the fact that $\nabla\phi$ is a homeomorphism (by the invariance of domain theorem), this implies $\tilde{f}_\phi(x) \in \mathrm{ri}(\mathcal{Y})$.

The bias/variance terms take different functional forms for each loss. This has a consequence for nomenclature: the term in Equation 10 is sometimes called "squared bias". But the square is an artefact from using squared loss, not present in other cases, hence we use simply "bias". Note that the KL example implies a decomposition for the cross-entropy, since the two differ only by a constant. It is interesting to note that generalised decompositions only appeared in the ML community with Heskes (1998), but the idea seems to be known much earlier in statistics, e.g., Hastie & Tibshirani (1986, Eq. 19).

Bias-Variance decompositions do not hold for all losses. The approximation-estimation decomposition, Equation 9, applies for any loss. This is not the case for the bias-variance decomposition. For example, the form of Equation 13 does not hold for the 0/1 loss: in this case, the variance term becomes dependent on the label distribution (Friedman, 1997). Several authors proposed alternative decompositions (Wolpert, 1997; James & Hastie, 1997; Heskes, 1998; Domingos, 2000). It is interesting to note that the concept of the loss centroid also occurs in this literature, referred to as the "systematic" or "main prediction" (Geurts, 2002). The necessary and sufficient conditions for such a decomposition are an open research question.
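As an illustration of Definition 2 and Table 1 (our own sketch, with arbitrary example predictions), the closed form $[\nabla\phi]^{-1}(\mathbb{E}_D[\nabla\phi(\hat{f})])$ for the Itakura-Saito centroid, i.e. the harmonic mean, can be checked against a direct numerical minimisation of $z \mapsto \mathbb{E}_D[B_\phi(z, \hat{f})]$:

```python
import numpy as np

# Itakura-Saito divergence, generated by phi(y) = -ln(y)
def B_is(p, q):
    return p / q - np.log(p / q) - 1

preds = np.array([0.5, 1.0, 4.0])        # predictions from three "models"

# grad phi(y) = -1/y, so [grad phi]^{-1}(E[grad phi(f)]) is the harmonic mean
left_centroid = 1 / np.mean(1 / preds)

# The left centroid minimises z -> E[B_phi(z, f)]; check against a fine grid.
grid = np.linspace(0.01, 5, 20000)
objective = np.array([np.mean(B_is(z, preds)) for z in grid])
best = grid[objective.argmin()]
assert abs(best - left_centroid) < 1e-2

# It differs from the arithmetic mean, which is the squared-loss centroid.
assert not np.isclose(left_centroid, preds.mean())
```

This makes the point in the text concrete: only for squared loss does the centroid reduce to the familiar average $\mathbb{E}_D[\hat{f}(x)]$.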
2.4 Summary

These decompositions are conceptual tools to describe the nature of model fitting. They are by no means perfect reflections of the process, most especially in the context of over-parameterized models (Nagarajan & Kolter, 2019; Zhang et al., 2021). However, it is extremely common to see papers making the incorrect assumption/claim that the two are equivalent, or that one is a special case of the other. Our purpose with this work is to correct these false assumptions, identifying precisely how the two connect.

¹ Note that this is called a left centroid because the minimization is over the first (left-hand) argument. The right centroid can be similarly defined by minimizing over the second argument, turning out to be simply $\mathbb{E}_D[\hat{f}(x)]$ for any valid $\phi$ (Banerjee et al., 2005a), which explains why the Bayes-optimal prediction is $y^* = \mathbb{E}_{y|x}[y]$ for any Bregman divergence.

3 Bias/Variance is not the same as Approximation/Estimation

By now it should be evident that these decompositions are related, but are not quite the same thing. Perhaps the most obvious difference is that they are on different quantities: the excess risk of an ERM, versus the expected risk of an arbitrary trained model. We now build a bridge between the two, using Bregman divergences to include a wide range of losses. We first define a concept, building on Definition 2, that we will refer to repeatedly in the coming sections: the "centroid model".

Definition 3 (Centroid model) For a model $\hat{f}$ dependent on a random variable $D$, the centroid model $\tilde{f}_\phi$ is the aggregate model formed by taking the left Bregman centroid prediction at each possible $x$.

Note that whilst by definition $\tilde{f}_\phi \in \mathcal{F}_{\mathrm{all}}$, there is no guarantee that $\tilde{f}_\phi \in \mathcal{F}$. We now observe that the estimation error involves $R(\hat{f}_{\mathrm{erm}})$, making it a random variable dependent on $D$. We take the expectation with respect to $D$ and separate it into two, using the risk of the centroid model.
Definition 4 (Estimation Bias, and Estimation Variance) For a Bregman divergence $B_\phi$, the expected estimation error can be decomposed to expose two terms: the estimation bias, and estimation variance.

$$\underbrace{\mathbb{E}_D\big[R(\hat{f}_{\mathrm{erm}}) - R(f^*)\big]}_{\text{expected estimation error}} = \underbrace{\mathbb{E}_D\big[R(\hat{f}_{\mathrm{erm}}) - R(\tilde{f}_\phi)\big]}_{\text{estimation variance}} + \underbrace{R(\tilde{f}_\phi) - R(f^*)}_{\text{estimation bias}}. \quad (14)$$

The estimation variance measures the random variations of $\hat{f}_{\mathrm{erm}}$ around the centroid model. The estimation bias measures the systematic difference between the centroid model and the best-in-family model. Using these concepts, we can present the relation between the two decompositions.

Theorem 1 (Bias-Variance in terms of Approximation-Estimation) Given a Bregman divergence $B_\phi(y, f(x))$, the following decomposition of the bias and variance applies.

$$\underbrace{\mathbb{E}_x\big[B_\phi(y^*, \tilde{f}_\phi(x))\big]}_{\text{bias}} = \underbrace{R(f^*) - R(y^*)}_{\text{approximation error}} + \underbrace{R(\tilde{f}_\phi) - R(f^*)}_{\text{estimation bias}} \quad (15)$$

$$\underbrace{\mathbb{E}_x\big[\mathbb{E}_D[B_\phi(\tilde{f}_\phi(x), \hat{f}(x))]\big]}_{\text{variance}} = \underbrace{\mathbb{E}_D\big[R(\hat{f}) - R(\hat{f}_{\mathrm{erm}})\big]}_{\text{optimisation error}} + \underbrace{\mathbb{E}_D\big[R(\hat{f}_{\mathrm{erm}}) - R(\tilde{f}_\phi)\big]}_{\text{estimation variance}} \quad (16)$$

This confirms the premise of our work. Bias is not approximation error, and variance is not estimation error. It is not even the case that one is a special case of the other, as is sometimes stated. The true relation is more subtle. The approximation error is in fact just one component of the bias, and the estimation error contributes to both bias and variance. The theorem above is illustrated in Figure 2.

[Figure 2 shows the expected risk broken into noise, bias, and variance, with the bias split into approximation error plus estimation bias, and the variance into optimisation error plus estimation variance.]

Figure 2: Illustration of Theorem 1. The bias is only partly determined by approximation error (i.e. choice of model), while the rest is due to expected estimation error (i.e. choice of data). Similarly, variation in data accounts for only part of the variance, and the rest is due to optimisation error (i.e. choice of algorithm).
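Theorem 1 can be checked end to end with a small simulation (our own sketch, not from the paper). We reuse the hypothetical constant-predictor setup from Section 2, but deploy a shrinkage learner $\hat{f} = c\bar{y}$ with $c < 1$, so that the optimisation error and estimation bias are both non-zero; the ERM remains the sample mean $\bar{y}$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, M, c = 0.5, 20, 2000, 0.8   # c < 1: a shrinkage learner, not the ERM

def risk(z):                           # population risk of constant predictor z
    return sigma**2 + 1/12 + (0.5 - z)**2

erm  = np.array([(rng.uniform(0, 1, n) + rng.normal(0, sigma, n)).mean()
                 for _ in range(M)])   # ERMs: sample means of y
fhat = c * erm                         # the model actually deployed
ftilde = fhat.mean()                   # squared-loss centroid model

bias     = 1/12 + (0.5 - ftilde)**2    # E_x[(y*(x) - ftilde)^2]
approx   = risk(0.5) - sigma**2        # R(f*) - R(y*) = Var(x) = 1/12
est_bias = risk(ftilde) - risk(0.5)    # here > 0, due to the shrinkage
assert np.isclose(bias, approx + est_bias)          # Equation 15

variance = np.mean((fhat - ftilde)**2)
opt_err  = np.mean(risk(fhat) - risk(erm))          # cost of not using the ERM
est_var  = np.mean(risk(erm)) - risk(ftilde)
assert np.isclose(variance, opt_err + est_var)      # Equation 16
```

With $c = 1$ (i.e. actually using the ERM) the optimisation error vanishes and the variance reduces to the estimation variance alone.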
4 Discussion

A simplistic description of bias and variance would say they are the error due to the model (bias) and the error due to the data (variance). Theorem 1 shows there is more nuance to understand. We now discuss the subtleties and implications of these results. For ease of referral we denote: $E_{\mathrm{app}}$ (approximation error), $E_{\mathrm{est}}$ (expected estimation error), $E_{\mathrm{est(b)}}$ (estimation bias), and $E_{\mathrm{est(v)}}$ (estimation variance).

4.1 The case of linear least squares

Assume a linear model $\hat{f}(x) = x^T\hat{\beta}$, and a squared loss. The parameters $\hat{\beta}$ are fit using a dataset $\{X, y\}$, where $X$ is an $n \times d$ matrix, and $y$ is a column vector of length $n$. As before, this training data is a realisation of the random variable $D$. The ridge regression solution is $\hat{\beta}_\lambda := [X^TX + \lambda I]^{-1}X^Ty$, where $\lambda = 0$ gives OLS. The best-in-family model $f^*$, with $\mathcal{F} = \{f(x) = x^T\beta \mid \beta \in \mathbb{R}^d\}$, uses parameters $\beta^*$ which minimise the squared risk, i.e., $\beta^* := \arg\min_{\beta} \mathbb{E}_{xy}[(y - x^T\beta)^2] = \mathbb{E}_x[xx^T]^{-1}\mathbb{E}_{xy}[xy]$. Furthermore, due to the unbiasedness of OLS, $\mathbb{E}_D[\hat{\beta}_0] = \beta^*$. In this scenario, the bias-variance decomposition is:

$$\underbrace{\mathbb{E}_D\big[\mathbb{E}_{xy}[(y - x^T\hat{\beta}_\lambda)^2]\big]}_{\text{expected risk}} = \underbrace{\mathbb{E}_{xy}\big[(y - y^*)^2\big]}_{\text{noise}} + \underbrace{\mathbb{E}_x\big[(y^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\big]}_{\text{bias}} + \underbrace{\mathbb{E}_x\big[\mathbb{E}_D[(x^T\hat{\beta}_\lambda - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2]\big]}_{\text{variance}} \quad (17)$$

Hastie et al. (2017, Eq 7.14) describe² how the bias term decomposes more finely:

$$\underbrace{\mathbb{E}_x\big[(y^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\big]}_{\text{bias}} = \underbrace{\mathbb{E}_x\big[(y^* - x^T\beta^*)^2\big]}_{\text{Hastie's model bias}} + \underbrace{\mathbb{E}_x\big[(x^T\beta^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\big]}_{\text{Hastie's estimation bias}} \quad (18)$$

Though they refer to the first term as "model bias", it turns out to be exactly equal to the approximation error for a linear model. Similarly, their "estimation bias", whilst written differently, is exactly the estimation bias we have defined, but for a linear model. These observations are formalized in the following theorem.
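Equation 18 can be verified empirically (our own sketch, not from the paper) with a hypothetical design: $x$ uniform over four points, a quadratic truth so the linear family is misspecified, and ridge estimates averaged over many training sets. The cross term vanishes because the population residual $y^* - x^T\beta^*$ is orthogonal to the column space of the design, and the averaged ridge predictor is still linear.

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.array([0., 1., 2., 3.])        # x uniform on four points
X  = np.c_[np.ones(4), xs]             # linear features (intercept, x)
ystar = xs**2                          # nonlinear Bayes predictions

# Best-in-family: population least squares over the four equally likely points
beta_star, *_ = np.linalg.lstsq(X, ystar, rcond=None)

# Ridge estimates over many training sets
lam, n, M = 0.1, 30, 3000
betas = []
for _ in range(M):
    idx = rng.integers(0, 4, n)
    Xi, yi = X[idx], ystar[idx] + rng.normal(0, 0.5, n)
    betas.append(np.linalg.solve(Xi.T @ Xi + lam * np.eye(2), Xi.T @ yi))
beta_bar = np.mean(betas, axis=0)      # E_D[beta_hat_lambda]

bias    = np.mean((ystar - X @ beta_bar)**2)
model_b = np.mean((ystar - X @ beta_star)**2)      # = approximation error
est_b   = np.mean((X @ beta_star - X @ beta_bar)**2)
assert np.isclose(bias, model_b + est_b)           # Hastie's split, Eq (18)
```

With $\lambda = 0$ and unbiased OLS, est_b shrinks toward zero as M grows, recovering the special case discussed below.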
Theorem 2 The "model bias/estimation bias" decomposition (Equation 18) is a special case of our bias decomposition (Equation 15), for the specific case of a linear model with squared loss.

We also note that, due to the unbiasedness of OLS, we have both $E_{\mathrm{opt}} = 0$ and $E_{\mathrm{est(b)}} = 0$. Thus, for this special case, bias is equal to the approximation error, and variance is equal to (what remains of) the estimation error. The decompositions are (numerically) equivalent in the OLS scenario, but in general they are very different. This is most clear in the general-case behaviour of the estimation bias, discussed next.

4.2 The bias is a flawed proxy for model capacity.

It is common to assume the bias is an indication of how simple/complex a model is: expected to be lower if the model has higher "capacity". But what is "model capacity"? If we define it as the ability to minimize population risk, then the ultimate measure of model capacity is the approximation error. We see in Equation 15 that the bias contains exactly this, but also the estimation bias, which gives it some surprising dynamics.

As detailed above, Hastie et al.'s observations were restricted to the linear case for analytic tractability. The linear model assumption meant that they were unable to observe a critical fact: that in the general case, the estimation bias $E_{\mathrm{est(b)}} = R(\tilde{f}_\phi) - R(f^*)$ can take negative values, i.e.

$$\text{bias} = \underbrace{\text{approximation error}}_{\text{always } \geq\, 0} + \underbrace{\text{estimation bias}}_{\text{can be negative}} \quad (19)$$

To understand how this can be, we must accept the somewhat non-intuitive idea that the centroid model can be outside the hypothesis class $\mathcal{F}$, and thus we can have $R(\tilde{f}_\phi) < R(f^*)$. This can be trivially illustrated with a simple regression stump evaluated by squared loss, in Figure 3.

² It is likely that Hastie et al. were not the first to observe this, but it is a commonly known reference for the statement.
[Figure 3 residue: Stump 1 splits at x = 1.6 with leaf values y = 6.1 and y = 9; Stump 2 splits at x = 2.1 with leaf values y = 7 and y = 9.8.]

Figure 3: Two regression stumps (red/blue lines), and their centroid model (black line, arithmetic mean). Notice the centroid model is outside the hypothesis class, i.e. it cannot be represented as a single binary stump. As a result, the centroid model fits the data better than any $f \in \mathcal{F}$, and $E_{\mathrm{est(b)}}$ is negative.

The possibility of negative values here has significant implications. There are two ways in which bias can be zero. If $\mathcal{F}$ contains the Bayes model, then we might have $E_{\mathrm{app}} = E_{\mathrm{est(b)}} = 0$. But there is another way. For some $\epsilon > 0$, we might have $E_{\mathrm{app}} = \epsilon$ and $E_{\mathrm{est(b)}} = -\epsilon$. In this case, the model family does not have sufficient capacity, since $E_{\mathrm{app}} > 0$. And yet, the bias is zero. Hence, the bias is a flawed proxy for the true model capacity.

To illustrate this, we show experiments on a synthetic problem (details in Appendix B). Figure 4 shows results from increasing the depth of a decision tree. The left panel shows excess risk, and the bias/variance components. We observe the classical bias/variance trade-off, including overfitting, as the depth increases beyond a certain point. It is notable that the bias decreases to zero, after depth 6. Does this imply the model is "unbiased", in the sense that it has sufficient capacity to capture the full data distribution?

Figure 4: Risk components as we increase the depth of a regression tree.

The answer is no. A decomposition of the bias into two components (right panel) shows that $E_{\mathrm{app}}$ is non-zero, i.e. the best possible model cannot achieve zero testing error. The cause of the bias going to zero is that $E_{\mathrm{est(b)}}$ is negative, hence the bias is not a good proxy for the true model capacity. Similar results are obtained with a k-NN regression (Figure 5), where increasing complexity corresponds to decreasing k.
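The phenomenon of Figure 3 can be reproduced in a few lines (our own sketch, with made-up numbers rather than the figure's values): the arithmetic-mean centroid of two stumps is a three-level step function, outside the stump class, and can fit the data better than any single stump.

```python
import numpy as np

# Tiny dataset where the centroid of two stumps beats every single stump.
xs = np.array([0., 1., 2.])
ys = np.array([0., 1., 2.])            # noiseless linear target

def stump_preds(split, x=xs, y=ys):
    """Least-squares regression stump: predict each side's mean of y."""
    left = x < split
    p = np.empty_like(y)
    p[left]  = y[left].mean()  if left.any()  else y.mean()
    p[~left] = y[~left].mean() if (~left).any() else y.mean()
    return p

def mse(p):
    return np.mean((p - ys)**2)

# Best achievable risk over all single stumps (all distinct splits)
best_single = min(mse(stump_preds(s)) for s in [-0.5, 0.5, 1.5, 2.5])

# Centroid (arithmetic mean, for squared loss) of the two useful stumps
centroid = 0.5 * (stump_preds(0.5) + stump_preds(1.5))

# The centroid is a 3-level function, outside the stump class, and fits
# better than any stump: the estimation bias R(centroid) - R(f*) is < 0.
assert mse(centroid) < best_single
```

Here best_single is 1/6 while the centroid achieves 1/24, so the estimation bias term is strictly negative, exactly the mechanism Equation 19 warns about.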
Figure 5: Risk components as we decrease the number of neighbours in a k-NN.

We can formally characterise this phenomenon by studying the geometry of the hypothesis class $\mathcal{F}$. In particular, if the set $\mathcal{F}$ is dual-convex (Amari, 2008, Equation 32) with respect to $\phi$, then $\tilde{f}_\phi \in \mathcal{F}$, and hence the estimation bias is guaranteed to be non-negative.

Theorem 3 (Sufficient condition for a non-negative estimation bias.) If the hypothesis class $\mathcal{F}$ is dual-convex, then the estimation bias is non-negative.

A simple example of a non-dual-convex set is the class of regression stumps evaluated by squared loss, where $\tilde{f}_\phi(x) = \mathbb{E}_D[\hat{f}(x)]$, illustrated in Figure 3. A simple example of a dual-convex set is the class of Generalized Linear Models evaluated by their corresponding deviance measure.

Theorem 4 (GLMs have non-negative estimation bias.) For a Bregman divergence with generator function $\phi$, define $\mathcal{F}$ as the set of all GLMs with inverse link $[\nabla\phi]^{-1}$ and natural parameters $\theta \in \mathbb{R}^d$. Then, the estimation bias is non-negative.

An example of this would be a logistic regression, $\hat{f}(x) = [\nabla\phi]^{-1}(\hat{\theta}^Tx) = 1/(1 + \exp(-\hat{\theta}^Tx))$, which results from $\phi(f) = f \ln f + (1 - f)\ln(1 - f)$; the binary KL is the corresponding Bregman divergence.

4.3 The estimation variance plays a role in double descent.

In recent literature, an increasing degree of over-parameterisation has been associated with a peaking trend in the variance (Nakkiran, 2019; Yang et al., 2020), ultimately causing a double descent in the risk.

Figure 6: Illustration of double descent, caused by a peaking variance (red line) and monotonically decreasing bias (blue line). Image credit: Yang et al. (2020).

Such models often fit their training data perfectly (Belkin et al., 2019; Zhang et al., 2021), i.e., they interpolate the data.
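The logistic example above gives a concrete picture of why the GLM class is closed under the centroid operation (our own sketch, not from the paper): the left binary-KL centroid averages predictions in natural-parameter (logit) space, and the average of linear natural parameters is again a linear natural parameter.

```python
import numpy as np

def sigmoid(t):                        # inverse link [grad phi]^{-1} for binary KL
    return 1 / (1 + np.exp(-t))

def logit(p):                          # grad phi(p) = ln(p / (1 - p))
    return np.log(p / (1 - p))

rng = np.random.default_rng(4)
thetas = rng.normal(size=(5, 3))       # five fitted logistic models, d = 3
x = rng.normal(size=3)                 # a query point

# Left binary-KL centroid: map predictions through grad phi, average, map back
preds = sigmoid(thetas @ x)
centroid = sigmoid(np.mean(logit(preds)))

# Because logistic models are linear in logit space, the centroid is itself
# the logistic model with the averaged natural parameters: it stays in F.
assert np.isclose(centroid, sigmoid(thetas.mean(axis=0) @ x))
```

Since the centroid model remains inside $\mathcal{F}$, its risk cannot undercut $R(f^*)$, which is the intuition behind Theorem 4.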
If we consider this in the context of Equation 16, we see that:

$$\text{variance} = \underbrace{\text{optimisation error}}_{\approx\, 0 \text{ for interpolating models}} + \text{estimation variance}$$

i.e., the optimisation error is close to zero. This observed peaking variance must therefore be primarily due to the estimation variance, $E_{\mathrm{est(v)}}$. Furthermore, very deep models are likely to be able to fit any function, i.e., their approximation error is zero. In these scenarios, the only terms remaining in the expected risk are $E_{\mathrm{est(b)}}$ and $E_{\mathrm{est(v)}}$. Why such models can push training error to zero, even on random labels, and still generalise well, remains an open question for modern machine learning (Zhang et al., 2021). Overall, we believe this warrants further study in the context of deep models.

4.4 New insights into the bias/variance trade-off.

In recent years, the relevance (and even existence) of a trade-off between bias and variance has been debated, with voices both against (Neal et al., 2018; Dar et al., 2021) and in favour (Witten, 2020). However, with estimation bias and estimation variance, in certain circumstances, we observe that the trade-off is an indisputable fact.

We first note that $E_{\mathrm{est(b)}} + E_{\mathrm{est(v)}} = E_{\mathrm{est}} \geq 0$. Thus, when one of these quantities is negative, the other is forced to be positive, and lower bounded by the magnitude of the negative term. For example, when the estimation bias is negative (e.g. Figure 3), it obviously reduces the bias. However, this implies that the variance increases, since $E_{\mathrm{est(v)}} \geq -E_{\mathrm{est(b)}}$, to satisfy the constraint. This is not the case for Hastie et al.'s terms, using a linear model / squared loss, since this is a (generalized) linear model with its corresponding deviance measure, as in Theorem 4. However, negative estimation bias is entirely possible in other scenarios, implying the variance (and indeed overall expected risk) is positive. This is an unavoidable trade-off.
Clearly, other components of the bias/variance may mask this behaviour, making it less obvious.

4.5 What if a bias-variance decomposition doesn't hold?

As mentioned earlier, the form of Equation 13 does not hold for all losses, e.g., 0/1 loss. Many authors have proposed alternative 0/1 decompositions, each of which makes compromises and has different properties. The decomposition proposed by James (2003) applies for several losses, and links neatly to our work. They limit their commentary to symmetric losses; however, their conclusions do apply more generally, as we will see. As in our work, they rely on the notion of a centroid³ model. For an arbitrary symmetric loss $\ell$, they could have defined the centroid in two ways: minimizing over the left or right argument:

$$\tilde{f}_{\mathrm{LEFT}}(x) := \operatorname*{arg\,min}_{z \in \mathcal{Y}} \mathbb{E}_D\big[\ell(z, \hat{f}(x))\big], \qquad \tilde{f}_{\mathrm{RIGHT}}(x) := \operatorname*{arg\,min}_{z \in \mathcal{Y}} \mathbb{E}_D\big[\ell(\hat{f}(x), z)\big]. \quad (20)$$

They chose the right centroid, which unfortunately masks the generality of their conclusions. This can be seen by instead adopting the left centroid, as we did for Bregman divergences. We now describe their framework, but using $\tilde{f}_{\mathrm{LEFT}}$. From this point onwards, we assume a loss $\ell$ such that the centroid always exists, but make no assumptions on its uniqueness. If $\ell$ is the 0/1 loss, $\tilde{f}$ is the mode of the distribution. For the absolute loss, it is the median value. If $\ell$ is a Bregman divergence, it is the left Bregman centroid, $\tilde{f}_\phi$, defined earlier. Using this, we define two terms:

$$\text{bias-effect} := R(\tilde{f}) - R(y^*), \quad (21)$$

$$\text{variance-effect} := \mathbb{E}_D\big[R(\hat{f}) - R(\tilde{f})\big]. \quad (22)$$

These quantify the effect on the risk of using one predictor versus another. The bias-effect is the change in risk for the centroid model versus the Bayes model. The variance-effect is the change in risk for a model $\hat{f}$ versus the centroid model, averaged over the distribution of $D$.
We then have the decomposition:

$$\underbrace{\mathbb{E}_D\big[R(\hat{f})\big]}_{\text{expected risk}} = \underbrace{R(y^*)}_{\text{noise}} + \underbrace{R(\tilde{f}) - R(y^*)}_{\text{bias-effect}} + \underbrace{\mathbb{E}_D\big[R(\hat{f}) - R(\tilde{f})\big]}_{\text{variance-effect}} \quad (23)$$

which can be verified by allowing terms on the right to cancel. James (2003) notes that with squared loss, the bias-effect is equal to the bias, and the variance-effect is equal to the variance: thus Equation 23 reduces to Geman et al. (1992). They state this relation holds for any symmetric loss, but in fact (as a trivial⁴ corollary to Theorem 1) we see it holds for any Bregman divergence, which are all asymmetric, except squared loss. Thus, if $\ell$ is a Bregman divergence, Equation 23 reduces to Equation 13. We can relate the terms above to the approximation-estimation decomposition, using the same overall strategy as before.

³ They refer to this as the "systematic part" of the predictor; however, we retain our terminology/notation for consistency.

⁴ For bias, allow $R(f^*)$ to cancel on the right of Equation 15, and for variance, allow $R(\hat{f}_{\mathrm{erm}})$ to cancel in Equation 16.

Proposition 1 (Bias/Variance Effects, in terms of Approximation-Estimation) For any loss $\ell$, assuming a centroid model exists, we have the following decomposition of the bias-effect and variance-effect.

$$\underbrace{R(\tilde{f}) - R(y^*)}_{\text{bias-effect}} = \underbrace{R(f^*) - R(y^*)}_{\text{approximation error}} + \underbrace{R(\tilde{f}) - R(f^*)}_{\text{estimation bias}},$$

$$\underbrace{\mathbb{E}_D\big[R(\hat{f}) - R(\tilde{f})\big]}_{\text{variance-effect}} = \underbrace{\mathbb{E}_D\big[R(\hat{f}) - R(\hat{f}_{\mathrm{erm}})\big]}_{\text{optimisation error}} + \underbrace{\mathbb{E}_D\big[R(\hat{f}_{\mathrm{erm}}) - R(\tilde{f})\big]}_{\text{estimation variance}}.$$

And the full relation to our earlier observations can be illustrated as follows.

[Figure 7 relates the quantities: bias equals bias-effect only when a bias-variance decomposition holds, with bias-effect splitting into approximation error plus estimation bias; variance equals variance-effect under the same condition, with variance-effect splitting into optimisation error plus estimation variance, the latter splits holding for any loss.]

Figure 7: Relations between several decompositions that we have considered in this work.

As mentioned, for 0/1 loss the centroid is the modal value of the predictions.
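Equation 23 can be made concrete for the 0/1 loss with a toy single-input example (our own sketch, with hypothetical probabilities): the centroid is the mode of the model's predictions over training sets, and the three terms sum exactly to the expected risk.

```python
import numpy as np

# Single input point; P(y = 1) = 0.8, so the Bayes prediction is 1.
p_y1 = 0.8
def risk01(pred):                       # expected 0/1 loss of a fixed prediction
    return p_y1 * (pred != 1) + (1 - p_y1) * (pred != 0)

# Distribution of the trained model's prediction over training sets D:
# it predicts 1 with probability 0.6, else 0.
preds = np.array([1, 0])
probs = np.array([0.6, 0.4])

centroid = preds[probs.argmax()]        # mode of the predictions = 1

expected_risk   = float(np.sum(probs * np.array([risk01(p) for p in preds])))
noise           = risk01(1)             # R(y*)
bias_effect     = risk01(centroid) - noise
variance_effect = expected_risk - risk01(centroid)

assert np.isclose(expected_risk, noise + bias_effect + variance_effect)
assert bias_effect == 0 and variance_effect > 0
```

Here the centroid agrees with the Bayes prediction, so the bias-effect is zero and all of the excess risk is attributed to the variance-effect.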
Taking the mode is effectively a weighted majority vote across the distribution of predictions from $\hat{f}$. Weighted voting classifiers have been extensively studied in the context of Boosting (Schapire, 2003), where it is well-known that a voted combination of weak (half-plane linear) models results in a non-linear decision boundary. This implies $\tilde{f} \notin \mathcal{F}$, and thus again it is possible for the estimation bias to be negative. Further characterisation of the terms in Figure 7, for the 0/1 loss, or indeed the general case of any loss, would therefore be desirable.

5 Conclusions

We analysed the precise connections between two seminal results that are often conflated: the bias-variance decomposition, and the approximation-estimation decomposition. Perhaps the most surprising aspect of this work was that it had not been explored before: two such foundational ideas, not previously connected. There are of course several excellent sources which do not conflate them, e.g., Györfi et al. (2002), but to the best of our knowledge there is no work comparing/contrasting the decompositions. In a literature review (see Appendix C), we found numerous sources stating the two were equivalent. This is false. The true relation, given by Theorem 1, is more intricate, and yielded interesting novel observations, including links to the phenomenon of double descent in deep learning.

We focused on Bregman divergences, but also briefly considered the case of more general losses, where a bias-variance decomposition does not hold, e.g., 0/1 loss. In this case the geometry of such losses is not well-understood, leaving several open issues. In all cases, the centroid model turned out to be a key mathematical object in bridging the decompositions. We conjecture that further study of this object, and its role in generalisation, may yield yet deeper and more interesting insights.

Acknowledgements

Funding in direct support of this work: EPSRC EP/N035127/1 (LAMBDA project).
References

Shun-ichi Amari. Information geometry and its applications: Convex function and dually flat manifold. In LIX Fall Colloquium on Emerging Trends in Visual Computing, pp. 75–102. Springer, 2008.

Francis Bach. Learning theory from first principles. Draft version: Dec 25th, 2023. URL https://www.di.ens.fr/~fbach/ltfp_book.pdf.

Arindam Banerjee, Xin Guo, and Hui Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, 2005a.

Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, Joydeep Ghosh, and John Lafferty. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005b.

Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. National Academy of Sciences, 116(32):15849–15854, 2019.

Léon Bottou. In Hindsight: Doklady Akademii Nauk SSSR, 181(4), 1968, pp. 3–5. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. ISBN 978-3-642-41136-6. doi: 10.1007/978-3-642-41136-6_1. URL https://doi.org/10.1007/978-3-642-41136-6_1.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20, 2007.

Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

Chih-Chieh Chen, Masaru Sogabe, Kodai Shiba, Katsuyoshi Sakamoto, and Tomah Sogabe. General Vapnik–Chervonenkis dimension bounds for quantum circuit learning. Journal of Physics: Complexity, 3(4), 2022.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.
Yehuda Dar, Vidya Muthukumar, and Richard G Baraniuk. A farewell to the bias-variance tradeoff? An overview of the theory of overparameterized machine learning. arXiv preprint arXiv:2109.02355, 2021.

Hal Daumé. A Course in Machine Learning (2nd printing, Jan 2017). Online, 2017. URL http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf.

Pedro Domingos. A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, pp. 231–238. Morgan Kaufmann Stanford, 2000.

Yann Dubois, Tatsunori Hashimoto, and Percy Liang. Evaluating self-supervised learning via risk decomposition. In International Conference on Machine Learning, volume 202, 2023.

Jianqing Fan, Cong Ma, and Yiqiao Zhong. A selective overview of deep learning. Statistical Science: a review journal of the Institute of Mathematical Statistics, 36(2):264, 2021.

Jerome H Friedman. On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55–77, 1997.

Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

Pierre Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, University of Liège, Belgium, 2002.

László Györfi, Michael Kohler, Adam Krzyzak, Harro Walk, et al. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002.

Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Statistical Science, 1(3):297–310, 1986. doi: 10.1214/ss/1177013604. URL https://doi.org/10.1214/ss/1177013604.

Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 12th printing, January 13th 2017.

Anne-Claire Haury.
Feature selection from gene expression data: molecular signatures for breast cancer prognosis and gene regulation network inference. PhD thesis, École Nationale Supérieure des Mines de Paris, December 2012. URL https://pastel.hal.science/pastel-00818345.

Tom Heskes. Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10(6):1425–1433, 1998.

Gareth James and Trevor Hastie. Generalizations of the bias/variance decomposition for prediction error. Technical report, Dept. Statistics, Stanford Univ., Stanford, CA, 1997.

Gareth M James. Variance and bias for general loss functions. Machine Learning, 51:115–135, 2003.

Tin-Yau Kwok and Dit-Yan Yeung. Use of bias term in projection pursuit learning improves approximation and convergence properties. IEEE Transactions on Neural Networks, 7(5):1168–1183, 1996.

Jonathan N Lee, George Tucker, Ofir Nachum, Bo Dai, and Emma Brunskill. Oracle inequalities for model selection in offline reinforcement learning. Advances in Neural Information Processing Systems, 35:28194–28207, 2022.

Yunwen Lei, Lixin Ding, and Wensheng Zhang. Generalization performance of radial basis function networks. IEEE Transactions on Neural Networks and Learning Systems, 26(3):551–564, 2014.

Marie H Masson, Stéphane Canu, Yves Grandvalet, and Anders Lynggaard-Jensen. Software sensor design based on empirical data. Ecological Modelling, 120(2-3):131–139, 1999.

Astrid Merckling. Unsupervised Pretraining of State Representations in a Rewardless Environment. PhD thesis, ISIR, Université Pierre et Marie Curie UMR CNRS 7222, September 2021. URL https://theses.hal.science/tel-03562230.

Somabha Mukherjee, Rohit K Patra, Andrew L Johnson, and Hiroshi Morita. Least squares estimation of a quasiconvex regression function. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 2023.

Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning.
Advances in Neural Information Processing Systems, 32, 2019.

Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv preprint arXiv:1912.07242, 2019.

Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.

Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.

Frank Nielsen and Richard Nock. Corrigendum and Addendum to: Sided and Symmetrized Bregman centroids, IEEE Transactions on Information Theory 55.6 (2009): 2882–2904. 2020. URL https://franknielsen.github.io/CorrigendumAddendumSymmetrizedBregmanCentroid.pdf.

Partha Niyogi. The informational complexity of learning from examples. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995. URL https://hdl.handle.net/1721.1/36990.

Partha Niyogi and Federico Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8(4):819–842, 1996.

David Pfau. A Generalized Bias-Variance Decomposition for Bregman Divergences. Technical report, Columbia University, 2013.

Tomaso Poggio, Steve Smale, et al. The mathematics of learning: Dealing with data. Notices of the AMS, 50(5):537–544, 2003.

Robert E Schapire. The boosting approach to machine learning: An overview. Nonlinear Estimation and Classification, pp. 149–171, 2003.

Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 1999.

Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition. Nauka, Moscow, 1974.

Ulrike Von Luxburg and Bernhard Schölkopf. Statistical learning theory: Models, concepts, and results.
In Handbook of the History of Logic, volume 10, pp. 651–706. Elsevier, 2011.

Huiyuan Wang and Wei Lin. Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off. arXiv preprint, 2023. URL https://arxiv.org/pdf/2106.04795.pdf.

Shuoyang Wang, Guanqun Cao, Zuofeng Shang, and for the Alzheimer's Disease Neuroimaging Initiative. Estimation of the mean function of functional data via deep neural networks. Stat, 10(1):e393, 2021. doi: https://doi.org/10.1002/sta4.393.

Daniela Witten. Twitter thread: The Bias-Variance Trade-Off & "DOUBLE DESCENT", 2020. URL https://x.com/daniela_witten/status/1292293102103748609. Posted 3.54am, 9th August, 2020.

David H Wolpert. On bias plus variance. Neural Computation, 9(6):1211–1243, 1997.

Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. Rethinking bias-variance trade-off for generalization of neural networks. In International Conf. on Machine Learning, 2020.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

A Proofs of Theorems

A.1 Proof of Theorem 1 (Bias-Variance in terms of Approximation-Estimation).

We wish to prove the following statements:

$$\underbrace{\mathbb{E}_x\!\left[B_\phi(y^*, \bar{f}_\phi(x))\right]}_{\text{bias}} = \underbrace{R(f^*) - R(y^*)}_{\text{approximation error}} + \underbrace{R(\bar{f}_\phi) - R(f^*)}_{\text{estimation bias}} \quad (26)$$

$$\underbrace{\mathbb{E}_x\!\left[\mathbb{E}_D\!\left[B_\phi(\bar{f}_\phi(x), \hat{f}(x))\right]\right]}_{\text{variance}} = \underbrace{\mathbb{E}_D\!\left[R(\hat{f}) - R(\hat{f}_{\mathrm{erm}})\right]}_{\text{optimisation error}} + \underbrace{\mathbb{E}_D\!\left[R(\hat{f}_{\mathrm{erm}}) - R(\bar{f}_\phi)\right]}_{\text{estimation variance}} \quad (27)$$

To show Equation 26, we note that the R(f*) terms cancel, so we just need to prove:

$$\mathbb{E}_x\!\left[B_\phi(y^*, \bar{f}_\phi(x))\right] = R(\bar{f}_\phi) - R(y^*). \quad (28)$$

The proof below builds on the Bregman 3-point property (Nielsen & Nock, 2009).
Definition (Bregman three-point identity). The Bregman three-point property states, for any p, q, r:

$$B_\phi(p, r) = B_\phi(p, q) + B_\phi(q, r) + \langle p - q,\ \nabla\phi(q) - \nabla\phi(r) \rangle \quad (29)$$

We then have the following, where we apply the three-point property to $y, \bar{f}_\phi$, with $y^*$ as the mid-point:

$$B_\phi(y, \bar{f}_\phi) = B_\phi(y, y^*) + B_\phi(y^*, \bar{f}_\phi) + \langle y - y^*,\ \nabla\phi(y^*) - \nabla\phi(\bar{f}_\phi) \rangle \quad (30)$$

Take the expected value with respect to p(y|x) and the inner product term vanishes, since $y^* = \mathbb{E}_{y|x}[y]$. Rearranging terms and further taking expectation with respect to x, we recover:

$$R(\bar{f}_\phi) - R(y^*) = \mathbb{E}_x\!\left[B_\phi(y^*, \bar{f}_\phi(x))\right] \quad (31)$$

which is the desired result, proving Equation 26. To show Equation 27, we follow a similar pattern. Take the 3-point property for $y, \hat{f}$ with $\bar{f}_\phi$ as the mid-point:

$$B_\phi(y, \hat{f}) = B_\phi(y, \bar{f}_\phi) + B_\phi(\bar{f}_\phi, \hat{f}) + \langle y - \bar{f}_\phi,\ \nabla\phi(\bar{f}_\phi) - \nabla\phi(\hat{f}) \rangle \quad (32)$$

Take the expected value with respect to D and the inner product term vanishes, since $\nabla\phi(\bar{f}_\phi) = \mathbb{E}_D[\nabla\phi(\hat{f})]$. Rearranging terms and further taking expectation over p(x), we recover:

$$\mathbb{E}_D\!\left[R(\hat{f}) - R(\bar{f}_\phi)\right] = \mathbb{E}_x\!\left[\mathbb{E}_D\!\left[B_\phi(\bar{f}_\phi(x), \hat{f}(x))\right]\right] \quad (33)$$

which is the desired result, completing the theorem.

Special case of Theorem 1 for squared loss. The following presents the special case of squared loss, included for didactic purposes due to its ubiquity and links to the results for linear models. We wish to prove the following statements:

$$\mathbb{E}_x\!\left[(\mathbb{E}_D[\hat{f}(x)] - \mathbb{E}_{y|x}[y])^2\right] = E_{app} + E_{est(b)} \quad (34)$$

$$\mathbb{E}_x\!\left[\mathbb{E}_D[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2]\right] = E_{opt} + E_{est(v)} \quad (35)$$

$$\mathbb{E}_{xy}\!\left[(y - \mathbb{E}_{y|x}[y])^2\right] = R(y^*) \quad (36)$$

To show Equation 36, we simply note that $y^* = \mathbb{E}_{y|x}[y]$, so the expression is true by definition. To show Equation 34 we note, as an intermediate step, that:

$$E_{app} + E_{est(b)} = R(f^*) - R(y^*) + R(\mathbb{E}_D[\hat{f}]) - R(f^*) = R(\mathbb{E}_D[\hat{f}]) - R(y^*). \quad (37)$$

We then have the following, again using the definition of $y^*$:

$$\begin{aligned}
R(\mathbb{E}_D[\hat{f}]) - R(y^*) &= \mathbb{E}_{xy}\!\left[(\mathbb{E}_D[\hat{f}] - y)^2\right] - \mathbb{E}_{xy}\!\left[(y - \mathbb{E}_{y|x}[y])^2\right] \\
&= \mathbb{E}_{xy}\!\left[\mathbb{E}_D[\hat{f}]^2 - 2y\,\mathbb{E}_D[\hat{f}] - \mathbb{E}_{y|x}[y]^2 + 2y\,\mathbb{E}_{y|x}[y]\right] \\
&= \mathbb{E}_x\!\left[\mathbb{E}_D[\hat{f}]^2 - 2\mathbb{E}_{y|x}[y]\,\mathbb{E}_D[\hat{f}(x)] - \mathbb{E}_{y|x}[y]^2 + 2\mathbb{E}_{y|x}[y]^2\right] \\
&= \mathbb{E}_x\!\left[\mathbb{E}_D[\hat{f}]^2 - 2\mathbb{E}_{y|x}[y]\,\mathbb{E}_D[\hat{f}(x)] + \mathbb{E}_{y|x}[y]^2\right] \\
&= \mathbb{E}_x\!\left[(\mathbb{E}_D[\hat{f}] - \mathbb{E}_{y|x}[y])^2\right],
\end{aligned}$$

which is the bias, and the desired result.
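The three-point property does the heavy lifting in both parts of the proof, and it can be sanity-checked numerically; a minimal sketch, using the illustrative scalar generator ϕ(z) = z log z (an assumption made here purely for the example, not a choice from the paper):

```python
import numpy as np

# Bregman divergence for the scalar generator phi(z) = z*log(z)
# (negative entropy), with gradient grad_phi(z) = log(z) + 1.
def phi(z):
    return z * np.log(z)

def grad_phi(z):
    return np.log(z) + 1.0

def bregman(p, q):
    return phi(p) - phi(q) - grad_phi(q) * (p - q)

# Three-point identity:
#   B(p, r) = B(p, q) + B(q, r) + <p - q, grad_phi(q) - grad_phi(r)>
p, q, r = 0.7, 0.4, 0.9
lhs = bregman(p, r)
rhs = bregman(p, q) + bregman(q, r) + (p - q) * (grad_phi(q) - grad_phi(r))
assert np.isclose(lhs, rhs)
```

The identity holds exactly for any triple of points in the domain; the assertion only checks one triple, but any positive p, q, r would do.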
To show Equation 35, we follow a similar pattern. From definitions:

$$E_{opt} + E_{est(v)} = \mathbb{E}_D\!\left[R(\hat{f}) - R(\hat{f}_{\mathrm{erm}})\right] + \mathbb{E}_D\!\left[R(\hat{f}_{\mathrm{erm}}) - R(\mathbb{E}_D[\hat{f}])\right] = \mathbb{E}_D\!\left[R(\hat{f}) - R(\mathbb{E}_D[\hat{f}])\right]. \quad (38)$$

We then have the following:

$$\begin{aligned}
\mathbb{E}_D\!\left[R(\hat{f}) - R(\mathbb{E}_D[\hat{f}])\right] &= \mathbb{E}_D\!\left[\mathbb{E}_{xy}\!\left[(\hat{f} - y)^2\right] - \mathbb{E}_{xy}\!\left[(\mathbb{E}_D[\hat{f}] - y)^2\right]\right] \\
&= \mathbb{E}_D\!\left[\mathbb{E}_{xy}\!\left[\hat{f}^2 - 2y\hat{f} - \mathbb{E}_D[\hat{f}]^2 + 2y\,\mathbb{E}_D[\hat{f}]\right]\right] \\
&= \mathbb{E}_{xy}\!\left[\mathbb{E}_D[\hat{f}^2] - 2y\,\mathbb{E}_D[\hat{f}] - \mathbb{E}_D[\hat{f}]^2 + 2y\,\mathbb{E}_D[\hat{f}]\right] \\
&= \mathbb{E}_x\!\left[\mathbb{E}_D[\hat{f}^2] - \mathbb{E}_D[\hat{f}]^2\right] \\
&= \mathbb{E}_x\!\left[\mathbb{E}_D[(\hat{f} - \mathbb{E}_D[\hat{f}])^2]\right]
\end{aligned}$$

where the final step is the standard definition of variance, giving the desired result.

A.2 Proof of Theorem 2 (Hastie's model bias / estimation bias is a special case).

Hastie et al. (2017, Equation 7.14) observe:

$$\underbrace{\mathbb{E}_x\!\left[(y^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\right]}_{\text{bias}} = \underbrace{\mathbb{E}_x\!\left[(y^* - x^T\beta^*)^2\right]}_{\text{Hastie's model bias}} + \underbrace{\mathbb{E}_x\!\left[(x^T\beta^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\right]}_{\text{Hastie's estimation bias}} \quad (39)$$

We will show this is a special case of our Equation 15, restated here for $y \in \mathbb{R}$:

$$\underbrace{\mathbb{E}_x\!\left[B_\phi(y^*, \bar{f}_\phi(x))\right]}_{\text{bias}} = \underbrace{R(f^*) - R(y^*)}_{\text{approximation error}} + \underbrace{R(\bar{f}_\phi) - R(f^*)}_{\text{estimation bias}} \quad (40)$$

We first show Hastie's model bias is the approximation error of a linear model using squared loss, i.e.

$$\underbrace{R(f^*) - R(y^*)}_{\text{approximation error (generic form)}} = \underbrace{\mathbb{E}_x\!\left[(y^* - x^T\beta^*)^2\right]}_{\text{Hastie's model bias (squared loss, linear model)}} \quad (41)$$

For any Bregman divergence $B_\phi$, the approximation error $R(f^*) - R(y^*)$ can be written as follows:

$$R(f^*) - R(y^*) = \mathbb{E}_{xy}\!\left[B_\phi(y^*, f^*(x))\right]. \quad (42)$$

i.e. the approximation error is equal to the Bregman divergence of y* from f*(x), in expectation over P(x, y). Proof sketch: use the 3-point theorem in exactly the same manner as in the proof of Theorem 1, i.e. between $y, f^*$ with $y^*$ as the mid-point:

$$B_\phi(y, f^*) = B_\phi(y, y^*) + B_\phi(y^*, f^*) + \langle y - y^*,\ \nabla\phi(y^*) - \nabla\phi(f^*) \rangle. \quad (43)$$

Then take expectation successively over P(y|x) then P(x). The inner product term is zero, since $\mathbb{E}_{y|x}[y] = y^*$. For a linear model, $f^*(x) = x^T\beta^*$. Rearrange the remaining terms: using squared loss, we have the result. The result for estimation bias is proven similarly.
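The squared-loss decomposition of Hastie et al. can also be checked by simulation; a minimal sketch, with assumed synthetic data and an arbitrary fixed vector standing in for $\mathbb{E}_D[\hat\beta_\lambda]$ (the cross term vanishes by the normal equations, so any fixed vector works):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite discrete "population": the means below are exact expectations.
n, d = 500, 3
X = rng.normal(size=(n, d))
y_star = np.sin(X[:, 0]) + X[:, 1]   # noiseless targets y* (model misspecified)

# Population least-squares solution: beta* = E[xx^T]^{-1} E[xy].
beta_star = np.linalg.solve(X.T @ X, X.T @ y_star)

# Stand-in for E_D[beta_hat_lambda]: an arbitrary fixed vector,
# since the residual y* - x^T beta* is orthogonal to all of x.
beta_bar = beta_star + rng.normal(scale=0.5, size=d)

bias = np.mean((y_star - X @ beta_bar) ** 2)           # E_x[(y* - E_D[x^T b])^2]
model_bias = np.mean((y_star - X @ beta_star) ** 2)    # Hastie's model bias
est_bias = np.mean((X @ (beta_star - beta_bar)) ** 2)  # Hastie's estimation bias

assert np.isclose(bias, model_bias + est_bias)
```

The equality is exact (up to floating point) because the expectations are taken over the finite population used to fit β*, mirroring the population-level argument in the proof.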
For squared loss, $\bar{f}_\phi = \mathbb{E}_D[x^T\hat{\beta}_\lambda]$, so we will show:

$$\underbrace{R(\bar{f}_\phi) - R(f^*)}_{\text{estimation bias (generic form)}} = \underbrace{\mathbb{E}_x\!\left[(x^T\beta^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\right]}_{\text{Hastie's estimation bias (squared loss, linear model)}} \quad (44)$$

Proof sketch: the estimation bias can be written (using the Bregman 3-point theorem) as so:

$$B_\phi(y, \bar{f}_\phi) = B_\phi(y, f^*) + B_\phi(f^*, \bar{f}_\phi) + \langle y - f^*,\ \nabla\phi(f^*) - \nabla\phi(\bar{f}_\phi) \rangle \quad (45)$$

We take squared loss and a linear model, $\hat{f}(x) = x^T\hat{\beta}_\lambda$, then take expectation over P(x, y), and we have:

$$\mathbb{E}_{xy}\!\left[(y - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\right] = \mathbb{E}_{xy}\!\left[(y - x^T\beta^*)^2\right] + \mathbb{E}_{xy}\!\left[(x^T\beta^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\right], \quad (46)$$

$$R(\bar{f}_\phi) - R(f^*) = \mathbb{E}_{xy}\!\left[(x^T\beta^* - \mathbb{E}_D[x^T\hat{\beta}_\lambda])^2\right]. \quad (47)$$

where we note that in Equation 45, the cross term $\mathbb{E}_{xy}\!\left[\langle y - x^T\beta^*,\ \nabla\phi(x^T\beta^*) - \nabla\phi(\mathbb{E}_D[x^T\hat{\beta}_\lambda]) \rangle\right] = 0$. This can be shown as follows. We note that $\nabla\phi(z) = 2z$, and $\beta^* = \mathbb{E}_x[xx^T]^{-1}\mathbb{E}_{xy}[xy]$.

$$\begin{aligned}
\mathbb{E}_{xy}\!\left[\langle y - x^T\beta^*,\ \nabla\phi(x^T\beta^*) - \nabla\phi(\mathbb{E}_D[x^T\hat{\beta}_\lambda]) \rangle\right]
&= \mathbb{E}_{xy}\!\left[(y - x^T\beta^*)\left(2x^T\beta^* - 2\,\mathbb{E}_D[x^T\hat{\beta}_\lambda]\right)\right] \\
&= 2\,\mathbb{E}_{xy}\!\left[(y - x^T\beta^*)\,x^T\beta^*\right] - 2\,\mathbb{E}_{xy}\!\left[(y - x^T\beta^*)\,\mathbb{E}_D[x^T\hat{\beta}_\lambda]\right]
\end{aligned}$$

Taking the first term, ignoring the constant 2, we have:

$$\begin{aligned}
\mathbb{E}_{xy}\!\left[(y - x^T\beta^*)\,x^T\beta^*\right] &= \mathbb{E}_{xy}\!\left[yx^T - {\beta^*}^T xx^T\right]\beta^* \quad (51) \\
&= \left(\mathbb{E}_{xy}[yx^T] - {\beta^*}^T\mathbb{E}_x[xx^T]\right)\beta^* \quad (52) \\
&= \left(\mathbb{E}_{xy}[yx^T] - \left[\mathbb{E}_x[xx^T]^{-1}\mathbb{E}_{xy}[xy]\right]^T\mathbb{E}_x[xx^T]\right)\beta^* \quad (53) \\
&= \left(\mathbb{E}_{xy}[yx^T] - \mathbb{E}_{xy}[xy]^T\,\mathbb{E}_x[xx^T]^{-1}\mathbb{E}_x[xx^T]\right)\beta^* \quad (54) \\
&= \left(\mathbb{E}_{xy}[yx^T] - \mathbb{E}_{xy}[yx^T]\right)\beta^* = 0 \quad (55)
\end{aligned}$$

Follow the same steps for the second term, with $\mathbb{E}_D[\hat{\beta}_\lambda]$ in place of $\beta^*$.

A.3 Proof of Theorem 3 (Sufficient condition for a non-negative estimation bias).

To prove Theorem 3, we demonstrate that under a certain condition, $\bar{f}_\phi \in F$, which implies $R(\bar{f}_\phi) \geq R(f^*)$, and therefore $R(\bar{f}_\phi) - R(f^*) \geq 0$. We use the following definition, due to Amari (2008, Equation 32).

Definition 5 (Dual convex set). Let ϕ be a strictly convex function. A set F is dually convex with respect to ϕ iff, for any pair of points f, g ∈ F and for all λ ∈ [0, 1]:

$$\nabla\phi^{-1}\big(\lambda\,\nabla\phi(f) + (1-\lambda)\,\nabla\phi(g)\big) \in F$$

i.e. the set F is dually convex iff it is convex in its dual coordinate representation. An arbitrary set C is convex iff for any random variable X defined over elements of C, its expectation is also in C, i.e. E[X] ∈ C.
Therefore, for a dually convex set F, the point $\mathbb{E}_D[\nabla\phi(f)]$ lies in the dual coordinate representation of F. Its primal coordinate representation, $\nabla\phi^{-1}\!\left(\mathbb{E}_D[\nabla\phi(f)]\right)$, is therefore a member of F, i.e. $\bar{f}_\phi \in F$, proving the theorem.

A.4 Proof of Theorem 4 (GLMs have non-negative estimation bias).

We demonstrate that $E_{est(b)} \geq 0$ if $\hat{f}$ is a GLM of a particular form. We give two proofs: a direct one, and one that makes use of Theorem 3.

Direct proof. The estimation bias is defined:

$$E_{est(b)} = R(\bar{f}_\phi) - R(f^*). \quad (57)$$

This involves the definition of the centroid prediction, which for a Bregman divergence is:

$$\bar{f}_\phi(x) := [\nabla\phi]^{-1}\left(\mathbb{E}_D\!\left[\nabla\phi(\hat{f}(x))\right]\right). \quad (58)$$

Given a Bregman divergence with generator ϕ, define F as the class of GLMs with inverse link $[\nabla\phi]^{-1}$, parameterised by $\theta \in \mathbb{R}^d$. In this case, each $\hat{f} \in F$ takes the form:

$$\hat{f}(x) := [\nabla\phi]^{-1}\left(\theta^T x\right), \quad (59)$$

where θ are the natural parameters. Substituting this into the centroid prediction gives us:

$$\bar{f}_\phi(x) = [\nabla\phi]^{-1}\left(\mathbb{E}_D\!\left[\nabla\phi\!\left([\nabla\phi]^{-1}(\theta^T x)\right)\right]\right) = [\nabla\phi]^{-1}\left(\mathbb{E}_D[\theta]^T x\right). \quad (60)$$

Since $\mathbb{E}_D[\theta]$ is within the convex hull of the distribution of θ induced by D, the centroid prediction is the same form of GLM as $\hat{f}(x)$, for all x, and therefore the centroid model $\bar{f}_\phi \in F$. Then, since by definition f* is the risk minimizer in F, we must have that $R(\bar{f}_\phi) \geq R(f^*)$, and therefore Equation 57 is non-negative.

Proof using Theorem 3. To show that the estimation bias is non-negative, it suffices to show that the class of GLMs of a particular form is dually convex. We verify that the property of dual convexity holds. Define F = GLMs with inverse link $\nabla\phi^{-1}$. By definition, if f ∈ F, it is parameterised by a vector θ as follows: $f(x) = \nabla\phi^{-1}(\theta^T x)$. Let h be the function, expressed in primal coordinates, corresponding to the convex combination of two arbitrary GLMs in their dual coordinates, i.e. $h = \nabla\phi^{-1}(\lambda\,\nabla\phi(f) + (1-\lambda)\,\nabla\phi(g))$, with λ ∈ [0, 1], and with f and g two GLMs $f = \nabla\phi^{-1}(\theta^T x)$ and $g = \nabla\phi^{-1}(\xi^T x)$. We need to show that h ∈ F.
But

$$\begin{aligned}
h &= \nabla\phi^{-1}\big(\lambda\,\nabla\phi(f) + (1-\lambda)\,\nabla\phi(g)\big) \\
&= \nabla\phi^{-1}\big(\lambda\,\nabla\phi(\nabla\phi^{-1}(\theta^T x)) + (1-\lambda)\,\nabla\phi(\nabla\phi^{-1}(\xi^T x))\big) \\
&= \nabla\phi^{-1}\big(\lambda\theta^T x + (1-\lambda)\xi^T x\big) \\
&= \nabla\phi^{-1}\big((\lambda\theta + (1-\lambda)\xi)^T x\big)
\end{aligned}$$

which is again a GLM in F.

B Experimental details

We summarise our methodology to generate the illustrative experiments shown in the paper. Full code is available at https://github.com/profgavinbrown/ondecompositions

We use a synthetic 1-d problem: x ∈ [0, 15], and the true label is y = x + 5 sin(2x) + ϵ, where ϵ is Gaussian noise with zero mean and σ = 3. Training data is n = 100 points, illustrated below.

Figure 8: Synthetic problem for experiments.

Since this is a regression problem, ℓ(y, f(x)) = (y − f(x))², and $\bar{f}_\phi(x) := \mathbb{E}_D[\hat{f}(x)]$. The function class F is defined as the set of all trained models obtained over T = 1000 independently sampled datasets, each of size n = 100. The best-in-class model is the minimum across the T trials:

$$f^* := \arg\min_D \hat{R}(\hat{f}_D) \quad (61)$$

where the risk R(f) is approximated by a sample of uniformly spaced points at a resolution of 0.001, giving a total of n = 15,000 test points. To simplify analysis, we assume $\hat{f} = \hat{f}_{\mathrm{erm}}$.

C Example Literature Conflating the Decompositions

We provide evidence from several sources that the decompositions are often conflated. This includes recent papers in NeurIPS, ICML, the Journal of Physics, Transactions on Neural Networks, and MIT lecture notes.

C.1 Published work

In a popular online textbook, Daumé (2017, Section 5.9) states: The trade-off between estimation error and approximation error is often called the bias/variance trade-off, where approximation error is bias and estimation error is variance.

In NeurIPS, Lee et al. (2022) state: Model selection is a fundamental task in supervised learning and statistical learning theory.
Given a sequence of model classes, the goal is to optimally balance the approximation error (bias) and estimation error (variance).

In the Journal of the Royal Statistical Society, Mukherjee et al. (2023) state: A bound of this type [...] describes the bias-variance or the approximation-estimation trade-off [...] the bias ("approximation") in (15) is zero and the variance ("estimation") term determines the estimation error.

In the Journal of Physics, Chen et al. (2022) state: To achieve low prediction error in supervised learning, the approximation-estimation tradeoff (also known as the bias-variance trade-off) should be considered.

In the Bulletin of the AMS, Cucker & Smale (2002) (1984 Google Scholar citations, Feb 2024) state: Then, typically, the approximation error will decrease when enlarging H, but the sample error will increase. This latter feature is sometimes called the bias-variance trade-off [...] The bias is the approximation error and the variance is the sample error.

In Machine Learning Journal, Barron (1994) (1050 Google Scholar citations, Feb 2024) states: one can deal effectively with the total risk of the estimation of functions, including both the approximation error (bias) and the estimation error (variance).

In IEEE Trans. Neural Networks, Kwok & Yeung (1996) state: This is however not the case for R in the absence of a bias term, [...] and thus the approximation error (i.e., bias) cannot be made as small as desired by trading variance.

In IEEE Trans. Neural Networks, Lei et al. (2014) state: To see this, we identify two factors determining the model's generalization performance by recalling the following bias-variance decomposition [...] The first term is often called the estimation error, while the second is the approximation error [24], [28].

In ISI Stat, Wang et al.
(2021) state: Hence, we follow the conventional approximation estimation decomposition (or bias variance trade-off) to decompose the empirical norm. The next equation they state is the approximation-estimation decomposition for squared loss.

In the Journal of Ecological Modelling, Masson et al. (1999) state: The more flexible the model is, the greater is its ability to approach any function, but the more instable is the estimation problem from a finite amount of data. This is known as the approximation/estimation or bias/variance tradeoff.

In a PhD thesis, Merckling (2021) states: The two terms Eapp and Eest constitute the approximation-estimation tradeoff (a.k.a. bias-variance tradeoff) where high bias is similar to high approximation error known as underfitting, and high variance is similar to high estimation error known as overfitting.

Poggio et al. (2003) state: The decomposition of equation (12) is indirectly related to the well-known bias and variance decomposition in statistics. More generally, however, there is a trade-off between minimizing the sample error and minimizing the approximation error what we referred to as the bias-variance problem. Though technically correct, this is a slightly misleading use of language.

Niyogi & Girosi (1996), also in a 1995 MIT PhD thesis (Niyogi, 1995), state: As the number of parameters (proportional to n) increases, the bias (which can be thought of as analogous to the approximation error) of the estimator decreases and its variance (which can be thought of as analogous to the estimation error) increases for a fixed size of the data set. Finding the right bias-variance trade-off is very similar in spirit to finding the trade-off between network complexity and data complexity.

Wang & Lin (2023) state: These empirical findings deeply challenge the conventional wisdom that optimal generalization should be achieved by trading off bias (or approximation error) and variance (or estimation error).

In ICML 2023, Dubois et al.
(2023) state: In supervised learning, one can get more fine-grained insights using the estimation/approximation (or bias/variance) risk decomposition, and: The estimation/approximation or the bias/variance decomposition has been very useful for practitioners and theoreticians to focus on specific risk components; though in the Appendix of the same article they state: the approximation-estimation tradeoff (or the related bias-vias tradeoff) [sic, including typo].

Fan et al. (2021) state: We follow the conventional approximation-estimation decomposition (sometimes, also bias-variance tradeoff). The next equation they state is the approximation-estimation decomposition.

In a PhD thesis, Haury (2012) states: Figure 1.3: Approximation error and estimation error. The error made when choosing [a model] can be seen as the sum of bias and variance. The bias refers to the approximation error and the variance to the estimation error.

C.2 Online materials conflating the decompositions

At the time of writing this article, all material was available at the URLs below. As these are not archived in perpetuity, we cannot guarantee availability in the future.

New York University: Slide 30 states "Approximation error = bias", and "Estimation error = variance". https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/1b.intro-slt-riskdecomp.pdf

Module 9.520 lecture slides 17-19 use the title "Bias-Variance Tradeoff" but proceed to discuss the approximation-estimation decomposition. https://www.mit.edu/~9.520/fall18/slides/Class14_SL.pdf

University of Wisconsin: This decomposition into stochastic and approximation errors is similar to the bias-variance tradeoff which arises in classical estimation theory: the approximation error is like a bias squared term, and the estimation error is like a variance term.
https://nowak.ece.wisc.edu/SLT09/lecture3.pdf

Reddit Data Science discussion forum: A couple people have been confused by the exact terminology. I should clarify that bias-variance decomposition is technically different than the approximation-estimation error decomposition. But they are extremely similar, and in most cases they are mathematically equivalent. In fact, it is useful to think of the approximation-estimation decomp as a sub-case of the bias-variance decomposition, if we make the assumption that our training algorithm is expected to output the best model in its hypothesis class. If this assumption can be made, then they become mathematically equivalent most intents and purposes. It's important to note that most modern class machine learning algorithms and classes satisfy this assumption, so they are equivalent. https://www.reddit.com/r/datascience/s/ujFSz3rYqO