# Algorithmic Stability and Uniform Generalization

Ibrahim Alabdulmohsin
King Abdullah University of Science and Technology
Thuwal 23955, Saudi Arabia
ibrahim.alabdulmohsin@kaust.edu.sa

Abstract

One of the central questions in statistical learning theory is to determine the conditions under which agents can learn from experience. This includes the necessary and sufficient conditions for generalization from a given finite training set to new observations. In this paper, we prove that algorithmic stability in the inference process is equivalent to uniform generalization across all parametric loss functions. We provide various interpretations of this result. For instance, a relationship is proved between stability and data processing, which reveals that algorithmic stability can be improved by post-processing the inferred hypothesis or by augmenting training examples with artificial noise prior to learning. In addition, we establish a relationship between algorithmic stability and the size of the observation space, which provides a formal justification for dimensionality reduction methods. Finally, we connect algorithmic stability to the size of the hypothesis space, which recovers the classical PAC result that the size (complexity) of the hypothesis space should be controlled in order to improve algorithmic stability and, hence, generalization.

1 Introduction

One fundamental goal of any learning algorithm is to strike the right balance between underfitting and overfitting. In mathematical terms, this is often translated into two separate objectives. First, we would like the learning algorithm to produce a hypothesis that is reasonably consistent with the empirical evidence (i.e. to have a small empirical risk). Second, we would like to guarantee that the empirical risk (training error) is a valid estimate of the true unknown risk (test error). The former condition protects against underfitting, while the latter protects against overfitting.

The rationale behind these two objectives can be understood if we define the generalization risk $R_{gen}$ as the absolute difference between the empirical and true risks: $R_{gen} \doteq |R_{emp} - R_{true}|$. Then, it is elementary to observe that the true risk $R_{true}$ is bounded from above by the sum $R_{emp} + R_{gen}$. Hence, by minimizing both the empirical risk (underfitting) and the generalization risk (overfitting), one obtains an inference procedure whose true risk is minimal.

Minimizing the empirical risk alone can be carried out using the empirical risk minimization (ERM) procedure [1] or some approximation to it. However, the generalization risk is often impossible to deal with directly. Instead, it is common practice to bound it analytically so that we can establish conditions under which it is guaranteed to be small. By establishing conditions for generalization, one hopes to design better learning algorithms that both perform well empirically and generalize well to novel observations in the future. A prominent example of such an approach is the Support Vector Machines (SVM) algorithm for binary classification [2]. However, bounding the generalization risk is quite intricate because it can be approached from various angles. In fact, several methods have been proposed in the past to prove generalization bounds, including uniform convergence, algorithmic stability, Rademacher and Gaussian complexities, generic chaining bounds, the PAC-Bayesian framework, and robustness-based analysis [1, 3, 4, 5, 6, 7, 8, 9].
Concentration of measure inequalities form the building blocks of these rich theories.

The proliferation of generalization bounds can be understood if we look into the general setting of learning introduced by Vapnik [1]. In this setting, we have an observation space $\mathcal{Z}$ and a hypothesis space $\mathcal{H}$. A learning algorithm, henceforth denoted $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$, uses a finite set of observations to infer a hypothesis $H \in \mathcal{H}$. In the general setting, the inference process end-to-end is influenced by three key factors: (1) the nature of the observation space $\mathcal{Z}$, (2) the nature of the hypothesis space $\mathcal{H}$, and (3) the details of the learning algorithm $\mathcal{L}$. By imposing constraints on any of these three components, one may be able to derive new generalization bounds. For example, the Vapnik-Chervonenkis (VC) theory derives generalization bounds by assuming constraints on $\mathcal{H}$, while stability bounds, e.g. [6, 10, 11, 12], are derived by assuming constraints on $\mathcal{L}$.

Given that different generalization bounds can be established by imposing constraints on any of $\mathcal{Z}$, $\mathcal{H}$, or $\mathcal{L}$, it is intriguing to ask whether there exists a single view of generalization that ties all of these different components together. In this paper, we answer this question in the affirmative by establishing that algorithmic stability alone is equivalent to uniform generalization. Informally speaking, an inference process is said to generalize uniformly if the generalization risk vanishes uniformly across all bounded parametric loss functions in the limit of large training sets. A more precise definition will be presented in the sequel. We will show why constraints that are imposed on either $\mathcal{H}$, $\mathcal{Z}$, or $\mathcal{L}$ to improve uniform generalization can be interpreted as methods of improving the stability of the learning algorithm $\mathcal{L}$. This is similar in spirit to a result by Kearns and Ron, who showed that having a finite VC dimension in the hypothesis space $\mathcal{H}$ implies a certain notion of algorithmic stability in the inference process [13]. Our statement, however, is more general: it applies to all learning algorithms that fall under Vapnik's general setting of learning, well beyond uniform convergence.

The rest of the paper is organized as follows. First, we review the current literature on algorithmic stability, generalization, and learnability. Then, we introduce key definitions that will be used repeatedly throughout the paper. Next, we prove the central theorem, which reveals that algorithmic stability is equivalent to uniform generalization, and provide various interpretations of this result afterward.

2 Related Work

Perhaps the two most fundamental concepts in statistical learning theory are those of learnability and generalization [12, 14]. The two concepts are distinct from each other. As will be discussed in more detail next, whereas learnability is concerned with measuring the excess risk within a hypothesis space, generalization is concerned with estimating the true risk.

In order to define learnability and generalization, suppose we have an observation space $\mathcal{Z}$, a probability distribution of observations $P(z)$, and a bounded stochastic loss function $L(\cdot\,; H): \mathcal{Z} \to [0, 1]$, where $H \in \mathcal{H}$ is an inferred hypothesis. Note that $L$ is implicitly a function of (parameterized by) $H$ as well. We define the true risk of a hypothesis $H \in \mathcal{H}$ by the risk functional:

$$R_{true}(H) = \mathbb{E}_{Z \sim P(z)}\, L(Z; H) \qquad (1)$$

Then, a learning algorithm is called consistent if the true risk of its inferred hypothesis $H$ converges to the optimal true risk within the hypothesis space $\mathcal{H}$ in the limit of large training sets $m \to \infty$.
A problem is called learnable if it admits a consistent learning algorithm [14]. It has long been known that learnability for supervised classification and regression problems is equivalent to uniform convergence [3, 14]. However, Shalev-Shwartz et al. recently showed that uniform convergence is not necessary in Vapnik's general setting of learning, and proposed algorithmic stability as an alternative key condition for learnability [14].

Unlike learnability, the question of generalization is concerned primarily with how representative the empirical risk $R_{emp}$ is of the true risk $R_{true}$. To elaborate, suppose we have a finite training set $S_m = \{Z_i\}_{i=1,\ldots,m}$, which comprises $m$ i.i.d. observations $Z_i \sim P(z)$. We define the empirical risk of a hypothesis $H$ with respect to $S_m$ by:

$$R_{emp}(H; S_m) = \frac{1}{m} \sum_{Z_i \in S_m} L(Z_i; H) \qquad (2)$$

We also let $R_{true}(H)$ be the true risk as defined in Eq. (1). Then, a learning algorithm $\mathcal{L}$ is said to generalize if the empirical risk of its inferred hypothesis converges to its true risk as $m \to \infty$. Similar to learnability, uniform convergence is, by definition, sufficient for generalization [1], but it is not necessary because the learning algorithm can always restrict its search space to a smaller subset of $\mathcal{H}$ (artificially, so to speak). By contrast, it is not known whether algorithmic stability is necessary for generalization. It has been shown that various notions of algorithmic stability can be defined that are sufficient for generalization [6, 10, 11, 12, 15, 16]. However, it is not known whether an appropriate notion of algorithmic stability can be defined that is both necessary and sufficient for generalization in Vapnik's general setting of learning. In this paper, we answer this question by showing that stability in the inference process is not only sufficient for generalization but is, in fact, equivalent to uniform generalization, which is a notion of generalization that is stronger than the one traditionally considered in the literature.

3 Preliminaries

To simplify the discussion, we will always assume that all sets are countable, including the observation space $\mathcal{Z}$ and the hypothesis space $\mathcal{H}$. This is similar to the assumptions used in some previous works such as [6]. However, the main results, which are presented in Section 4, can be readily generalized. In addition, we assume that all learning algorithms are invariant to permutations of the training set; hence, the order of training examples is irrelevant.

Moreover, if $X \sim P(x)$ is a random variable drawn from the alphabet $\mathcal{X}$ and $f(X)$ is a function of $X$, we write $\mathbb{E}_{X \sim P(x)}\, f(X)$ to mean $\sum_{x \in \mathcal{X}} P(x)\, f(x)$. Often, we will simply write $\mathbb{E}_X\, f(X)$ to mean $\mathbb{E}_{X \sim P(x)}\, f(X)$ if the distribution of $X$ is clear from the context. If $X$ takes its values from a finite set $S$ uniformly at random, we write $X \sim S$ to denote this distribution of $X$. If $X$ is a Boolean random variable, then $\mathbb{I}\{X\} = 1$ if and only if $X$ is true; otherwise $\mathbb{I}\{X\} = 0$. In general, random variables are denoted with capital letters, instances of random variables are denoted with small letters, and alphabets are denoted with calligraphic typeface. Also, given two probability mass functions $P$ and $Q$ defined on the same alphabet $\mathcal{A}$, we will write $\langle P, Q \rangle$ to denote the overlapping coefficient, i.e. intersection, between $P$ and $Q$. That is, $\langle P, Q \rangle \doteq \sum_{a \in \mathcal{A}} \min\{P(a), Q(a)\}$. Note that $\langle P, Q \rangle = 1 - \|P - Q\|_T$, where $\|P - Q\|_T$ is the total variation distance. Last, we will write $B(k; \phi, n) = \binom{n}{k}\, \phi^k\, (1-\phi)^{n-k}$ to denote the binomial distribution.
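For concreteness, the following minimal Python sketch (ours, not part of the paper) illustrates the notation above: the overlapping coefficient $\langle P, Q \rangle$, its relation to the total variation distance, and the binomial pmf $B(k; \phi, n)$. All function names are our own choices for illustration.

```python
# A minimal sketch of the notation in Section 3: the overlapping coefficient
# <P, Q>, the total variation distance, and the binomial pmf B(k; phi, n).
from math import comb


def overlap(P, Q):
    """Overlapping coefficient <P, Q> = sum_a min{P(a), Q(a)}."""
    support = set(P) | set(Q)
    return sum(min(P.get(a, 0.0), Q.get(a, 0.0)) for a in support)


def total_variation(P, Q):
    """Total variation distance ||P - Q||_T = (1/2) sum_a |P(a) - Q(a)|."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(a, 0.0) - Q.get(a, 0.0)) for a in support)


def binomial_pmf(k, phi, n):
    """B(k; phi, n) = C(n, k) * phi^k * (1 - phi)^(n - k)."""
    return comb(n, k) * phi**k * (1.0 - phi) ** (n - k)


if __name__ == "__main__":
    P = {0: 0.5, 1: 0.3, 2: 0.2}
    Q = {0: 0.2, 1: 0.2, 2: 0.6}
    # Check the identity <P, Q> = 1 - ||P - Q||_T on a toy example.
    assert abs(overlap(P, Q) - (1.0 - total_variation(P, Q))) < 1e-12
    # The binomial pmf sums to one.
    assert abs(sum(binomial_pmf(k, 0.3, 10) for k in range(11)) - 1.0) < 1e-12
    print(overlap(P, Q), total_variation(P, Q))
```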
In this paper, we consider the general setting of learning introduced by Vapnik [1]. To reiterate, we have an observation space $\mathcal{Z}$ and a hypothesis space $\mathcal{H}$. Our learning algorithm $\mathcal{L}$ receives a set of $m$ observations $S_m = \{Z_i\}_{i=1,\ldots,m} \in \mathcal{Z}^m$ generated i.i.d. from a fixed unknown distribution $P(z)$, and picks a hypothesis $H \in \mathcal{H}$ with probability $P_{\mathcal{L}}(H = h \mid S_m)$. Formally, $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$ is a stochastic map. In this paper, we allow the hypothesis $H$ to be any summary statistic of the training set. It can be a measure of central tendency, as in unsupervised learning, or it can be a mapping from an input space to an output space, as in supervised learning. In fact, we even allow $H$ to be a subset of the training set itself. In formal terms, $\mathcal{L}$ is a stochastic map between the two random variables $H \in \mathcal{H}$ and $S_m \in \mathcal{Z}^m$, where the exact interpretation of those random variables is irrelevant.

In any learning task, we assume a non-negative bounded loss function $L(Z; H): \mathcal{Z} \to [0, 1]$ is used to measure the quality of the inferred hypothesis $H \in \mathcal{H}$ on the observation $Z \in \mathcal{Z}$. Most importantly, we assume that $L(\cdot\,; H): \mathcal{Z} \to [0, 1]$ is parametric:

Definition 1 (Parametric Loss Functions). A loss function $L(\cdot\,; H): \mathcal{Z} \to [0, 1]$ is called parametric if it is independent of the training set $S_m$ given the inferred hypothesis $H$. That is, a parametric loss function satisfies the Markov chain $S_m \to H \to L(\cdot\,; H)$.

For any fixed hypothesis $H \in \mathcal{H}$, we define its true risk $R_{true}(H)$ by Eq. (1), and define its empirical risk on a training set $S_m$, denoted $R_{emp}(H; S_m)$, by Eq. (2). We also define the true and empirical risks of the learning algorithm $\mathcal{L}$ by the expected risk of its inferred hypothesis:

$$\hat{R}_{true}(\mathcal{L}) = \mathbb{E}_{S_m}\, \mathbb{E}_{H \sim P_{\mathcal{L}}(h|S_m)}\, R_{true}(H) = \mathbb{E}_{S_m}\, \mathbb{E}_{H|S_m}\, R_{true}(H) \qquad (3)$$

$$\hat{R}_{emp}(\mathcal{L}) = \mathbb{E}_{S_m}\, \mathbb{E}_{H \sim P_{\mathcal{L}}(h|S_m)}\, R_{emp}(H; S_m) = \mathbb{E}_{S_m}\, \mathbb{E}_{H|S_m}\, R_{emp}(H; S_m) \qquad (4)$$

To simplify notation, we will write $\hat{R}_{true}$ and $\hat{R}_{emp}$ instead of $\hat{R}_{true}(\mathcal{L})$ and $\hat{R}_{emp}(\mathcal{L})$. We will consider the following definition of generalization:

Definition 2 (Generalization). A learning algorithm $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$ with a parametric loss function $L(\cdot\,; H): \mathcal{Z} \to [0, 1]$ generalizes if for any distribution $P(z)$ on $\mathcal{Z}$, we have $\lim_{m \to \infty} |\hat{R}_{emp} - \hat{R}_{true}| = 0$, where $\hat{R}_{true}$ and $\hat{R}_{emp}$ are given in Eq. (3) and Eq. (4) respectively.

In other words, a learning algorithm $\mathcal{L}$ generalizes according to Definition 2 if its empirical performance (training loss) becomes an unbiased estimator of the true risk as $m \to \infty$. Next, we define uniform generalization:

Definition 3 (Uniform Generalization). A learning algorithm $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$ generalizes uniformly if for any $\epsilon > 0$, there exists $m_0(\epsilon) > 0$ such that for all distributions $P(z)$ on $\mathcal{Z}$, all parametric loss functions, and all sample sizes $m > m_0(\epsilon)$, we have $|\hat{R}_{emp}(\mathcal{L}) - \hat{R}_{true}(\mathcal{L})| \leq \epsilon$.

Uniform generalization is stronger than the original notion of generalization in Definition 2. In particular, if a learning algorithm generalizes uniformly, then it generalizes according to Definition 2 as well. The converse, however, is not true. Even though uniform generalization appears, at first sight, to be quite a strong condition, a key contribution of this paper is to show that it is not, because it is equivalent to a simple condition, namely algorithmic stability.

4 Main Results

Before we prove that algorithmic stability is equivalent to uniform generalization, we introduce a probabilistic notion of mutual stability between two random variables. In order to abstract away any labeling information the random variables might possess (e.g. the observation space may or may not be a metric space), we define stability by the impact of observations on probability distributions:
Definition 4 (Mutual Stability). Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be two random variables. Then, the mutual stability between $X$ and $Y$ is defined by:

$$S(X; Y) \doteq \big\langle P(X)\, P(Y),\; P(X, Y) \big\rangle = \mathbb{E}_X\, \big\langle P(Y),\, P(Y \mid X) \big\rangle = \mathbb{E}_Y\, \big\langle P(X),\, P(X \mid Y) \big\rangle$$

If we recall that $0 \leq \langle P, Q \rangle \leq 1$ is the overlapping coefficient between the two probability distributions $P$ and $Q$, we see that $S(X; Y)$ given by Definition 4 is indeed a probabilistic measure of mutual stability. It measures how stable the distribution of $Y$ is before and after observing an instance of $X$, and vice versa. A small value of $S(X; Y)$ means that the probability distribution of $X$ or $Y$ is heavily perturbed by a single observation of the other random variable. Perfect mutual stability is achieved when the two random variables are independent of each other.

With this probabilistic notion of mutual stability in mind, we define the stability of a learning algorithm $\mathcal{L}$ by the mutual stability between its inferred hypothesis and a random training example.

Definition 5 (Algorithmic Stability). Let $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm that receives a finite set of training examples $S_m = \{Z_i\}_{i=1,\ldots,m} \in \mathcal{Z}^m$ drawn i.i.d. from a fixed distribution $P(z)$. Let $H \sim P_{\mathcal{L}}(h \mid S_m)$ be the hypothesis inferred by $\mathcal{L}$, and let $Z_{trn} \sim S_m$ be a single random training example. We define the stability of $\mathcal{L}$ by $S(\mathcal{L}) = \inf_{P(z)} S(H; Z_{trn})$, where the infimum is taken over all possible distributions of observations $P(z)$. A learning algorithm is called algorithmically stable if $\lim_{m \to \infty} S(\mathcal{L}) = 1$.

Note that the above definition of algorithmic stability is rather weak; it only requires that the contribution of any single training example to the overall inference process become more and more negligible as the sample size increases. In addition, it is well-defined even if the learning algorithm is deterministic, because the hypothesis $H$, even when it is a deterministic function of the entire training set of $m$ observations, remains a stochastic function of any individual observation. We illustrate this concept with the following example:

Example 1. Suppose that observations $Z_i \in \{0, 1\}$ are i.i.d. Bernoulli trials with $P(Z_i = 1) = \phi$, and that the hypothesis produced by $\mathcal{L}$ is the empirical average $H = \frac{1}{m}\sum_{i=1}^{m} Z_i$. Because $P(H = k/m \mid Z_{trn} = 1) = B(k-1; \phi, m-1)$ and $P(H = k/m \mid Z_{trn} = 0) = B(k; \phi, m-1)$, it can be shown using Stirling's approximation [17] that the algorithmic stability of this learning algorithm is asymptotically given by $S(\mathcal{L}) \approx 1 - \frac{1}{\sqrt{2\pi m}}$, where the infimum over $P(z)$ is achieved when $\phi = 1/2$. A more general statement will be proved later in Section 5.
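As a numerical illustration (ours, not from the paper), the sketch below computes the exact mutual stability $S(H; Z_{trn})$ of the empirical-average learner in Example 1 directly from Definition 4 and compares it with the asymptotic value $1 - 1/\sqrt{2\pi m}$ at $\phi = 1/2$. The helper names are our own.

```python
# Exact mutual stability S(H; Z_trn) of the empirical-average learner on
# Bernoulli data (Example 1), versus the asymptotic value 1 - 1/sqrt(2*pi*m).
from math import comb, pi, sqrt


def binom_pmf(k, phi, n):
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * phi**k * (1.0 - phi) ** (n - k)


def stability_empirical_average(phi, m):
    """S(H; Z_trn) = E_{Z_trn} <P(H), P(H | Z_trn)> for H = (1/m) sum_i Z_i."""
    # Marginal of H:      P(H = k/m)              = B(k; phi, m).
    # Conditionals of H:  P(H = k/m | Z_trn = 1)  = B(k - 1; phi, m - 1),
    #                     P(H = k/m | Z_trn = 0)  = B(k; phi, m - 1).
    overlap_1 = sum(min(binom_pmf(k, phi, m), binom_pmf(k - 1, phi, m - 1))
                    for k in range(m + 1))
    overlap_0 = sum(min(binom_pmf(k, phi, m), binom_pmf(k, phi, m - 1))
                    for k in range(m + 1))
    return phi * overlap_1 + (1.0 - phi) * overlap_0


if __name__ == "__main__":
    for m in (10, 100, 1000):
        exact = stability_empirical_average(0.5, m)
        asymptotic = 1.0 - 1.0 / sqrt(2.0 * pi * m)
        print(m, round(exact, 4), round(asymptotic, 4))
```

As expected under Example 1, the exact value approaches the asymptotic expression as $m$ grows.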
Next, we show that the notion of algorithmic stability in Definition 5 is equivalent to the notion of uniform generalization in Definition 3. Before we do that, we first state the following lemma.

Lemma 1 (Data Processing Inequality). Let $A$, $B$, and $C$ be three random variables that satisfy the Markov chain $A \to B \to C$. Then: $S(A; B) \leq S(A; C)$.

Proof. The proof consists of two steps (detailed proofs are available in the supplementary file). First, we note that because the Markov chain implies $P(C \mid B, A) = P(C \mid B)$, we have $S(A; (B, C)) = S(A; B)$ by direct substitution into Definition 4. Second, similar to the information-cannot-hurt inequality in information theory [18], it can be shown that $S(A; (B, C)) \leq S(A; C)$ for any random variables $A$, $B$, and $C$. This is proved using some algebraic manipulation and the fact that the minimum of sums is always at least the sum of minimums, i.e. $\min\{\sum_i \alpha_i,\, \sum_i \beta_i\} \geq \sum_i \min\{\alpha_i, \beta_i\}$. Combining both results yields $S(A; B) = S(A; (B, C)) \leq S(A; C)$, which is the desired result.

Now, we are ready to state the main result of this paper.

Theorem 1. For any learning algorithm $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$, algorithmic stability as given in Definition 5 is both necessary and sufficient for uniform generalization (see Definition 3). In addition,

$$\big|\hat{R}_{true} - \hat{R}_{emp}\big| \;\leq\; 1 - S(H; Z_{trn}) \;\leq\; 1 - S(\mathcal{L}),$$

where $\hat{R}_{true}$ and $\hat{R}_{emp}$ are the true and empirical risks of the learning algorithm defined in Eq. (3) and (4) respectively.

Proof. Here is an outline of the proof. First, because a parametric loss function $L(\cdot\,; H): \mathcal{Z} \to [0, 1]$ is itself a random variable that satisfies the Markov chain $S_m \to H \to L(\cdot\,; H)$, it is not independent of $Z_{trn} \sim S_m$. Hence, the empirical risk is given by $\hat{R}_{emp} = \mathbb{E}_{L(\cdot;H)}\, \mathbb{E}_{Z_{trn} \mid L(\cdot;H)}\, L(Z_{trn}; H)$. By contrast, the true risk is given by $\hat{R}_{true} = \mathbb{E}_{L(\cdot;H)}\, \mathbb{E}_{Z_{trn} \sim P(z)}\, L(Z_{trn}; H)$. The difference is:

$$\hat{R}_{true} - \hat{R}_{emp} = \mathbb{E}_{L(\cdot;H)} \Big[ \mathbb{E}_{Z_{trn} \sim P(z)}\, L(Z_{trn}; H) - \mathbb{E}_{Z_{trn} \mid L(\cdot;H)}\, L(Z_{trn}; H) \Big]$$

To sandwich the right-hand side between an upper and a lower bound, we note that if $P_1(z)$ and $P_2(z)$ are two distributions defined on the same alphabet $\mathcal{Z}$ and $F(\cdot): \mathcal{Z} \to [0, 1]$ is a bounded loss function, then $\big|\mathbb{E}_{Z \sim P_1(z)}\, F(Z) - \mathbb{E}_{Z \sim P_2(z)}\, F(Z)\big| \leq \|P_1(z) - P_2(z)\|_T$, where $\|P - Q\|_T$ is the total variation distance. The proof of this result can be immediately deduced by considering the two regions $\{z \in \mathcal{Z}: P_1(z) > P_2(z)\}$ and $\{z \in \mathcal{Z}: P_1(z) < P_2(z)\}$ separately. This is then used to deduce the inequalities:

$$\big|\hat{R}_{true} - \hat{R}_{emp}\big| \;\leq\; 1 - S\big(L(\cdot\,; H); Z_{trn}\big) \;\leq\; 1 - S(H; Z_{trn}) \;\leq\; 1 - S(\mathcal{L}),$$

where the second inequality follows by the data processing inequality in Lemma 1, whereas the last inequality follows by the definition of algorithmic stability (see Definition 5). This proves that if $\mathcal{L}$ is algorithmically stable, i.e. $S(\mathcal{L}) \to 1$ as $m \to \infty$, then $|\hat{R}_{true} - \hat{R}_{emp}|$ converges to zero uniformly across all parametric loss functions. Therefore, algorithmic stability is sufficient for uniform generalization. The converse is proved by showing that for any $\delta > 0$, there exist a bounded parametric loss function and a distribution $P_{\delta}(z)$ such that $1 - S(\mathcal{L}) - \delta \leq |\hat{R}_{true} - \hat{R}_{emp}| \leq 1 - S(\mathcal{L})$. Therefore, algorithmic stability is also necessary for uniform generalization.
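As a sanity check (ours, for illustration only), the sketch below estimates the generalization gap $|\hat{R}_{true} - \hat{R}_{emp}|$ by Monte Carlo for the Bernoulli empirical-average learner of Example 1 under the parametric loss $L(z; h) = |z - h| \in [0, 1]$, and compares it with the asymptotic value of $1 - S(\mathcal{L})$ from Example 1, which stands in here for the exact stability term in Theorem 1. All helper names are our own.

```python
# Monte Carlo check of Theorem 1: |R_true - R_emp| <= 1 - S(L), using the
# empirical-average learner on Bernoulli(phi) data and the loss |z - h|.
import random
from math import pi, sqrt


def generalization_gap(phi, m, trials=10000, seed=0):
    rng = random.Random(seed)
    emp_total, true_total = 0.0, 0.0
    for _ in range(trials):
        sample = [1 if rng.random() < phi else 0 for _ in range(m)]
        h = sum(sample) / m                                  # inferred hypothesis
        emp_total += sum(abs(z - h) for z in sample) / m     # empirical risk
        true_total += phi * (1.0 - h) + (1.0 - phi) * h      # exact E_Z |Z - h|
    return abs(true_total - emp_total) / trials


if __name__ == "__main__":
    phi, m = 0.5, 100
    gap = generalization_gap(phi, m)
    stability_bound = 1.0 / sqrt(2.0 * pi * m)  # asymptotic 1 - S(L), Example 1
    print(f"|R_true - R_emp| ~ {gap:.4f}  <=  1 - S(L) ~ {stability_bound:.4f}")
```

Under these assumptions the estimated gap is roughly $2\phi(1-\phi)/m$, comfortably below the stability term, in line with Theorem 1.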
5 Interpreting Algorithmic Stability and Uniform Generalization

In this section, we provide several interpretations of algorithmic stability and uniform generalization. In addition, we show how Theorem 1 recovers some classical results in learning theory.

5.1 Algorithmic Stability and Data Processing

The relationship between algorithmic stability and data processing is presented in Lemma 1. Given the random variables $A$, $B$, and $C$ and the Markov chain $A \to B \to C$, we always have $S(A; B) \leq S(A; C)$. This presents us with qualitative insights into the design of machine learning algorithms.

First, suppose we have two different hypotheses $H_1$ and $H_2$. We will say that $H_2$ contains less information than $H_1$ if the Markov chain $S_m \to H_1 \to H_2$ holds. For example, if observations $Z_i \in \{0, 1\}$ are Bernoulli trials, then $H_1 \in \mathbb{R}$ can be the empirical average as given in Example 1, while $H_2 \in \{0, 1\}$ can be the label that occurs most often in the training set. Because $H_2 = \mathbb{I}\{H_1 \geq 1/2\}$, the hypothesis $H_2$ contains strictly less information about the original training set than $H_1$; formally, we have $S_m \to H_1 \to H_2$. In this case, $H_2$ enjoys a better uniform generalization bound than $H_1$ because of data processing. Intuitively, we know that such a result should hold because $H_2$ is less tied to the original training set than $H_1$. This brings us to the following remark.

Remark 1. We can improve the uniform generalization bound (or equivalently the algorithmic stability) of a learning algorithm by post-processing its inferred hypothesis $H$ in a manner that is conditionally independent of the original training set given $H$.

Example 2. Post-processing hypotheses is a common technique in machine learning. This includes sparsifying the coefficient vector $w \in \mathbb{R}^d$ in linear methods, where $w_j$ is set to zero if it has a small absolute magnitude. It also includes methods that have been proposed to reduce the number of support vectors in SVM by exploiting linear dependence [19]. By the data processing inequality, such methods improve algorithmic stability and uniform generalization. Needless to say, better generalization does not immediately translate into a smaller true risk, because the empirical risk itself may increase when the inferred hypothesis is post-processed independently of the original training set.

Second, if the Markov chain $A \to B \to C$ holds, we also obtain $S(A; C) \geq S(B; C)$ by applying the data processing inequality to the reverse Markov chain $C \to B \to A$. As a result, we can improve algorithmic stability by contaminating training examples with artificial noise prior to learning. This is because if $\hat{S}_m$ is a perturbed version of a training set $S_m$, then the Markov chain $S_m \to \hat{S}_m \to H$ implies that $S(Z_{trn}; H) \geq S(\hat{Z}_{trn}; H)$, where $Z_{trn} \sim S_m$ and $\hat{Z}_{trn} \sim \hat{S}_m$ are random training examples drawn uniformly at random from each training set respectively. This brings us to the following remark:

Remark 2. We can improve the algorithmic stability of a learning algorithm by introducing artificial noise to the training examples, and applying the learning algorithm on the perturbed training set.

Example 3. Corrupting training examples with artificial noise, as in the recent dropout method, is a popular technique in neural networks for improving generalization [20]. By the data processing inequality, such methods indeed improve algorithmic stability and uniform generalization.

5.2 Algorithmic Stability and the Size of the Observation Space

Next, we look into how the size of the observation space $\mathcal{Z}$ influences algorithmic stability. First, we start with the following definition:

Definition 6 (Lazy Learning). A learning algorithm $\mathcal{L}$ is called lazy if its hypothesis $H \in \mathcal{H}$ is mapped one-to-one with the training set $S_m$, i.e. the mapping between $S_m$ and $H$ is injective.

A learner is called lazy because its hypothesis is equivalent to the original training set in its information content; hence, no learning actually takes place. One example is instance-based learning with $H = S_m$. Despite their simple nature, lazy learners are useful in practice. They are useful theoretical tools as well. In particular, because of the equivalence $H \leftrightarrow S_m$ and the data processing inequality, the algorithmic stability of a lazy learner provides a lower bound on the stability of any possible learning algorithm. Therefore, we can relate algorithmic stability (uniform generalization) to the size of the observation space by quantifying the algorithmic stability of lazy learners. Because the size of $\mathcal{Z}$ is usually infinite, however, we introduce the following definition of effective set size.

Definition 7. In a countable space $\mathcal{Z}$ endowed with a probability mass function $P(z)$, the effective size of $\mathcal{Z}$ w.r.t. $P(z)$ is defined by: $\mathrm{Ess}[\mathcal{Z}; P(z)] \doteq 1 + \Big[\sum_{z \in \mathcal{Z}} \sqrt{P(z)\,(1 - P(z))}\Big]^2$.

At one extreme, if $P(z)$ is uniform over a finite alphabet $\mathcal{Z}$, then $\mathrm{Ess}[\mathcal{Z}; P(z)] = |\mathcal{Z}|$. At the other extreme, if $P(z)$ is a Kronecker delta distribution, then $\mathrm{Ess}[\mathcal{Z}; P(z)] = 1$.
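A small sketch (ours) of Definition 7 follows; it computes the effective set size for a few example distributions and checks the two extremes noted above. The function names are our own.

```python
# Effective set size of Definition 7:
# Ess[Z; P] = 1 + (sum_z sqrt(P(z) * (1 - P(z))))^2.
from math import sqrt


def effective_set_size(P):
    """Ess[Z; P] for a probability mass function P given as a dict z -> P(z)."""
    return 1.0 + sum(sqrt(p * (1.0 - p)) for p in P.values()) ** 2


if __name__ == "__main__":
    n = 8
    uniform = {z: 1.0 / n for z in range(n)}
    delta = {0: 1.0}
    skewed = {0: 0.9, 1: 0.05, 2: 0.05}
    print(effective_set_size(uniform))  # = |Z| = 8 at the uniform distribution
    print(effective_set_size(delta))    # = 1 for a Kronecker delta
    print(effective_set_size(skewed))   # strictly between 1 and |Z| = 3
```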
As proved next, this notion of effective set size determines the rate of convergence of an empirical probability mass function to its true distribution, when the distance is measured in the total variation sense. As a result, it allows us to relate algorithmic stability to a property of the observation space $\mathcal{Z}$.

Theorem 2. Let $\mathcal{Z}$ be a countable space endowed with a probability mass function $P(z)$. Let $S_m$ be a set of $m$ i.i.d. samples $Z_i \sim P(z)$. Define $P_{S_m}(z)$ to be the empirical probability mass function induced by drawing samples uniformly at random from $S_m$. Then:

$$\mathbb{E}_{S_m}\, \|P(z) - P_{S_m}(z)\|_T = \sqrt{\frac{\mathrm{Ess}[\mathcal{Z}; P(z)] - 1}{2\pi m}} + o(1/\sqrt{m}),$$

where $1 \leq \mathrm{Ess}[\mathcal{Z}; P(z)] \leq |\mathcal{Z}|$ is the effective size of $\mathcal{Z}$ (see Definition 7). In addition, for any learning algorithm $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$, we have

$$S(H; Z_{trn}) \geq 1 - \sqrt{\frac{\mathrm{Ess}[\mathcal{Z}; P(z)] - 1}{2\pi m}} - o(1/\sqrt{m}),$$

where the bound is achieved by lazy learners (see Definition 6).

A special case of Theorem 2 was proved by de Moivre in the 1730s, who showed that the empirical mean of i.i.d. Bernoulli trials with probability of success $\phi$ converges to the true mean at a rate of $\sqrt{2\phi(1-\phi)/(\pi m)}$ on average. This is believed to be the first appearance of the square-root law in statistical inference in the literature [21]. Because the effective set size of the Bernoulli distribution is, according to Definition 7, given by $1 + 4\phi(1-\phi)$, Theorem 2 agrees with, and in fact generalizes, de Moivre's result.

Proof. Here is an outline of the proof. First, we know that $P(S_m) = \binom{m}{m_1, m_2, \ldots}\, p_1^{m_1}\, p_2^{m_2} \cdots$, where $\binom{m}{m_1, m_2, \ldots}$ is the multinomial coefficient. Using the relation $\|P - Q\|_T = \frac{1}{2}\|P - Q\|_1$, the multinomial series, and de Moivre's formula for the mean deviation of the binomial random variable [22], it can be shown with some algebraic manipulation that:

$$\mathbb{E}_{S_m}\, \|P(z) - P_{S_m}(z)\|_T = \frac{1}{m} \sum_{k=1,2,\ldots} (1 - p_k)^{(1-p_k)m}\; p_k^{1 + m p_k}\; \frac{m!}{(p_k m)!\; \big((1 - p_k)m - 1\big)!}$$

Using Stirling's approximation to the factorial [17], we obtain the simple asymptotic expression:

$$\mathbb{E}_{S_m}\, \|P(z) - P_{S_m}(z)\|_T \approx \frac{1}{\sqrt{2\pi m}} \sum_{k=1,2,3,\ldots} \sqrt{p_k (1 - p_k)} = \sqrt{\frac{\mathrm{Ess}[\mathcal{Z}; P(z)] - 1}{2\pi m}},$$

which is tight due to the tightness of the Stirling approximation. The rest of the theorem follows from the Markov chain $Z_{trn} \to S_m \to H$, the data processing inequality, and Definition 6.

Corollary 1. Given the conditions of Theorem 2, if $\mathcal{Z}$ is in addition finite (i.e. $|\mathcal{Z}| < \infty$), then for any learning algorithm $\mathcal{L}$, we have:

$$S(\mathcal{L}) \geq 1 - \sqrt{\frac{|\mathcal{Z}| - 1}{2\pi m}} - o(1/\sqrt{m})$$

Proof. In a finite observation space $\mathcal{Z}$, the maximum effective set size (see Definition 7) is $|\mathcal{Z}|$, which is attained at the uniform distribution $P(z) = 1/|\mathcal{Z}|$. The corollary then follows from Theorem 2.

Intuitively speaking, Theorem 2 and its corollary state that in order to guarantee good uniform generalization for all possible learning algorithms, the number of observations must be sufficiently large to cover the entire effective size of the observation space $\mathcal{Z}$. Needless to say, this is difficult to achieve in practice, so the algorithmic stability of machine learning algorithms must be controlled in order to guarantee good generalization from a few empirical observations. Similarly, the uniform generalization bound can be improved by reducing the effective size of the observation space, for example by using dimensionality reduction methods.
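The following Monte Carlo sketch (ours, for illustration only) checks the convergence rate of Theorem 2 on a small finite alphabet; the helper names are our own, and the empirical values agree with the predicted rate only up to the $o(1/\sqrt{m})$ term.

```python
# Monte Carlo check of Theorem 2: E_{S_m} ||P - P_{S_m}||_T versus the
# predicted rate sqrt((Ess[Z; P] - 1) / (2 * pi * m)).
import random
from collections import Counter
from math import pi, sqrt


def effective_set_size(P):
    """Definition 7: Ess[Z; P] = 1 + (sum_z sqrt(P(z) * (1 - P(z))))^2."""
    return 1.0 + sum(sqrt(p * (1.0 - p)) for p in P.values()) ** 2


def expected_tv(P, m, trials=1000, seed=0):
    """Monte Carlo estimate of E_{S_m} ||P - P_{S_m}||_T."""
    rng = random.Random(seed)
    zs, ps = list(P.keys()), list(P.values())
    total = 0.0
    for _ in range(trials):
        counts = Counter(rng.choices(zs, weights=ps, k=m))
        total += 0.5 * sum(abs(P[z] - counts.get(z, 0) / m) for z in zs)
    return total / trials


if __name__ == "__main__":
    P = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
    for m in (50, 200, 800):
        predicted = sqrt((effective_set_size(P) - 1.0) / (2.0 * pi * m))
        print(m, round(expected_tv(P, m), 4), round(predicted, 4))
```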
5.3 Algorithmic Stability and the Complexity of the Hypothesis Space

Finally, we look into the hypothesis space and how it influences algorithmic stability. First, we look into the role of the size of the hypothesis space. This is formalized in the following theorem.

Theorem 3. Denote by $H \in \mathcal{H}$ the hypothesis inferred by a learning algorithm $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$. Then, the following bound on algorithmic stability always holds:

$$S(H; Z_{trn}) \geq 1 - \sqrt{\frac{H(H)}{2m}},$$

where $H(H)$ is the Shannon entropy of $H$ measured in nats (i.e. using natural logarithms).

Proof. The proof is information-theoretic. If we let $I(X; Y)$ be the mutual information between the random variables $X$ and $Y$, and let $S_m = \{Z_1, Z_2, \ldots, Z_m\}$ be a random choice of a training set, we have:

$$I(S_m; H) = H(S_m) - H(S_m \mid H) = \Big[\sum_{i=1}^{m} H(Z_i)\Big] - \big[H(Z_1 \mid H) + H(Z_2 \mid Z_1, H) + \cdots\big]$$

Because conditioning reduces entropy, i.e. $H(A \mid B) \leq H(A)$ for any random variables $A$ and $B$, we have:

$$I(S_m; H) \geq \sum_{i=1}^{m} \big[H(Z_i) - H(Z_i \mid H)\big] = m\, \big[H(Z_{trn}) - H(Z_{trn} \mid H)\big] = m\, I(Z_{trn}; H),$$

so that

$$I(Z_{trn}; H) \leq \frac{1}{m}\, I(S_m; H) \qquad (5)$$

Next, we use Pinsker's inequality [18], which states that for any probability distributions $P$ and $Q$: $\|P - Q\|_T \leq \sqrt{D(P \,\|\, Q)/2}$, where $\|P - Q\|_T$ is the total variation distance and $D(P \,\|\, Q)$ is the Kullback-Leibler divergence measured in nats (i.e. using natural logarithms). If we recall that $S(S_m; H) = 1 - \|P(S_m)\, P(H) - P(S_m, H)\|_T$ while the mutual information is $I(S_m; H) = D\big(P(S_m, H) \,\|\, P(S_m)\, P(H)\big)$, we deduce from Pinsker's inequality and Eq. (5):

$$S(Z_{trn}; H) = 1 - \|P(Z_{trn})\, P(H) - P(Z_{trn}, H)\|_T \;\geq\; 1 - \sqrt{\frac{I(Z_{trn}; H)}{2}} \;\geq\; 1 - \sqrt{\frac{I(S_m; H)}{2m}} \;\geq\; 1 - \sqrt{\frac{H(H)}{2m}}$$

In the last step, we used the fact that $I(X; Y) \leq H(X)$ for any random variables $X$ and $Y$.

Theorem 3 re-establishes the classical PAC result on finite hypothesis spaces [23]. In terms of algorithmic stability, a learning algorithm will enjoy high stability if the size of the hypothesis space is small. In terms of uniform generalization, it states that the generalization risk of a learning algorithm is bounded from above, uniformly across all parametric loss functions, by $\sqrt{H(H)/(2m)} \leq \sqrt{\log|\mathcal{H}|/(2m)}$, where $H(H)$ is the Shannon entropy of $H$.

Next, we relate algorithmic stability to the Vapnik-Chervonenkis (VC) dimension. Despite the fact that the VC dimension is defined on binary-valued functions whereas algorithmic stability is a functional of probability distributions, there exists a connection between the two concepts. To show this, we first introduce a notion of an induced concept class that exists for any learning algorithm $\mathcal{L}$:

Definition 8. The concept class $\mathcal{C}$ induced by a learning algorithm $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$ is defined to be the set of total Boolean functions $c_h(z) = \mathbb{I}\{P(Z_{trn} = z \mid H = h) \geq P(Z_{trn} = z)\}$ for all $h \in \mathcal{H}$.

Intuitively, every hypothesis $H \in \mathcal{H}$ induces a total partition on the observation space $\mathcal{Z}$ given by the Boolean function in Definition 8. That is, $H$ splits $\mathcal{Z}$ into two disjoint sets: the set of values in $\mathcal{Z}$ that are, a posteriori, less likely to have been present in the training set than before, given that the inferred hypothesis is $H$, and the set of all other values. The complexity (richness) of the induced concept class $\mathcal{C}$ is related to algorithmic stability via the VC dimension.

Theorem 4. Let $\mathcal{L}: \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{H}$ be a learning algorithm with an induced concept class $\mathcal{C}$. Let $d_{VC}(\mathcal{C})$ be the VC dimension of $\mathcal{C}$. Then, the following bound holds if $m > d_{VC}(\mathcal{C}) + 1$:

$$S(\mathcal{L}) \geq 1 - \frac{4 + \sqrt{d_{VC}(\mathcal{C})\, \big(1 + \log(2m)\big)}}{\sqrt{2m}}$$

In particular, $\mathcal{L}$ is algorithmically stable if its induced concept class $\mathcal{C}$ has a finite VC dimension.

Proof. The proof relies on the fact that the algorithmic stability $S(\mathcal{L})$ is bounded from below by

$$1 - \sup_{P(z)} \Big\{ \mathbb{E}_{S_m} \sup_{h \in \mathcal{H}} \Big[ \mathbb{E}_{Z \sim P(z)}\, c_h(Z) - \mathbb{E}_{Z \sim S_m}\, c_h(Z) \Big] \Big\},$$

where $c_h(z) = \mathbb{I}\{P(Z_{trn} = z \mid H = h) \geq P(Z_{trn} = z)\}$. The final bound follows by applying uniform convergence results [23].
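To illustrate Theorem 3 numerically, the sketch below (ours, not from the paper) compares the entropy-based stability bound with the coarser $\sqrt{\log|\mathcal{H}|/(2m)}$ bound and with the asymptotic stability of the empirical-average learner from Example 1; the helper names are our own.

```python
# Theorem 3 for the empirical-average learner of Example 1 with phi = 1/2:
# the entropy bound 1 - sqrt(H(H)/(2m)) versus 1 - sqrt(log|H|/(2m)) and
# versus the asymptotic stability 1 - 1/sqrt(2*pi*m) from Example 1.
from math import comb, log, pi, sqrt


def binom_pmf(k, phi, n):
    return comb(n, k) * phi**k * (1.0 - phi) ** (n - k)


def hypothesis_entropy_nats(phi, m):
    """Shannon entropy of H = (1/m) sum_i Z_i, i.e. of Binomial(m, phi)/m."""
    probs = [binom_pmf(k, phi, m) for k in range(m + 1)]
    return -sum(p * log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    phi, m = 0.5, 100
    ent = hypothesis_entropy_nats(phi, m)
    entropy_bound = 1.0 - sqrt(ent / (2.0 * m))              # Theorem 3
    cardinality_bound = 1.0 - sqrt(log(m + 1) / (2.0 * m))   # via H(H) <= log|H|
    asymptotic_stability = 1.0 - 1.0 / sqrt(2.0 * pi * m)    # Example 1
    print(round(entropy_bound, 3), round(cardinality_bound, 3),
          round(asymptotic_stability, 3))
    # Both lower bounds lie below the (asymptotic) stability of this learner,
    # with the entropy bound the tighter of the two, as Theorem 3 suggests.
```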
6 Conclusions

In this paper, we showed that a probabilistic notion of algorithmic stability is equivalent to uniform generalization. In informal terms, a learning algorithm is called algorithmically stable if the impact of a single training example on the probability distribution of the final hypothesis always vanishes in the limit of large training sets. In other words, the inference process never depends heavily on any single training example. If algorithmic stability holds, then the learning algorithm generalizes well regardless of the choice of the parametric loss function. We also provided several interpretations of this result. For instance, the relationship between algorithmic stability and data processing reveals that algorithmic stability can be improved by either post-processing the inferred hypothesis or augmenting training examples with artificial noise prior to learning. In addition, we established a relationship between algorithmic stability and the effective size of the observation space, which provides a formal justification for dimensionality reduction methods. Finally, we connected algorithmic stability to the complexity (richness) of the hypothesis space, which re-establishes the classical PAC result that the complexity of the hypothesis space should be controlled in order to improve stability and, hence, generalization.

References

[1] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, September 1999.
[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," Journal of the ACM (JACM), vol. 36, no. 4, pp. 929-965, 1989.
[4] M. Talagrand, "Majorizing measures: the generic chaining," The Annals of Probability, vol. 24, no. 3, pp. 1049-1103, 1996.
[5] D. A. McAllester, "PAC-Bayesian stochastic model selection," Machine Learning, vol. 51, pp. 5-21, 2003.
[6] O. Bousquet and A. Elisseeff, "Stability and generalization," The Journal of Machine Learning Research (JMLR), vol. 2, pp. 499-526, 2002.
[7] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," The Journal of Machine Learning Research (JMLR), vol. 3, pp. 463-482, 2002.
[8] J.-Y. Audibert and O. Bousquet, "Combining PAC-Bayesian and generic chaining bounds," The Journal of Machine Learning Research (JMLR), vol. 8, pp. 863-889, 2007.
[9] H. Xu and S. Mannor, "Robustness and generalization," Machine Learning, vol. 86, no. 3, pp. 391-423, 2012.
[10] A. Elisseeff, M. Pontil, et al., "Leave-one-out error and stability of learning algorithms with applications," NATO-ASI Series on Learning Theory and Practice, Science Series Sub-Series III: Computer and Systems Sciences, 2002.
[11] S. Kutin and P. Niyogi, "Almost-everywhere algorithmic stability and generalization error," in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2002.
[12] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, "General conditions for predictivity in learning theory," Nature, vol. 428, pp. 419-422, 2004.
[13] M. Kearns and D. Ron, "Algorithmic stability and sanity-check bounds for leave-one-out cross-validation," Neural Computation, vol. 11, no. 6, pp. 1427-1453, 1999.
[14] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, "Learnability, stability and uniform convergence," The Journal of Machine Learning Research (JMLR), vol. 11, pp. 2635-2670, 2010.
[15] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[16] V. Vapnik and O. Chapelle, "Bounds on error expectation for support vector machines," Neural Computation, vol. 12, no. 9, pp. 2013-2036, 2000.
[17] H. Robbins, "A remark on Stirling's formula," American Mathematical Monthly, pp. 26-29, 1955.
[18] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley & Sons, 1991.
[19] T. Downs, K. E. Gates, and A. Masters, "Exact simplification of support vector solutions," The Journal of Machine Learning Research (JMLR), vol. 2, pp. 293-297, 2002.
[20] S. Wager, S. Wang, and P. S. Liang, "Dropout training as adaptive regularization," in NIPS, pp. 351-359, 2013.
[21] S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, 1986.
[22] P. Diaconis and S. Zabell, "Closed form summation for classical distributions: Variations on a theme of de Moivre," Statistical Science, vol. 6, no. 3, pp. 284-302, 1991.
[23] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.