PAC-Bayesian Bounds on Rate-Efficient Classifiers

Alhabib Abbas 1 2, Yiannis Andreopoulos 2

We derive analytic bounds on the noise invariance of majority vote classifiers operating on compressed inputs. Specifically, starting from recent bounds on the true risk of majority vote classifiers, we extend the applicability of PAC-Bayesian theory to quantify the resilience of majority votes to input noise stemming from compression. The derived bounds are intuitive in binary classification settings, where they can be measured as expressions of voter differentials and voter pair agreement. By combining measures of input distortion with analytic guarantees on noise invariance, we prescribe rate-efficient machines that compress inputs without affecting subsequent classification. Our validation shows how bounding noise invariance can inform the compression stage for any majority vote classifier, such that worst-case implications of bad input reconstructions are known and inputs can be compressed to the minimum amount of information needed prior to inference.

1. Introduction

To learn concepts inherently contained in data, recent breakthroughs in deep learning, adversarial robustness, and bounded Bayesian inference (Germain et al., 2015; Letarte et al., 2019; Vidot et al., 2021) typically assume models that can observe whole uncompressed volumes of $d$-dimensional inputs $x \in X \subseteq \mathbb{R}^d$ in order to map $x$ to target concepts $y \in Y$. However, in practice, systems with distributed computing assets (Pradhan et al., 2002; Xiao et al., 2006) employ lossy compression techniques to reduce the load of communication between inference machines and sensors that collect data at test time. In such contexts, whole volumes of $x$ are inaccessible to machines tasked with inference, and only noisy reconstructions of $x$ are available to infer concepts $y$.

1 Meme Research Ltd., London, UK. 2 Dept.
of Electronic and Electrical Eng., University College London, London, UK. Correspondence to: A. Abbas, Y. Andreopoulos.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Adapting recent proposals to data scarcity at test time entails the design of models that infer concepts from compressed samples of input $x_c = C(x)$, where $C : X \to E \to X$ encodes and decodes $x$ with information loss, $E \subseteq \mathbb{R}^{d_c}$ is the set describing compressed codes of $x$, and $d_c \ll d$. From a rate-distortion theory perspective (Gray & Neuhoff, 1998), code lengths $d_c$ quantify the rate of bitstreams ingested by arbitrary models $M : X \to Y$, and since $d_c^{-1} \propto |x_c - x|$, rate-efficient classifiers define the family of models that satisfy $M(x) = M(x_c)$ for lower values of $d_c$. Generally, models whose outputs are sensitive to small perturbations on $x$ can be deemed greedy models, due to their requirement of long code lengths $d_c$ to reduce $|x_c - x|$. On the other hand, models that are resilient to input distortion are rate-efficient, since they require shorter code lengths $d_c$ to correctly infer concepts $y$. Thus, rate-efficiency defines a sub-class of models that are resilient to noise inflicted on $x$, and this sets the context of our approach to solve for rate-efficient machines. We hereon define noise invariance as the probability of output change with respect to noisy input perturbations; in the case of binary classifiers where $y \in \{-1, 1\}$, noise invariance specifically refers to the proportion of outputs flipped due to input compression noise. Bounding the noise invariance of classifiers $M$ can allow for more aggressive compression, since it gives guarantees on the implications of reconstruction loss $|x_c - x|$ on predicted outputs.
Hence, wherever noise invariance bounds are defined over probabilistic measures on input noise, empirical measures of $|x_c - x|$ can be inspected to compress inputs up to allowable limits on code lengths $d_c$ without distorting predictions $M(x_c)$. The latter motivates our proposal, and we summarise our contributions below:

1. We introduce the notion of rate-efficient classifiers to PAC-Bayesian theory, and derive relevant expressions of noise invariance to define such classifiers.

2. We derive unsupervised general bounds on noise invariance, and specialize them to binary majority vote classifiers comprising linear kernels.

3. We combine noise invariance bounds with measures of reconstruction loss to realize rate-efficient machines that bound errors resulting from lossy compression.

Our bounds on noise invariance incorporate two key aspects central to the resilience of majority votes (James, 1998; Lacasse et al., 2006) to input distortion: (i) the agreement across voter pairs on predicted continuous outputs, and (ii) differentials of voter outputs with respect to shifting inputs. Both terms are intuitive: the first emphasizes how the consensus across voters should exist away from decision boundaries, and the second relates the importance of smooth voters to noise invariance. Generally, the noise invariance bounds we derive are applicable wherever inputs are reliably modelled as Gaussian processes. In Section 2 we give a concise introduction to relevant definitions in existing work on PAC-Bayesian theory. In Section 3 we formalise rate-efficient classifiers and state the problem we solve. In Section 4 we prime definitions on individual voters to qualify Section 5, where we derive noise invariance measures and bounds for majority votes. In Section 6 we discuss recent work on input perturbation bounds.
In Section 7 we report results on all relevant theorems to validate our proposal detailed in Sections 3, 4 and 5, and Section 8 concludes the paper.

2. An Introduction to PAC-Bayesian Theory

Let $(x, y)$ denote input-output pairs drawn from an unknown domain $D$ over the support $X \times Y$, where inputs $x$ are normalised such that $x \in X \subseteq [-1, 1]^d$, and $y \in Y$ defines targets in a binary classification setting when $Y = \{-1, 1\}$. Given a set of $m$ examples $S \sim D^m$ and a hypothesis set $H = \{h_i(x)\}$ where $h_i : X \to [-1, 1]$, we define the true risk $R(h_i)$ and empirical risk $R_S(h_i)$ as the expectation of errors made by classifiers $\mathrm{sgn}(h_i)$ on $D$ and $S$, respectively.

Definition 1. For any source domain $D$, and for any set of examples $S$, the true risk $R(h_i)$ and empirical risk $R_S(h_i)$ measure the expectations:

$R(h_i) := \mathbb{E}_{(x,y) \sim D}\, \mathbf{1}_{\mathbb{R}^-}\big(y\, h_i(x)\big)$, and $R_S(h_i) := \frac{1}{m} \sum_{(x,y) \in S} \mathbf{1}_{\mathbb{R}^-}\big(y\, h_i(x)\big)$,

where $\mathbf{1}_{\mathbb{R}^-}(c)$ defines the indicator function such that $\mathbf{1}_{\mathbb{R}^-}(c) = 1$ if $c \in \mathbb{R}^-$, and $\mathbf{1}_{\mathbb{R}^-}(c) = 0$ otherwise.

Specifically, the expectations of Definition 1 measure the mean zero-one loss (Bartlett et al., 2006). For any density function $Q(h_i)$ defined over $H$, PAC-Bayesian theory (Catoni, 2007; Langford & Shawe-Taylor, 2002; McAllester, 2003) bounds the true risk of majority vote (Bayes) classifiers $B_Q(x)$. The following formally states the output of kernelised majority votes $B_Q(x)$ weighted by $Q(h_i)$ when $H$ defines a set of $m$ voters $h_i(x)$ as functions of kernels on $(x_i, y_i) \in S$ (Germain et al., 2015).

Definition 2. For any input $x \in X$, for any posterior $Q$ defined over training examples $(x_i, y_i) \in S$, and for any set of kernelised voters $h_i(x) = y_i K(x, x_i)$:

$B_Q(x) = \mathrm{sgn}\big[\mathbb{E}_{(x_i, y_i) \sim Q}\, y_i K(x, x_i)\big]$.

Interestingly, for arbitrary hypotheses $h_i(x)$, many binary classification techniques implicitly define majority vote classifiers. For example, kernelised support vector machines (Gholami & Fakhari, 2017), boosting methods (Sagi & Rokach, 2018), and mixtures of experts (Shazeer et al., 2017) can all be construed as special cases of Definition 2.
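The $Q$-weighted vote of Definition 2 can be sketched numerically. The following is a minimal illustration, not the authors' implementation; the function and argument names (`bayes_classifier`, `support_x`, `q_weights`) are ours, and the kernel defaults to a plain dot product as in the linear voters used later in the paper.

```python
import numpy as np

def bayes_classifier(x, support_x, support_y, q_weights, kernel=np.dot):
    """Q-weighted majority vote in the spirit of Definition 2:
    sign of the posterior-weighted sum of voters h_i(x) = y_i * K(x, x_i)."""
    votes = np.array([y_i * kernel(x, x_i)
                      for x_i, y_i in zip(support_x, support_y)])
    # q_weights plays the role of the posterior Q over the support examples.
    return np.sign(np.asarray(q_weights) @ votes)
```

With a uniform posterior over a toy linearly separable support set, the vote recovers the sign of the first coordinate, matching the intuition that kernelised SVMs and boosting are special cases of this form.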
Learning accurate majority vote classifiers entails learning posteriors $Q(h_i|S)$ over $H$ such that the true risk $R(B_Q)$ of the $Q$-weighted majority vote $B_Q(x)$ is minimized. Bounds on the true risk of $B_Q(x)$ are commonly derived from intermediary bounds on the risk of Gibbs classifiers $G_Q(x)$ (Lacasse et al., 2006) that randomly sample hypotheses $h_i \sim Q$ to classify $x$. Indeed, the output of Gibbs classifiers can differ even if the same input is passed twice, and the following defines risk measures for $G_Q(x)$.

Definition 3. For any source domain $D$, and for any posterior $Q$ on $H$, the true Gibbs risk $R(G_Q)$ and empirical Gibbs risk $R_S(G_Q)$ measure the expectation of drawing an erroneous classifier $\mathrm{sgn}(h_i)$ from $Q$:

$R(G_Q) := \mathbb{E}_{(x,y) \sim D}\, \mathbb{E}_{h_i \sim Q}\, \mathbf{1}_{\mathbb{R}^-}\big(y\, h_i(x)\big)$, and $R_S(G_Q) := \frac{1}{m} \sum_{(x,y) \in S} \mathbb{E}_{h_i \sim Q}\, \mathbf{1}_{\mathbb{R}^-}\big(y\, h_i(x)\big)$.

From Definition 3, a trivial bound on the majority vote classifier can be directly derived as $R(B_Q) \le 2R(G_Q)$ (Lacasse et al., 2006; Langford & Shawe-Taylor, 2002). This is because, whenever $B_Q$ misclassifies $x$, the probability of drawing a classifier $h_i \sim Q$ that misclassifies $x$ is at least 0.5. Hence, it follows that the true risk of $B_Q(x)$ is at most twice the true risk of $G_Q(x)$, i.e., $R(B_Q) \le 2R(G_Q)$.

2.1. PAC-Bayesian Bounds on Measures of Risk

The canonical majority vote bounds of (Catoni, 2007; Germain et al., 2015; Langford & Shawe-Taylor, 2002; McAllester, 2003) on $R(G_Q)$ are typically parameterized by: (i) the number of training examples $m$ constituting a training set $S$ used to learn the posterior $Q(h_i|S)$, (ii) a prior distribution $P(h_i)$, and (iii) an arbitrary $\delta \in (0, 1]$ that specifies the probability of the bound. The following theorem expresses a common starting point for PAC-Bayesian bounds on $R(G_Q)$, first introduced by McAllester in (McAllester, 1999).
Theorem 1. For any source distribution $D$, for any prior $P$ on the hypothesis set $H$, for any convex distance function $\mathcal{D} : [0, 1] \times [0, 1] \to \mathbb{R}$, and for any posterior $Q$ learned by observing $m$ examples $x \in S$, with probability at least $1 - \delta$:

$\mathcal{D}\big(R_S(G_Q), R(G_Q)\big) \le \frac{1}{m}\Big[\mathrm{KL}(Q\|P) + \ln\Big(\frac{1}{\delta}\, \mathbb{E}_{S \sim D^m}\, \mathbb{E}_{h_i \sim P}\, e^{m \mathcal{D}(R_S(h_i), R(h_i))}\Big)\Big]$.

Proof: Applying Markov's inequality on the positive expression $\mathbb{E}_{h_i \sim P}\, e^{m \mathcal{D}(R_S(h_i), R(h_i))}$, and converting $\mathbb{E}_{h_i \sim P}$ to $\mathbb{E}_{h_i \sim Q}$, yields, with probability at least $1 - \delta$:

$\mathbb{E}_{h_i \sim Q}\, \frac{P(h_i)}{Q(h_i)}\, e^{m \mathcal{D}(R_S(h_i), R(h_i))} \le \frac{1}{\delta}\, \mathbb{E}_{S \sim D^m}\, \mathbb{E}_{h_i \sim P}\, e^{m \mathcal{D}(R_S(h_i), R(h_i))}$.

The result follows after taking $\ln(\cdot)$ on each side and applying Jensen's inequality by exploiting the concavity and convexity of $\ln(\cdot)$ and $\mathcal{D}$, respectively. For a step-by-step account of this proof, see (Germain et al., 2009; 2015).

Relevant to our proposal is the specialization of Theorem 1 to convex divergence measures $\mathcal{D}$ on the probability of binary events. Specifically, by setting $\mathcal{D}(R_S(G_Q), R(G_Q))$ as the Kullback-Leibler divergence of binomial distributions $\mathrm{kl}(R_S(G_Q)\|R(G_Q))$, the following corollary simplifies Theorem 1.

Corollary 1. For any source domain $D$, for any prior $P$ on the hypothesis set $H$, for the binomial distance function $\mathrm{kl}(\cdot\|\cdot) : [0, 1] \times [0, 1] \to \mathbb{R}$, and for any posterior $Q$ learned by observing $m$ examples $x \in S$, with probability at least $1 - \delta$:

$\mathrm{kl}\big(R_S(G_Q)\|R(G_Q)\big) \le \frac{1}{m}\Big[\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta}\Big]$, where $\xi(m) := \sum_{k=0}^{m} \binom{m}{k} (k/m)^k (1 - k/m)^{m-k}$.

Proof: The Kullback-Leibler divergence $\mathrm{kl}(q\|p)$ between two binomial distributions parameterised by $p$ and $q$ is:

$\mathrm{kl}(q\|p) = q \ln\frac{q}{p} + (1 - q) \ln\frac{1 - q}{1 - p}$.

By interpreting $R_S(h_i)$ and $R(h_i)$ as parameters of distinct binomial variates, $\xi(m)$ emerges after expanding the exponential term in $\mathbb{E}_{S \sim D^m}\, \mathbb{E}_{h_i \sim P}\, e^{m\, \mathrm{kl}(R_S(h_i)\|R(h_i))}$ of Theorem 1, where the expectation becomes:

$\mathbb{E}_{S \sim D^m}\, \mathbb{E}_{h_i \sim P} \Big(\frac{R_S(h_i)}{R(h_i)}\Big)^{m R_S(h_i)} \Big(\frac{1 - R_S(h_i)}{1 - R(h_i)}\Big)^{m(1 - R_S(h_i))}$.

Since $\mathbb{E}_{S \sim D^m}\, R_S(h_i) = R(h_i)$, the last expectation simplifies to $\xi(m)$. Following the proof of Theorem 1 then yields the result of the corollary. For more details, see (Germain et al., 2009; 2015).
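The two ingredients of Corollary 1, the binomial divergence $\mathrm{kl}(q\|p)$ and the combinatorial term $\xi(m)$, are cheap to compute directly. The sketch below is illustrative (the helper names are ours), using only the Python standard library:

```python
import math

def kl_bernoulli(q, p, eps=1e-12):
    """kl(q || p) between two Bernoulli/binomial parameters, as in Corollary 1.
    Inputs are clamped away from {0, 1} to keep the logarithms finite."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def xi(m):
    """xi(m) = sum_k C(m,k) (k/m)^k (1 - k/m)^(m-k), the complexity term of
    Corollary 1. Note Python evaluates 0.0 ** 0 as 1.0, matching the k = 0
    and k = m boundary terms."""
    return sum(math.comb(m, k) * (k / m) ** k * (1 - k / m) ** (m - k)
               for k in range(m + 1))
```

Since $\xi(m)$ satisfies $\sqrt{m} \le \xi(m) \le 2\sqrt{m}$, the $\ln \xi(m)$ term grows only logarithmically in $m$, which is what keeps the bound non-vacuous for moderate sample sizes.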
Importantly, even if Corollary 1 is typically used to bound the Gibbs risk via $R_S(G_Q)$ (Germain et al., 2009; 2015), the parameters of $\mathcal{D}(\cdot)$ can be set to bound other binary events defined as functions of $Q(h_i|S)$ and $P(h_i)$. Thus, in order to adapt Corollary 1 to noise invariance measures for majority vote classifiers $B_Q(x)$, in Sections 4 and 5 we derive exact expressions of noise invariance as expectations over $Q(h_i|S)$ and $P(h_i)$.

3. Formalising Rate-Efficient Machines

We propose to realize rate-efficient machines by quantifying the resilience of classifiers to input degradation via PAC-Bayesian bounds on noise invariance. Let $x_c = C(x, \theta_c)$ denote any lossy compression of $x$ parameterised by $\theta_c$. To give a probabilistic handle $\sigma_c^2 \in \mathbb{R}$ on reconstruction noise, and letting $I \in \mathbb{R}^{d \times d}$ denote the identity matrix, we fit $N(0, \sigma_c^2 I)$ on $x_c - x$ such that $\sigma_c^2$ quantifies the extent of distortion inflicted on $x$ by the compressor (see Figure 1). For high-rate compression regimes, Gaussians accurately model reconstruction loss $|x_c - x|$ in statistical representations of rate-distortion theory (Gray & Neuhoff, 1998). Capitalising on this, we define $\eta^D$ as the probability of output change with respect to input perturbation for any model $M : X \to Y$.

Definition 4. For any source domain $D$, for any classification model $M$, and for any noise vector $n \sim N(0, \sigma_c^2 I)$ modelling perturbations on $x$, noise invariance $\eta^D$ quantifies the probability of output change due to $n$:

$\eta^D = \mathbb{E}_{x \sim D}\, \mathrm{Pr}_{n \sim N}\big[M(x) \ne M(x + n)\big]$.

In distributed computing settings, inference (classifier) machines $M$ cannot observe $n$ directly. However, prior measures of $\sigma_c^2$ can be combined with knowledge of the functional form of $M$ to infer probabilities of misclassification when $M$ observes $x_c = x + n$. We therefore endeavor to design resilient models that give PAC bounds $B_\eta$ on the noise invariance $\eta^D$ of $M$ such that $\eta^D \le B_\eta$ with probability $1 - \delta_\eta$, whenever empirical measures are given for $\sigma_c^2$ at test time.
Our approach leverages the existing PAC-Bayesian theory (Germain et al., 2009; 2015; Langford & Shawe-Taylor, 2002; McAllester, 2003) qualified in Section 2 to derive bounds on $\eta^D$ for majority vote classifiers when $M(x) = B_Q(x)$. Note that we use the subscripted notation $\delta_\eta$ to distinguish it from $\delta$, which commonly denotes the probability of PAC bounds $B_R$ on the true risk of majority votes in existing literature on PAC-Bayesian inference (Germain et al., 2009; 2015).

Figure 1. Observation model of η-enabled rate-efficient inference. By combining analytical guarantees with measures of compression noise, we realize rate-efficient machines that bound noise invariance. Note that $M(x) = B_Q(x)$ returns PAC bounds on risk via $(\delta, B_R)$ and on compression noise invariance via $(\delta_\eta, B_\eta)$.

On the relevance of $\eta^D$ to rate-efficiency: Bounds $B_\eta$ on noise invariance $\eta^D$ allow for rate-efficient machines, where compressors $C(x, \theta_c)$ can compress inputs to limits that do not distort classification outputs beyond $B_\eta$ with probability $1 - \delta_\eta$. That is, by estimating the variance of density functions $N(0, \sigma_c^2 I)$ fitted on empirical measures of $x_c - x$ at test time, compression parameters $\theta_c$ can be tuned until $\sigma_c^2$ reaches a target value $\sigma_t^2$ that guarantees a bound $B_\eta$ on the noise invariance of $B_Q(x)$. For example, Figure 1 illustrates how measures of compression noise can be coupled with analytic PAC guarantees on noise invariance, such that compressors $C(x, \theta_c)$ can refine $x_c$ until realizing the highest $\sigma_c^2$ that is suitably bounded by $B_\eta$ and $\delta_\eta$. Importantly, compressors in Figure 1 can include any compression model with tunable parameters $\theta_c$.

4. Quantifying Voter Noise Invariance

We are interested in resilient models that give consistent outputs despite noise vectors $n \sim N(0, \sigma_c^2 I)$ inflicted on compressed inputs $x_c = x + n$.
To quantify the noise invariance of majority vote classifiers to compression noise, we begin by defining a measure of resilience $r_Q(x)$ on individual voters $h_i(x)$ of majority vote classifiers $B_Q(x)$ as the expectation of voter disagreement due to perturbations $n$ inflicted on compressed inputs.

Definition 5. For any source domain $D$, for any posterior $Q$ defined over continuous voters $h_i : X \to [-1, 1]$, and for any $n \sim N(0, \sigma_c^2 I)$, we define continuous voter resilience $r_Q(x)$ as:

$r_Q(x) := \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{n \sim N}\, h_i(x)\, h_i(x + n)$.

Notably, the product $h_i(x)\, h_i(x + n)$ in Definition 5 returns values in $\{-1, 1\}$ when both $h_i(x)$ and $h_i(x + n)$ yield exact binary values in $\{-1, 1\}$. However, $h_i(x)\, h_i(x + n)$ also gives a rich measure of the resilience of individual voters by returning intermediate values in $[-1, 1]$ whenever $h_i(x), h_i(x + n) \in [-1, 1]$, accounting for the certainty of $h_i(x)$ and $h_i(x + n)$ in assigning positive or negative classes. To understand the key characteristics that determine the resilience of individual voters to input perturbation, the next lemma decomposes the $r_Q(x)$ of Definition 5.

Lemma 1. For any input $x \in X$, for any posterior $Q$, and for any noise vector $n \sim N(0, \sigma_c^2 I)$, the resilience measure $r_Q(x)$ can be decomposed as:

$r_Q(x) = \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{n \sim N}\, \Big[h_i(x)^2 + h_i(x)\big(h_i(x + n) - h_i(x)\big)\Big]$.

Proof: We rearrange Definition 5 and use expectation properties to segregate $r_Q(x)$ into the two terms of the lemma:

$r_Q(x) := \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{n \sim N}\, h_i(x)\, h_i(x + n) = \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{n \sim N}\, h_i(x)\, \big[h_i(x) + \big(h_i(x + n) - h_i(x)\big)\big]$,

and distributing $h_i(x)$ into the sum yields the lemma.

The expression of Lemma 1 highlights the dependence of $r_Q(x)$ on two measures: (i) the confidence of individual voters $h_i(x)^2 \in [0, 1]$, where $h_i(x)^2$ scales quadratically against votes $h_i(x)$, and (ii) the differential of voter outputs $h_i(x + n) - h_i(x)$, which measures the amount of change on $h_i(x)$ as a result of perturbing inputs $x$ with a noise vector $n$.
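The resilience measure of Definition 5 is straightforward to estimate by sampling the noise expectation. The following is a minimal sketch under our own naming (`voter_resilience`, `q_weights`); the voters are any callables $h_i : \mathbb{R}^d \to [-1, 1]$:

```python
import numpy as np

def voter_resilience(voters, q_weights, x, sigma2_c, n_draws=2000, seed=0):
    """Monte-Carlo estimate of r_Q(x) (Definition 5): the Q-weighted
    expectation of h_i(x) * h_i(x + n) over n ~ N(0, sigma2_c I)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        n = rng.normal(0.0, np.sqrt(sigma2_c), size=x.shape)
        # inner expectation over Q, approximated by the weighted finite sum
        total += sum(q * h(x) * h(x + n) for q, h in zip(q_weights, voters))
    return total / n_draws
```

Consistent with Lemma 1, a confident voter (large $h_i(x)^2$) far from its decision boundary yields $r_Q(x)$ close to 1, while a voter sitting exactly on its boundary ($h_i(x) = 0$) contributes nothing.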
It is also interesting to point out that, from Lemma 1, the term $h_i(x + n) - h_i(x)$ correlates with the differentials $\partial h_i(x) / \partial x$ specifically in local regions softly demarked by Gaussian hyperspheres defined over the multivariate $N(0, \sigma_c^2 I)$. To close on properties relating to individual voters, the following defines $a_Q(x)$ as the expectation of agreement between continuous voter pairs.

Definition 6. For any input $x \in X$ and for any posterior $Q$, $a_Q(x)$ quantifies the agreement between voters as:

$a_Q(x) := \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{h_j \sim Q}\, h_i(x)\, h_j(x)$.

Agreement as measured by Definition 6 will become helpful in simplifying expressions relating to the noise invariance of Bayes classifiers $B_Q(x)$. It is also worth noting here that $a_Q(x)$ correlates inversely with the concept of disagreement in the C-bound of (Germain et al., 2015), which emerges often in derivations of PAC-Bayesian bounds on risk. Specifically, the C-bound suggests that majority votes perform a trade-off between lower Gibbs risk $R(G_Q)$ and higher disagreements (lower $a_Q(x)$) to achieve better majority vote risks $R(B_Q)$. Bounds defined as functions of voter agreement benefit from the unsupervised nature of $a_Q(x)$, since it can be measured with higher precision without needing labels $y$, and the bound on noise invariance that we derive in Section 5 inherits this unsupervised aspect of $a_Q(x)$.

5. Noise Invariance of the Majority Vote

Following Definition 4, we define the noise invariance $\eta_Q^D$ of majority vote classifiers as the probability of disagreement $M(x) \ne M(x + n)$ when $M(x) = B_Q(x)$.
Definition 7. For any majority vote classifier $B_Q$, for any posterior $Q$, and for any noise vector $n \sim N(0, \sigma_c^2 I)$, the resilience $\eta_Q^D$ of $B_Q(x)$ to perturbations on $x$ is:

$\eta_Q^D := \mathbb{E}_{x \sim D}\, \mathrm{Pr}_{n \sim N}\Big[\Big(\mathbb{E}_{h_i \sim Q}\, h_i(x + n)\Big)\Big(\mathbb{E}_{h_j \sim Q}\, h_j(x)\Big) \le 0\Big]$.

Lemma 2. For any posterior $Q$, for any noise vector $n \sim N(0, \sigma_c^2 I)$, and by expanding $h_i(x + n)$, the noise invariance measure $\eta_Q^D$ can be decomposed as:

$\eta_Q^D = \mathbb{E}_{x \sim D}\, \mathrm{Pr}_{n \sim N}\Big[\mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{h_j \sim Q}\, h_i(x)\, h_j(x) \le -\,\mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{h_j \sim Q}\, h_j(x)\big(h_i(x + n) - h_i(x)\big)\Big]$.

Proof: Exploiting the fact that $h_i(x + n) = h_i(x) + \big(h_i(x + n) - h_i(x)\big)$, we rearrange Definition 7 and use expectation properties to decompose the inner multiplication:

$\Big(\mathbb{E}_{h_i \sim Q}\, h_i(x + n)\Big)\Big(\mathbb{E}_{h_j \sim Q}\, h_j(x)\Big) = \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{h_j \sim Q}\, h_i(x)\, h_j(x) + \mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{h_j \sim Q}\, h_j(x)\big(h_i(x + n) - h_i(x)\big)$.

Substituting the last expression in the inequality of Definition 7 yields the result of the lemma.

Notably from Lemma 2, since the L.H.S. of its inequality expresses $a_Q(x)$, noise invariance $\eta_Q^D$ becomes a function of the agreement between individual voters expressed in Definition 6. Next, we show how to leverage Lemma 2 in order to tractably compute the resilience of majority vote classifiers $B_Q(x)$.

5.1. Exact Expressions of $\eta_Q^D$ for Linear Kernels

So far, we discussed general Bayes classifiers defined over sets of kernelised voters, where each kernel is parameterized by training examples $(x_i, y_i) \in S$. Interestingly, some voter kernels yield exact expressions of Lemma 2 without the need to calculate expectations over $n$; salient among these are linear classifiers $h_i(x) = y_i x_i^\top x$ that specify constant differentials $\partial h_i(x) / \partial x = y_i x_i$. By exploiting the commutative property of linear products to simplify $h_i(x + n) - h_i(x)$, the next lemma collapses the multivariate expression of noise in Lemma 2 to a tractable univariate.
Lemma 3. For any majority vote classifier defining a posterior $Q$ over normalised linear voters $x_i \in S \subseteq X$ where $h_i(x) = y_i x_i^\top x$, and when $\omega \in \mathbb{R}^d$ is a variate defining $\omega = \mathbb{E}_{x_i \sim Q}\, \mathbb{E}_{x_j \sim Q}\, (x_j^\top x)\, x_i$, noise invariance $\eta_Q^D$ becomes:

$\eta_Q^D = \mathbb{E}_{x \sim D}\, \mathrm{Pr}_{t \sim N(0,\, \omega^\top \sigma_c^2 I \omega)}\big[a_Q(x) + t < 0\big]$.

Proof: From the R.H.S. of Lemma 2, and by exploiting the commutative property of linear dot products:

$\mathbb{E}_{h_i \sim Q}\, \mathbb{E}_{h_j \sim Q}\, h_j(x)\big(h_i(x + n) - h_i(x)\big) = n^\top\, \mathbb{E}_{x_i \sim Q}\, \mathbb{E}_{x_j \sim Q}\, (x_j^\top x)\, x_i = n^\top \omega$, with $n \sim N(0, \sigma_c^2 I)$, so that $n^\top \omega \sim N(0,\, \omega^\top \sigma_c^2 I \omega)$.

Letting $\omega = \mathbb{E}_{x_i \sim Q}\, \mathbb{E}_{x_j \sim Q}\, (x_j^\top x)\, x_i$ and exploiting the multiplication property of multivariate Gaussians yields the last step, and the lemma is obtained by substituting $t \sim N(0,\, \omega^\top \sigma_c^2 I \omega)$ in Lemma 2.

From Lemma 3, we derive an analytic expression of $\eta_Q^D$ for majority vote classifiers comprising linear kernels.

Theorem 2. For any majority vote classifier defining a posterior $Q$ over normalised linear voters $x_i \in S \subseteq X$ where $h_i(x) = y_i x_i^\top x$, and when $\omega = \mathbb{E}_{x_i \sim Q}\, \mathbb{E}_{x_j \sim Q}\, (x_j^\top x)\, x_i$, invariance coefficients $\eta_Q^D$ simplify to:

$\eta_Q^D = \mathbb{E}_{x \sim D}\, \frac{1}{2}\Big[1 + \mathrm{erf}\Big(\frac{-a_Q(x)}{\sqrt{2\, \omega^\top \sigma_c^2 I \omega}}\Big)\Big]$.

Proof: From the inequality of Lemma 3, moving $t$ to the R.H.S. gives:

$\eta_Q^D = \mathbb{E}_{x \sim D}\, \mathrm{Pr}_{t \sim N(0,\, \omega^\top \sigma_c^2 I \omega)}\big[t < -a_Q(x)\big]$.

When $t \sim N(0,\, \omega^\top \sigma_c^2 I \omega)$, the last expression simplifies to the Gaussian CDF $\Phi\big({-a_Q(x)}\big) = \frac{1}{2}\big[1 + \mathrm{erf}\big(\frac{-a_Q(x)}{\sqrt{2\, \omega^\top \sigma_c^2 I \omega}}\big)\big]$, where $\mathrm{erf}(z) := \frac{2}{\sqrt{\pi}} \int_0^z e^{-u^2}\, du$ defines the Gaussian error function.

The result of Theorem 2 calculates $\eta_Q^D$ knowing only $B_Q(x)$ and $\sigma_c^2$, and is tractable via importance sampling due to the collapsed univariate $t$. Hence, any non-vacuous bound on the $\eta_Q^D$ of Theorem 2 satisfies all the requirements described in Section 3 characterising good indicators of over-compression. Also interesting to note here is that, whenever $N(x - x_c\,|\,0, \sigma_c^2 I)$ expresses an isotropic Gaussian symmetric about 0, the $\eta_Q^D$ of Theorem 2 can never exceed 0.5. This is because $a_Q(x) \ge 0$, such that Theorem 2 always returns the integration of $N(0,\, \omega^\top \sigma_c^2 I \omega)$ up to at most the center of the Gaussian, thereby ensuring $\eta_Q^D \le 0.5$.
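The closed form of Theorem 2 requires only the agreements $a_Q(x)$ and the vectors $\omega$ per input. A minimal sketch (our own naming; `agreements` and `omegas` hold $a_Q(x)$ and $\omega$ for a sample of inputs) is:

```python
import numpy as np
from math import erf, sqrt

def eta_linear(agreements, omegas, sigma2_c):
    """Closed-form noise invariance per Theorem 2: for each input x,
    eta(x) = 0.5 * (1 + erf(-a_Q(x) / sqrt(2 * sigma2_c * omega^T omega))),
    averaged over the sample of inputs."""
    etas = []
    for a, w in zip(agreements, omegas):
        s = sqrt(2.0 * sigma2_c * float(np.dot(w, w)))  # std of the collapsed t
        etas.append(0.5 * (1.0 + erf(-a / s)))
    return float(np.mean(etas))
```

Because the agreements are non-negative, the per-input value never exceeds 0.5, matching the symmetry remark above: zero agreement gives exactly 0.5, and strong agreement drives the flip probability towards zero.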
This is also intuitive, since binary linear classifiers define demarcation hyperplanes in $X$, and examples are always equally likely to move towards or away from classification hyperplanes.

5.2. Upper-Bounding $\eta_Q^D$ and Distortion-Inflicted Risk

To account for the double expectations $\eta_Q^D$ defines over voter pairs $(h_i, h_j)$, and in order to bound $\eta_Q^D$ via Corollary 1, similar to (Germain et al., 2015) we consider posteriors $Q^2$ defined over $H^2 := H \times H$, denoting hypothesis sets comprising voter pairs $h_{ij} = (h_i, h_j)$. The product rule gives $Q^2(h_{ij}) = Q(h_i)Q(h_j)$, and the next theorem adapts Corollary 1 to $\eta_Q^D$.

Theorem 3. For any source distribution $D$, for any prior $P^2$ on the hypothesis set $H^2$, for any posterior $Q^2$ learned by observing $S \sim D^m$, and for any arbitrary probability $\delta_\eta \in (0, 1]$, with probability at least $1 - \delta_\eta$:

$\mathrm{kl}(\eta_Q^S\|\eta_Q^D) \le \frac{1}{m}\Big[2\, \mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta_\eta}\Big]$.

Proof: Because $\eta_Q^D$ is a binary event defined over $Q^2(h_{ij}|S)$, the KL divergence term becomes:

$\mathrm{KL}(Q^2\|P^2) = \mathbb{E}_{h_{ij} \sim Q^2}\, \Big[\ln\frac{Q(h_i)}{P(h_i)} + \ln\frac{Q(h_j)}{P(h_j)}\Big]$.

The last expression simplifies to $2\, \mathrm{KL}(Q\|P)$, and we then apply Corollary 1 to get the bound of the theorem. For a detailed account on paired voters and how they relate to Corollary 1, see (Germain et al., 2015).

Corollary 2. For any arbitrary probability $\delta_\eta \in (0, 1]$, there exists an upper bound $\eta_Q^D \le B_\eta$ with probability $1 - \delta_\eta$ when $B_\eta$ defines the supremum:

$B_\eta := \sup\Big\{b_\eta : \mathrm{kl}(\eta_Q^S\|b_\eta) \le \frac{1}{m}\Big[2\, \mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta_\eta}\Big]\Big\}$.

Proof: Upper bounds $B_\eta$ on $\eta_Q^D$ define the highest value of $\eta_Q^D$ that satisfies Theorem 3, and this yields the corollary.

5.3. Bounding the Effect of $\eta_Q^D$ on True Risk $R(B_Q)$

Bounds $B_\eta$ on $\eta_Q^D$ actually carry additional implicit bounds $B_R$ on the true risk of $B_Q(x + n)$ after noise distorts the input of majority votes $B_Q(x)$. Specifically, whenever the probability of output change $\eta_Q^D$ is bounded by Theorem 3, and when $R(B_Q)$ measures the true risk of $B_Q(x)$ for non-perturbed inputs $x$, the next corollary expresses $B_R$ as a closed-form expression of $R(B_Q)$ and $B_\eta$.
Corollary 3. For any source domain $D$ and for any bound $B_\eta$ on $\eta_Q^D$, there exists a bound $R(B_Q(x + n)) \le B_R$ after noise $n \sim N(0, \sigma_c^2 I)$ is applied on $x$, where $B_R$ measures:

$B_R := R(B_Q) + B_\eta - R(B_Q)\, B_\eta$.

Proof: When $B_Q(x + n)$ infers targets from $m$ perturbed examples $x$, the total number of correct outputs is at least $\big(1 - R(B_Q)\big)m - \big(1 - R(B_Q)\big)B_\eta m$, and $B_R$ measures:

$B_R := 1 - \frac{1}{m}\Big[\big(1 - R(B_Q)\big)m - \big(1 - R(B_Q)\big)B_\eta m\Big] = 1 - \big(1 - R(B_Q)\big)\big(1 - B_\eta\big)$.

The corollary follows by simplifying the last expression.

Closing remarks on $B_\eta$: From Corollary 2 we observe that upper bounds $B_\eta$ on $\eta_Q^D$ are actually functions of: (i) the posterior $Q(h_i|S)$, (ii) the prior $P(h_i)$, and (iii) the variance $\sigma_c^2$ of input compression noise $n$. Therefore, by combining the bound of Theorem 3 with empirical estimates of $\sigma_c^2$ at test time, implications of compression are known before outputs of majority vote classifiers $B_Q(x)$ are observed. Thus, using Theorem 3 and following the prescription of Section 3, we are able to realize rate-efficient machines that compress inputs up to known limits that guarantee the bound $\eta_Q^D \le B_\eta$ with probability $1 - \delta_\eta$.

5.4. Final Observations on Voter Kernels and $\eta_Q^D$

On regularizing kernel selection: When classifiers $B_Q(x)$ are defined per Definition 2 as majority votes over a set of training kernels $x_i \in S$, Definition 7 expresses noise invariance coefficients $\eta_Q^D$ as functions of: individual voters $h_i(x)$, training kernels $x_i$, and parameters $Q(h_i|S)$ that weight the importance of each voter. Thus, noise-invariant Bayes classifiers $B_Q(x)$ specify posteriors $Q(h_i|S)$ that prioritise smooth voters with kernels minimising $\partial h_i(x) / \partial x$. Regulating training to learn such posteriors is possible via the MinCq learning algorithm (Germain et al., 2013), where some voters can be favoured over others by setting higher probabilities $Q(h_i|S)$ for kernels $x_i$ that return lower differentials $\partial h_i(x) / \partial x$. Further exploration of this is outside the scope of our proposal, and we leave implementations of noise invariance regulation to future work.
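The bound inversion of Corollary 2 and the distorted-risk bound of Corollary 3 reduce to a short numerical recipe: since $\mathrm{kl}(\eta_Q^S\|b)$ is increasing in $b$ on $[\eta_Q^S, 1]$, the supremum $B_\eta$ can be found by bisection. The sketch below is illustrative; the helper names are ours:

```python
import math

def kl_bernoulli(q, p, eps=1e-12):
    """Binomial KL divergence kl(q || p), clamped for numerical safety."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def invert_kl_upper(eta_s, rhs, tol=1e-9):
    """B_eta of Corollary 2: the largest b >= eta_s with kl(eta_s || b) <= rhs,
    where rhs = (2*KL(Q||P) + ln(xi(m)/delta_eta)) / m. kl is monotone
    increasing in b on [eta_s, 1], so bisection suffices."""
    lo, hi = eta_s, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(eta_s, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

def risk_after_noise(risk, b_eta):
    """B_R of Corollary 3: worst-case risk once outputs flip w.p. B_eta."""
    return risk + b_eta - risk * b_eta
```

A larger right-hand side (looser evidence) yields a larger $B_\eta$, and $B_R$ interpolates between the clean risk and 1 as $B_\eta$ grows, as Corollary 3 prescribes.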
On viable kernel types: Generally, the bound of Theorem 3 is applicable for any majority vote whenever voters $h_i(x)$ are continuously differentiable. For example, this is the case for Multi-Layer Perceptrons (MLPs), which follow linear transformations by sigmoid non-linearities such that $K(x, x_i) = [1 + \exp(-x_i^\top x)]^{-1}$. Incidentally, MLPs can also be construed as majority vote classifiers, because individual weight vectors make decisions on $x$ that are subsequently pooled. Moreover, since the bounds of Theorem 2 can be calculated prior to compression, the tractability of the compressive loop in Figure 1 is not affected by the complexity of kernels. We leave to future work the derivation of non-vacuous bounds on $\eta_Q^D$ for majority vote classifiers with non-linear kernels.

Figure 2. (Left) Noise invariance bound $B_\eta$, training noise invariance $\eta_Q^S$, and testing noise invariance $\eta_Q^T$ for increasing values of input degradation as quantified by $\sigma_c^2$. (Right) Contours of the true risk bound $B_R$ after outputs are distorted (flipped) with probability $\eta_Q^D$; yellower hues indicate higher values of $B_R(B_Q)$.

6. Related Work

The notion of noise invariance is closely related, but not equivalent, to adversarial robustness (Goodfellow et al., 2015), which measures the probability $P(M(x + n^*) \ne y)$ of misclassifying ground truth $y$ due to learnt perturbations $n^*(x, y, M)$, where $n^* = \arg\max_n P(M(x + n) \ne y)$. To study the adverse effect of input perturbations, recent proposals (Cohen et al., 2019; Montasser et al., 2020; Rahimian, 2019; Vidot et al., 2021) introduced various bounds on adversarial robustness and decision making under uncertainty. For instance, (Cohen et al., 2019) provide bounds that estimate minimum perturbation norms required to yield adversarial examples, and (Salman et al., 2019) extend the method of (Cohen et al., 2019) to yield tighter bounds via adversarial training.
More recently, (Vidot et al., 2021) provide bounds on adversarial robustness in white-box settings to understand when differentiable decision trees fail against adversarial attacks with high probabilities $P(M(x + n^*) \ne y)$. Importantly, the contribution of (Vidot et al., 2021) gives supervised bounds on risk averaged over sets of learnt perturbations $n^*$, and is not applicable to the iterative compression setting detailed in Section 3. While the works cited above all study different aspects of perturbation invariance, our method contrasts with (Cohen et al., 2019; Montasser et al., 2020; Vidot et al., 2021) in that it is the first to: (i) yield unsupervised bounds on probabilities $P(M(x + n) \ne M(x))$ outside the context of adversarial robustness, which assumes knowledge of ground truth $y$ and posteriors $P(n^*|x, M)$ that maximise $P(M(x + n^*) \ne y)$; (ii) define tractable bounds as functions of $(M, \sigma_c^2)$ that can be measured offline prior to compression; and (iii) give bounds on $\eta_Q^D$ that are not averaged over perturbation sets, where $\eta_Q^D$ measures the probability of flipped outputs for any input $x \in X$. The unique set of properties detailed in (i)-(iii) allows us to integrate PAC-Bayesian bounds into the compression loop of Figure 1 to derive the rate-efficient PAC-Bayesian classifiers detailed in Section 3.

7. Evaluation

Test setting: To address a class of problems where inputs are typically costly to compress and stream prior to inference, and to overlap our evaluation with vision applications where inputs undergo lossy compression, we focus our experimental validation on joint image compression and classification. We evaluate the measures and bounds of Theorem 3 and Corollary 3 on the handwritten digits dataset MNIST (Deng, 2012) in two distinct settings:

1. Controlled test conditions where noise is drawn directly from Gaussian densities $N(0, \sigma_c^2)$ to perturb inputs $x$, and $\sigma_c^2$ is directly specified.

2.
A distributed visual inference setting, where inputs are compressed via JPEG2000 (Rabbani, 2002) prior to inference, and compression is tuned via a quality parameter $q$. Compression noise here is induced naturally when the compressor fails to accurately reconstruct $x$.

Adapting MNIST to binary classification: Following established practice (Germain et al., 2015; Letarte et al., 2019), we split MNIST into 45 binary classification tasks¹, where each task is exclusive to a unique pair of MNIST classes. For each task, a training set $S$ with $m = 500$ is randomly sampled, and the remaining examples go to a test set $T$ used for validation. Each task uses a unique set of examples, such that any pair of datasets intersects in the empty set, and combining all datasets returns all examples in MNIST. All results are averaged after validating on all 45 unique tasks.

Measuring $\sigma_c^2$: For tests on naturally occurring noise, we derive measures on $\sigma_c^2$ by iterating over JPEG2000 compression rounds until a specified target $\sigma_t^2$ is met (as per Figure 1). When $\mathrm{JPEG}(x, \theta_c)$ denotes JPEG compression (Rabbani, 2002) with parameters $\theta_c$, and $\sigma^2(\cdot)$ is the variance of fitted isotropic Gaussians $N(0, \sigma^2)$, we perform:

$\sigma_c^2 = \max_{\theta_c}\, \sigma^2\big(x - \mathrm{JPEG}(x, \theta_c)\big)\quad \text{s.t.}\quad \sigma_c^2 \le \sigma_t^2$.

¹ The 10 MNIST classes give $9? = 45$ binary classification tasks, where $k?$ denotes the $k$th triangle sum (Knuth, 2014).

Table 1. Training and testing noise invariance $(\eta_Q^S, \eta_Q^T)$, noise invariance bounds $B_\eta$, and test risk $R_T(B_Q)$ as measured on MNIST. Lower values of $(\eta_Q^S, \eta_Q^T, B_\eta, R_T)$ indicate higher resilience to noise and correspond to higher rate-efficiency. For MNIST-JPEG we report $d_c$ as the average number of bytes per image as encoded by JPEG (Rabbani, 2002), and $n \sim C(x, \theta_c)$ denotes compression noise.
MNIST, $n \sim N(0, \sigma_c^2)$:

  σ²c      η^S_Q    η^T_Q    |η^T_Q − B_η|   B_η      B_R     R_T
  0.010    0.046    0.013    0.149           0.162    0.24    0.016
  0.100    0.180    0.275    0.058           0.333    0.29    0.022
  0.250    0.235    0.348    0.052           0.400    0.32    0.023
  0.500    0.294    0.392    0.063           0.455    0.36    0.023
  0.750    0.315    0.411    0.077           0.488    0.39    0.026
  1.000    0.338    0.470    0.031           0.501    0.44    0.029

MNIST-JPEG, $n \sim C(x, \theta_c)$:

  q     d_c    η^S_Q    η^T_Q    |η^T_Q − B_η|   B_η      R_T
  90    571    0.001    0.000    0.060           0.060    0.19
  70    293    0.002    0.008    0.056           0.064    0.23
  50    254    0.002    0.007    0.053           0.060    0.26
  30    208    0.002    0.026    0.043           0.069    0.26
  10    194    0.003    0.064    0.003           0.067    0.27
  5     188    0.004    0.069    0.002           0.071    0.28

Used classifiers: We construct majority votes on finite self-complemented hypothesis sets $H$ (Germain et al., 2015) of real-valued voters $h_i(x)$ to learn $Q(h_i|S)$ via MinCq (Jean, 2019) when posteriors $Q(h_i|S)$ align with priors $P(h_i)$. Specifically, our implementation² assigns training sets $S$ of $m$ training examples $x_i \in S$ to constitute kernels of linear voters $h_i(x) = x_i^\top x$ that define a majority vote classifier $B_Q(x)$, and we use the quadratic program of (Jean, 2019) to solve for $Q(h_i|S)$. To validate the unsupervised bounds of Theorem 3 and Corollary 3, Table 1 and Figure 2 report relevant empirical measures on noise invariance when $\delta = 0.05$ and $\delta_\eta = 0.05$. For all parameters not explicitly mentioned in our discussion, we retain the specifications of the GRAAL-Research MinCq implementation (Jean, 2019).

7.1. Evaluation on Controlled Noise

Table 1 (left) validates the results of applying Theorem 3 and Corollary 2 on MNIST examples when artificial noise is applied to $x$ such that $x_c = x + n$ and $n \sim N(0, \sigma_c^2 I)$. We observe from Table 1 (left) that training noise invariance coefficients $\eta_Q^S$ are around 0.3 when $\sigma_c^2 \ge 0.5$, and are at most 0.235 when $\sigma_c^2 \le 0.25$, giving a baseline on $\eta_Q^D$ for source domains similar to those of MNIST.
Moreover, we note that the difference |η^T_Q − B_η| never exceeds 0.08 when σ²_c ≥ 0.1, showing that B_η is in fact very informative of true noise invariance coefficients for higher values of σ²_c, where η^T_Q never deviates substantially from B_η. This is also reflected in Figure 2 (left), where test set measures of η^T_Q neatly track η^S_Q for σ²_c ≤ 0.1, and mostly fall at the mid-point between η^S_Q and B_η for σ²_c ≥ 0.2.

7.2. Evaluation on JPEG2000 Compression Noise

Table 1 (right) provides results for noise stemming from JPEG2000 compression, the most common in image compression practice. We emulate a typical distributed visual inference setting as per Figure 1, where inputs are compressed prior to inference to match specified constraints on bitrate. To meet any PAC guarantee on noise invariance, we calculate the target σ²_t that yields a specified η^S_Q from Theorem 3, and vary JPEG compression parameters θ_c to fit σ²_c as prescribed by Equation 7. Table 1 details relevant results under MNIST-JPEG, and additionally reports the dimension of the compression latent space d_c as the average number of bytes resulting from JPEG encoding. Similar to our observations on synthetically perturbed MNIST inputs, the right half of Table 1 reports low absolute differences |η^T_Q − B_η|, indicating that bounds on η^T_Q are valid even when noise is drawn from complex structures (i.e., the non-linear block models of JPEG in this case). Interestingly, we see that the noise variance σ²_c induced by JPEG compression always falls within [0.016, 0.026], and so only the bluer left extremes of Figure 2 (right) become relevant for JPEG compression. By inspecting Figure 2 (right) when σ²_c ∈ [0.016, 0.026], we note that the discrepancy between R(B_Q) and B_R(B_Q) is relatively low compared to the exponentially increasing differences B_R − R(B_Q) when approaching the right extremities of Figure 2 (right), indicated by yellower hues.

²https://github.com/git-alhabib/pacb-ni
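The fit prescribed by Equation 7 can be approximated by a simple search over compression parameters: pick the most aggressive setting whose residual noise variance still respects the target σ²_t. A minimal sketch, where a uniform quantizer stands in for the JPEG2000 coder and the grid of candidate parameters is illustrative:

```python
def quantize(x, step):
    """Toy lossy 'compressor': uniform quantization with step size `step`.

    Stands in for JPEG(x, theta_c); larger steps mean heavier compression.
    """
    return [round(v / step) * step for v in x]

def residual_variance(x, xc):
    """Variance of a zero-mean isotropic Gaussian fitted to the residual x - xc."""
    n = [a - b for a, b in zip(x, xc)]
    return sum(v * v for v in n) / len(n)

def fit_compression(x, sigma2_t, steps):
    """Most aggressive parameter whose residual variance stays <= sigma2_t.

    Implements the search of Eq. (7): maximize sigma2_c over candidate
    parameters subject to sigma2_c <= sigma2_t.
    """
    best_step, best_var = None, 0.0
    for step in steps:
        var = residual_variance(x, quantize(x, step))
        if best_var <= var <= sigma2_t:
            best_step, best_var = step, var
    return best_step, best_var
```

With a real JPEG2000 coder the candidate grid would range over its quality parameter q, and the returned setting gives the shortest code length whose induced noise still satisfies the prescribed PAC guarantee.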
Thus, whenever σ²_c is sufficiently low, bounds on B_R give value to applications that require guarantees on the risk degradation induced by lossy input compression.

8. Conclusion

We introduce the notion of rate-efficient classifiers to PAC-Bayesian theory. The unsupervised noise invariance bounds we derive are intuitive and highlight the importance of voters that change gradually with respect to perturbations on the input. By inspecting the variance of Gaussians fitted on compression noise, our evaluation shows that the proposed bounds are non-vacuous and well-suited to manage compression noise prior to inference, and we demonstrate this on JPEG2000 compression. Notably, the presented noise invariance bounds are general, and applicable wherever inputs are sourced from Gaussian processes (e.g., in models of disease progression, physics, and signal processing). Future work may investigate more applications, derivations of η^D_Q for non-linear kernels, and regularization techniques that give tighter bounds on η^D_Q.

9. Acknowledgements

This work was supported by the UK Engineering and Physical Sciences Research Council [Grant IDs: EP/P02243X/1, EP/R025290/1, EP/R035342/1, EP/S028455/1].

References

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Lecture Notes Monograph Series, 56, 2007.

Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. PMLR, 2019.

Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers.
In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 353–360, 2009.

Germain, P., Habrard, A., Laviolette, F., and Morvant, E. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning, pp. 738–746. PMLR, 2013.

Germain, P., Lacasse, A., Laviolette, F., Marchand, M., and Roy, J. F. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16(26):787–860, 2015.

Gholami, R. and Fakhari, N. Support vector machine: principles, parameters, and applications. In Handbook of Neural Computation, pp. 515–535. Elsevier, 2017.

Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Gray, R. M. and Neuhoff, D. L. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.

James, G. M. Majority vote classifiers: theory and applications. Stanford University, 1998.

Jean, R. https://github.com/graal-research/mincq. GRAAL-Research, 2019.

Knuth, D. The Art of Computer Programming. Addison-Wesley Professional, 2014.

Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, pp. 769–776, 2006.

Langford, J. and Shawe-Taylor, J. PAC-Bayes & margins. Advances in Neural Information Processing Systems, 15:439–446, 2002.

Letarte, G., Germain, P., Guedj, B., and Laviolette, F. Dichotomize and generalize: PAC-Bayesian binary activated deep neural networks. In Advances in Neural Information Processing Systems, pp. 6872–6882, 2019.

McAllester, D. A. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

McAllester, D. A. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

Montasser, O., Hanneke, S., and Srebro, N.
Reducing adversarially robust learning to non-robust PAC learning. In Advances in Neural Information Processing Systems, 2020.

Pradhan, S. S., Kusuma, J., and Ramchandran, K. Distributed compression in a dense microsensor network. IEEE Signal Processing Magazine, 19(2):51–60, 2002.

Rabbani, M. JPEG2000: Image compression fundamentals, standards and practice. Journal of Electronic Imaging, 11(2):286, 2002.

Rahimian, H. Distributionally robust optimization: A review. 2019.

Sagi, O. and Rokach, L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.

Salman, H., Li, J., Razenshteyn, I., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32:11292–11303, 2019.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Vidot, G., Viallard, P., Habrard, A., and Morvant, E. A PAC-Bayes analysis of adversarial robustness. Advances in Neural Information Processing Systems, 34, 2021.

Xiao, J., Ribeiro, A., Luo, Z.-Q., and Giannakis, G. B. Distributed compression-estimation using wireless sensor networks. IEEE Signal Processing Magazine, 23(4):27–41, 2006.