Published as a conference paper at ICLR 2022

CONTROLLING DIRECTIONS ORTHOGONAL TO A CLASSIFIER

Yilun Xu, Hao He, Tianxiao Shen, Tommi Jaakkola
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
{ylxu, haohe, tianxiao}@mit.edu; tommi@csail.mit.edu

We propose to identify directions invariant to a given classifier so that these directions can be controlled in tasks such as style transfer. While the orthogonal decomposition is directly identifiable when the given classifier is linear, we formally define a notion of orthogonality in the non-linear case. We also provide a surprisingly simple method for constructing the orthogonal classifier (a classifier utilizing directions other than those of the given classifier). Empirically, we present three use cases where controlling orthogonal variation is important: style transfer, domain adaptation, and fairness. The orthogonal classifier enables desired style transfer when domains vary in multiple aspects, improves domain adaptation with label shifts, and mitigates unfairness as a predictor. The code is available at https://github.com/Newbeeer/orthogonal_classifier.

1 INTRODUCTION

Many machine learning applications require explicit control of directions that are orthogonal to a predefined one. For example, to ensure fairness, we can learn a classifier that is orthogonal to sensitive attributes such as gender or race (Zemel et al., 2013; Madras et al., 2018). Similarly, if we transfer images from one style to another, content other than style should remain untouched; therefore, images before and after transfer should align in directions orthogonal to style. Common to these problems is the task of finding an orthogonal classifier. Given any principal classifier operating on the basis of principal variables, our goal is to find a classifier, termed the orthogonal classifier, that predicts the label on the basis of orthogonal variables, defined formally later.

The notion of orthogonality is clear in the linear case. Consider a joint distribution $P_{XY}$ over $X \in \mathbb{R}^d$ and a binary label $Y$. Suppose the label distribution is Bernoulli, i.e., $P_Y = \mathcal{B}(Y; 0.5)$, and the class-conditional distributions are Gaussian, $P_{X|Y=y} = \mathcal{N}(X; \mu_y, \sigma_y^2 I)$, where the means and variances depend on the label. If the principal classifier is linear, $w_1 = \Pr(Y = 1 \mid \theta_1^\top x)$, any classifier $w_2$ in the set $\mathcal{W}_2 = \{\Pr(Y = 1 \mid \theta_2^\top x) : \theta_1^\top \theta_2 = 0\}$ is considered orthogonal to $w_1$. Thus the two classifiers $w_1, w_2$, with orthogonal decision boundaries (Fig. 1), focus on distinct but complementary attributes for predicting the same label.

Finding the orthogonal classifier is no longer straightforward in the non-linear case. To rigorously define what we mean by the orthogonal classifier, we first introduce the notion of mutually orthogonal random variables, which correspond to (conditionally) independent latent variables mapped to observations through a diffeomorphism (or a bijection if discrete). Each r.v. is predictive of the label but represents complementary information. Indeed, we show that the orthogonal random variable maximizes the conditional mutual information with the label given the principal counterpart, subject to an independence constraint that ensures complementarity. Our search for the orthogonal classifier can be framed as follows: given a principal classifier $w_1$ using some unknown principal r.v. for prediction, how do we find its orthogonal classifier $w_2$ relying solely on orthogonal random variables?
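To illustrate the linear case above numerically, here is a minimal sketch; the Gaussian parameters and names (mu0, mu1, theta1) are illustrative assumptions, not the exact configuration of Fig. 1.

```python
# A hedged sketch of the linear case: two linear classifiers whose decision
# boundaries are orthogonal (theta1^T theta2 = 0), on synthetic Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 1.0])   # class-conditional means
y = rng.integers(0, 2, 5000)                            # P_Y = Bernoulli(0.5)
x = np.where(y[:, None] == 1, mu1, mu0) + rng.normal(size=(5000, 2))

theta1 = mu1 - mu0                          # principal direction: w1 = Pr(Y=1 | theta1^T x)
theta2 = np.array([-theta1[1], theta1[0]])  # any theta2 with theta1^T theta2 = 0

def w(theta, x):
    """Logistic model of Pr(Y = 1 | theta^T x)."""
    return 1.0 / (1.0 + np.exp(-(x @ theta)))

print(theta1 @ theta2)                     # 0: the two decision boundaries are orthogonal
print((w(theta1, x).round() == y).mean())  # w1 is discriminative here
# Any w2 built on theta2 belongs to the orthogonal set W2; whether it is also
# predictive depends on the class-conditional distributions, as in Fig. 1.
```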
The solution to this problem, which we call classifier orthogonalization, turns out to be surprisingly simple. In addition to the principal classifier, we assume access to a full classifier $w_x$ that predicts the same label based on all the available information, implicitly relying on both principal and orthogonal latent variables. The full classifier can be trained normally, absent of constraints¹. We can then effectively subtract the contribution of $w_1$ from the full classifier to obtain the orthogonal classifier $w_2$, which we denote as $w_2 = w_x \setminus w_1$. The advantage of this construction is that we do not need to explicitly identify the underlying orthogonal variables; it suffices to operate only on the level of classifier predictions.

¹ The full classifier may fail to depend on all the features, e.g., due to simplicity bias (Shah et al., 2020).

[Figure 1: An illustrative example of the orthogonal classifier in a linear case. Four panels over the $(x_1, x_2)$ plane: (a) the data distribution $p_{X|Y}$ of the two classes; (b, c, d) the probability of class 1 predicted by the full/principal/orthogonal classifiers $w_x(x)$, $w_1(x)$, $w_2(x)$. Red and blue colors indicate a probability close to 1 or 0. White indicates regions with probability close to 0.5, i.e., the classifiers' decision boundaries. Clearly, $w_1$ and $w_2$ have orthogonal decision boundaries.]

We provide several use cases for the orthogonal classifier, either as a predictor or as a discriminator. As a predictor, the orthogonal classifier's predictions are invariant to the principal sensitive r.v., thus ensuring fairness. As a discriminator, the orthogonal classifier enforces a partial alignment of distributions, allowing changes in the principal direction. We demonstrate the value of such discriminators in 1) controlled style transfer, where the source and target domains differ in multiple aspects but we only wish to align domain A's style to domain B's, leaving other aspects intact; and 2) domain adaptation with label shift, where we align feature distributions between the source and target domains while allowing shifts in label proportions. Our results show that this simple method is on par with the state-of-the-art methods in each task.

2 NOTATIONS AND DEFINITION

Symbols. We use uppercase letters to denote random variables (e.g., data $X$, label $Y$), lowercase letters to denote the corresponding samples, and calligraphic letters to denote the sample spaces of random variables, e.g., the data sample space $\mathcal{X}$. We focus on the setting where the label space $\mathcal{Y}$ is discrete, i.e., $\mathcal{Y} = \{1, \dots, C\}$, and denote the $(C-1)$-dimensional probability simplex as $\Delta^C$. A classifier $w: \mathcal{X} \to \Delta^C$ is a mapping from the sample space to the simplex. Its $y$-th dimension $w(x)_y$ denotes the predicted probability of label $y$ for sample $x$.

Distributions. For random variables $A, B$, we use the notation $p_A, p_{A|B}, p_{AB}$ to denote the marginal/conditional/joint distributions, i.e., $p_A(a) = p(A = a)$, $p_{A|B}(a|b) = p(A = a \mid B = b)$, and $p_{AB}(a, b) = p(A = a, B = b)$. Sometimes, for simplicity, we omit the subscript when there is no ambiguity, e.g., $p(a|b)$ is an abbreviation for $p_{A|B}(a|b)$.

We begin by defining the notion of an orthogonal random variable. We consider continuous $X, Z_1, Z_2$ and assume their supports are manifolds diffeomorphic to Euclidean space. The probability density functions (PDFs) are in $C^1$.
Given a joint distribution $p_{XY}$, we define the orthogonal random variable as follows:

Definition 1 (Orthogonal random variables). We say $Z_1$ and $Z_2$ are orthogonal random variables w.r.t. $p_{XY}$ if they satisfy the following properties: (i) There exists a diffeomorphism $f: \mathcal{Z}_1 \times \mathcal{Z}_2 \to \mathcal{X}$ such that $f(Z_1, Z_2) = X$. (ii) $Z_1$ and $Z_2$ are statistically independent given $Y$, i.e., $Z_1 \perp Z_2 \mid Y$.

The orthogonality relation is symmetric by definition. Note that the orthogonal pair perfectly reconstructs the observations via the diffeomorphism $f$; as random variables they are also sampled independently from the class-conditional distributions $p(Z_1|Y)$ and $p(Z_2|Y)$. For example, we can regard foreground objects and background scenes in natural images as mutually orthogonal random variables.

Remark. The definition of orthogonality can be developed similarly for discrete variables and discrete-continuous mixtures. For discrete variables, for example, we can replace the requirement of a diffeomorphism with a bijection.

Since the diffeomorphism $f$ is invertible, we can use $z_1: \mathcal{X} \to \mathcal{Z}_1$ and $z_2: \mathcal{X} \to \mathcal{Z}_2$ to denote the two parts of the inverse mapping, so that $Z_1 = z_1(X)$ and $Z_2 = z_2(X)$. Note that, for a given joint distribution $p_{XY}$, the decomposition into orthogonal random variables is not unique: there are multiple pairs of random variables that represent valid mutually orthogonal latents of the data.

We can further justify our definition of orthogonality from an information-theoretic perspective by showing that the choice of $z_2$ attains the maximum of the following constrained optimization problem.

Proposition 1. Suppose the orthogonal r.v. of $z_1(X)$ w.r.t. $p_{XY}$ exists and is denoted as $z_2(X)$. Then $z(X) = z_2(X)$ is a maximizer of $I(z(X); Y \mid z_1(X))$ subject to $I(z(X); z_1(X) \mid Y) = 0$.

We defer the proof to Appendix B.1. Proposition 1 shows that the orthogonal random variable maximizes the additional information about the label that we can obtain from $X$ while remaining conditionally independent of the principal random variable. This ensures complementarity in predicting the label.

3 CONSTRUCTING THE ORTHOGONAL CLASSIFIER

Let $Z_1 = z_1(X)$ and $Z_2 = z_2(X)$ be mutually orthogonal random variables w.r.t. $p_{XY}$. We call $Z_1$ the principal variable and $Z_2$ the orthogonal variable. In this section, we describe how we can construct the Bayes optimal classifier operating on features $Z_2$ from the Bayes optimal classifier relying on $Z_1$. We formally refer to the classifiers of interest as: (1) the principal classifier $w_1(x)_y = p(Y = y \mid Z_1 = z_1(x))$; (2) the orthogonal classifier $w_2(x)_y = p(Y = y \mid Z_2 = z_2(x))$; (3) the full classifier $w_x(x)_y = p(Y = y \mid X = x)$.

3.1 CLASSIFIER ORTHOGONALIZATION

Our key idea relies on the bijection between density ratios and Bayes optimal classifiers (Sugiyama et al., 2012). Specifically, the ratio of the densities $p_{X|Y}(x|i)$ and $p_{X|Y}(x|j)$, assigned to an arbitrary point $x$, can be represented by the Bayes optimal classifier $w(x)_i = \Pr(Y = i \mid x)$ as

$$\frac{p_{X|Y}(x|i)}{p_{X|Y}(x|j)} = \frac{p_Y(j)\, w(x)_i}{p_Y(i)\, w(x)_j}.$$

Similarly, the principal classifier $w_1$ gives us the associated density ratios of the class-conditional distributions over $Z_1$: for any $i, j \in \mathcal{Y}$,

$$\frac{p_{Z_1|Y}(z_1(x)|i)}{p_{Z_1|Y}(z_1(x)|j)} = \frac{p_Y(j)\, w_1(x)_i}{p_Y(i)\, w_1(x)_j}.$$

These can be combined to obtain density ratios of the class-conditional distributions $p_{Z_2|Y}$ and subsequently calculate the orthogonal classifier $w_2$.
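As a quick numeric check of the classifier-to-density-ratio identity above, before the change of variables below, here is a hedged sketch on a 1-D Gaussian toy model; the setup and names are assumptions for illustration.

```python
# Recover a class-conditional density ratio from the Bayes-optimal classifier,
# assuming a 1-D two-Gaussian model with a uniform label prior p_Y.
import numpy as np
from scipy.stats import norm

mu = {0: -1.0, 1: 1.0}      # class-conditional means, unit variance
x = 0.3

true_ratio = norm.pdf(x, mu[1], 1) / norm.pdf(x, mu[0], 1)   # p(x|Y=1) / p(x|Y=0)

# Bayes-optimal posterior w(x)_1 = Pr(Y=1 | x) under this model:
w1_prob = norm.pdf(x, mu[1], 1) / (norm.pdf(x, mu[0], 1) + norm.pdf(x, mu[1], 1))
est_ratio = (0.5 * w1_prob) / (0.5 * (1 - w1_prob))  # p_Y(j) w(x)_i / (p_Y(i) w(x)_j)
print(true_ratio, est_ratio)                         # the two values agree
```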
We additionally rely on the fact that the diffeomorphism $f$ permits us to change variables between $x$ and $(z_1, z_2)$:

$$p_{X|Y}(x|i) = p_{Z_1|Y}(z_1|i)\, p_{Z_2|Y}(z_2|i)\, \mathrm{vol}\, J_f(z_1, z_2),$$

where $\mathrm{vol}\, J_f$ is the volume of the Jacobian (Berger et al., 1987) of the diffeomorphism. Taken together,

$$\frac{p_{Z_2|Y}(z_2|i)}{p_{Z_2|Y}(z_2|j)} = \frac{p_{Z_1|Y}(z_1|i)\, p_{Z_2|Y}(z_2|i)\, \mathrm{vol}\, J_f(z_1, z_2)}{p_{Z_1|Y}(z_1|j)\, p_{Z_2|Y}(z_2|j)\, \mathrm{vol}\, J_f(z_1, z_2)} \Big/ \frac{p_{Z_1|Y}(z_1|i)}{p_{Z_1|Y}(z_1|j)} = \frac{w_x(x)_i}{w_x(x)_j} \Big/ \frac{w_1(x)_i}{w_1(x)_j}.$$

Note that since the diffeomorphism $f$ is shared by all classes, the Jacobian is the same for all label-conditioned distributions on $Z_1, Z_2$; hence the Jacobian volume terms cancel in the above equation. We can then finally work backwards from the density ratios of $p_{Z_2|Y}$ to the orthogonal classifier:

$$\Pr(Y = i \mid Z_2 = z_2(x)) = \frac{\Pr(Y = i)\, w_x(x)_i / w_1(x)_i}{\sum_j \Pr(Y = j)\, w_x(x)_j / w_1(x)_j} \qquad (2)$$

We call this procedure classifier orthogonalization since it adjusts the full classifier $w_x$ to be orthogonal to the principal classifier $w_1$. The validity of this procedure requires overlapping supports of the class-conditional distributions, which ensures that the classifier outputs $w_x(x)_i, w_1(x)_i$ remain non-zero for all $x \in \mathcal{X}, i \in \mathcal{Y}$.

Empirically, we usually have access to a dataset $D = \{(x_t, y_t)\}_{t=1}^n$ with $n$ i.i.d. samples from the joint distribution $p_{XY}$. To obtain the orthogonal classifier, we first train the full classifier $\hat{w}_x$ on the dataset $D$. We can then follow classifier orthogonalization to get an empirical orthogonal classifier, denoted $\hat{w}_2 = \hat{w}_x \setminus w_1$. We use the symbol $\setminus$ to emphasize that the orthogonal classifier uses complementary information relative to $z_1$. Algorithm 1 summarizes the construction of the orthogonal classifier.

Algorithm 1 Classifier Orthogonalization
Input: principal classifier $w_1$, dataset $D$.
1. Train an empirical full classifier $\hat{w}_x$ on $D$ by empirical risk minimization.
2. Construct an orthogonal classifier $\hat{w}_x \setminus w_1$ via classifier orthogonalization (Eq. (2)).
Return: the empirical orthogonal classifier $\hat{w}_2 = \hat{w}_x \setminus w_1$.

Generalization bound. Since $\hat{w}_x$ is trained on a finite dataset, we consider the generalization bound of the constructed orthogonal classifier. We denote the population risk as $R(w) = \mathbb{E}_{p_{XY}(x,y)}[-\log w(x)_y]$ and the empirical risk as $\hat{R}(w) = -\frac{1}{|D|}\sum_{(x_i, y_i)\in D} \log w(x_i)_{y_i}$. For a function family $\mathcal{W}$ whose elements map $\mathcal{X}$ to the simplex $\Delta^C$, we define $\hat{w}_x = \arg\inf_{w_x\in\mathcal{W}} \hat{R}(w_x)$ and $w_x^* = \arg\inf_{w_x\in\mathcal{W}} R(w_x)$. We further denote the Rademacher complexity of the function family $\mathcal{W}$ with sample size $|D|$ as $\mathfrak{R}_{|D|}(\mathcal{W})$.

Theorem 1. Assume $p_Y$ is the uniform distribution, $w_x \in \mathcal{W}$ takes values in $(m, 1-m)^C$ with $m \in (0, \frac{1}{2})$, and $1/p_{X|Y}(x|y) \in (0, \gamma) \subset (0, +\infty)$ holds for all $x \in \mathcal{X}, y \in \mathcal{Y}$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:

$$R(\hat{w}_x \setminus w_1) - R(w_x^* \setminus w_1) \le (1 + \gamma)\left(2\mathfrak{R}_{|D|}(\mathcal{W}) + \sqrt{\frac{2\log(1/\delta)}{|D|}}\right).$$

Theorem 1 shows that the population risk of the empirical orthogonal classifier in Algorithm 1 is close to the optimal risk if the maximum value of the reciprocal of the class-conditional densities $1/p_{X|Y}(x|y)$ and the Rademacher term are small. Typically, the Rademacher complexity term satisfies $\mathfrak{R}_{|D|}(\mathcal{W}) = O(|D|^{-\frac{1}{2}})$ (Bartlett & Mendelson, 2001; Gao & Zhou, 2016).

We note that the empirical full classifier may fail in specific ways that are harmful for our purposes. For example, the classifier may not rely on all the key features due to simplicity bias, as demonstrated by Shah et al. (2020). There are several ways this effect can be mitigated, including Ross et al. (2020) and Teney et al. (2021).
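Since Eq. (2) only combines classifier outputs, the construction takes a few lines of code. A minimal sketch, assuming trained networks with softmax outputs and a known label prior; the function and variable names are illustrative.

```python
# Hedged sketch of classifier orthogonalization (Eq. (2)) on top of two
# probability-valued classifiers: w_x (full) and w_1 (principal).
import torch

def orthogonalize(w_full_probs: torch.Tensor,
                  w_principal_probs: torch.Tensor,
                  p_y: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """Return (w_x \ w_1)(x) for a batch.

    w_full_probs:      (batch, C) probabilities from the full classifier w_x.
    w_principal_probs: (batch, C) probabilities from the principal classifier w_1.
    p_y:               (C,) label prior Pr(Y = y).
    """
    ratio = p_y * w_full_probs / (w_principal_probs + eps)  # Pr(Y=i) w_x(x)_i / w_1(x)_i
    return ratio / ratio.sum(dim=1, keepdim=True)           # normalize over classes

# Usage with trained nets wx, w1 (not shown) and a uniform prior over C classes:
# probs2 = orthogonalize(wx(x).softmax(1), w1(x).softmax(1), torch.full((C,), 1 / C))
```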
3.2 ALTERNATIVE METHOD: IMPORTANCE SAMPLING

An alternative way to obtain the orthogonal classifier is via importance sampling. For each sample point $(x, y)$, we construct its importance $\varphi(x, y) = \frac{p_{Z_1}(z_1(x))}{p_{Z_1|Y}(z_1(x)\mid y)}$. Via Bayes' rule, the importance $\varphi(x, y)$ can be calculated from the principal classifier as $\varphi(x, y) = \frac{p_Y(y)}{w_1(x)_y}$. We define the importance sampling (IS) objective as

$$L_{IS}(w) = \mathbb{E}_{x, y \sim p_{XY}}\left[\varphi(x, y)\log w(x)_y\right].$$

It can be shown that the orthogonal classifier $w_2$ maximizes the IS objective, i.e., $w_2 = \arg\max_w L_{IS}(w)$.

However, the importance sampling method has an Achilles' heel: variance. A few bad samples with large weights can drastically skew the estimate, which is problematic for mini-batch learning in practice. We provide an analysis of the variance of the IS objective. It shows that the variance increases with the divergences between $Z_1$'s marginal and its label-conditioned marginals, $\{D_f(p_{Z_1} \,\|\, p_{Z_1|Y=y})\}_{y=1}^C$, even when the learned classifier $w$ approaches the optimal classifier, i.e., $w \to w_2$. Let $\hat{L}^n_{IS}(w) = \frac{1}{n}\sum_{t=1}^n \varphi(x_t, y_t)\log w(x_t)_{y_t}$ be the empirical IS objective estimated with $n$ i.i.d. samples. Clearly, $\mathrm{Var}(\hat{L}^n_{IS}(w)) = \frac{1}{n}\mathrm{Var}(\hat{L}^1_{IS}(w))$, while $\mathrm{Var}(\hat{L}^1_{IS}(w))$ at $w = w_2$ is

$$\mathrm{Var}(\hat{L}^1_{IS}(w_2)) = \mathbb{E}_Y\left[\left(D_f(p_{Z_1}\,\|\,p_{Z_1|Y=y}) + 1\right)\mathbb{E}_{Z_2|Y=y}\log^2 p_{Y|Z_2}(y|z_2)\right] - L_{IS}(w_2)^2,$$

where $D_f$ is the Pearson $\chi^2$-divergence. The expression indicates that the variance grows linearly with the expected divergence. In contrast, the divergences have little effect when training the empirical full classifier in classifier orthogonalization. We provide further experimental results in Section 4.1.2 demonstrating that classifier orthogonalization has better stability than the IS objective as divergences and learning rates increase.

4 ORTHOGONAL CLASSIFIER FOR CONTROLLED STYLE TRANSFER

In style transfer, we have two domains $A, B$ with distributions $P_{XY}, Q_{XY}$. We use a binary label $Y$ to indicate the domain of data $X$ (0 for domain A, 1 for domain B). Assume $Z_1$ and $Z_2$ are mutually orthogonal variables w.r.t. both $P_{XY}$ and $Q_{XY}$. The goal is to conduct controlled style transfer between the two domains. By controlled style transfer, we mean transferring domain A's data to domain B only via making changes to the partial latent ($Z_1$) while not touching the other latent ($Z_2$). Mathematically, we aim to transfer the latent distribution from $P_{Z_1}P_{Z_2}$ to $Q_{Z_1}P_{Z_2}$. This task cannot be achieved by traditional style transfer algorithms such as CycleGAN (Zhu et al., 2017), since they directly transfer the data distribution from $P_X$ to $Q_X$, or equivalently, from the perspective of latent distributions, from $P_{Z_1}P_{Z_2}$ to $Q_{Z_1}Q_{Z_2}$. Below we show that the orthogonal classifier $w_x \setminus w_1$ enables such controlled style transfer.

Our strategy is to plug the orthogonal classifier into the objective of CycleGAN. The original CycleGAN objective defines a min-max game between two generator/discriminator pairs, $G_{AB}/D_{AB}$ and $G_{BA}/D_{BA}$; $G_{AB}$ transforms the images from domain A to domain B.

[Figure 2: (a, b) Test loss versus the log expected divergence $\log \mathbb{E}_y[D_f(p_{Z_1}\|p_{Z_1|Y=y}) + 1]$ and versus the learning rate (log scale), for IS, $w_x\setminus w_1$, and Oracle. (c) Histogram of predicted probabilities for $w_x\setminus w_1$ and IS; the two red lines mark the ground-truth probabilities $P(Y = 1\mid \text{Brown background})$ and $P(Y = 1\mid \text{Green background})$.]
We use $\tilde{P}$ to denote the generative distribution, i.e., $G_{AB}(X) \sim \tilde{P}$ for $X \sim P$. Similarly, $\tilde{Q}$ denotes the generative distribution of the generator $G_{BA}$. The minimax game of CycleGAN is given by

$$\min_{G_{AB}, G_{BA}} \max_{D_{AB}, D_{BA}} L^{AB}_{GAN}(G_{AB}, D_{AB}) + L^{BA}_{GAN}(G_{BA}, D_{BA}) + L_{cyc}(G_{AB}, G_{BA}) \qquad (3)$$

where the GAN losses $L^{AB}_{GAN}, L^{BA}_{GAN}$ minimize the divergence between the generative distributions and the targets, and the cycle consistency loss $L_{cyc}$ is used to regularize the generators (Zhu et al., 2017). Concretely, the GAN loss $L^{AB}_{GAN}$ is defined as

$$L^{AB}_{GAN}(G_{AB}, D_{AB}) = \mathbb{E}_{x\sim Q}[\log D_{AB}(x)] + \mathbb{E}_{x\sim \tilde{P}}[\log(1 - D_{AB}(x))].$$

Now we tilt the GAN losses in the CycleGAN objective to guide the generators to perform controlled transfer. Specifically, we replace $L^{AB}_{GAN}$ in Eq. (3) with the orthogonal GAN loss $L^{AB}_{OGAN}$ when updating the generator $G_{AB}$:

$$L^{AB}_{OGAN}(G_{AB}, D_{AB}) = \mathbb{E}_{x\sim\tilde{P}}\left[\log\left(1 - \phi(D_{AB}(x), r(x))\right)\right]$$

where $\phi(D_{AB}(x), r(x)) = \frac{D_{AB}(x)\, r(x)}{(1 - D_{AB}(x)) + D_{AB}(x)\, r(x)}$ and $r(x) = \frac{(w_x\setminus w_1)(x)_0}{(w_x\setminus w_1)(x)_1}$. Consider an extreme case where we allow the model to change all latents, i.e., $z_1(x) = x$. In that case $L_{OGAN}$ degenerates to the original $L_{GAN}$, since $w_x\setminus w_1 \equiv \frac{1}{2}$ and $\phi(D_{AB}(x), r(x)) = D_{AB}(x)$. The other orthogonal GAN loss $L^{BA}_{OGAN}$ can be derived similarly.

For a given generator $G_{AB}$, the discriminator's optimum is achieved at $D^*_{AB}(x) = Q(x)/(\tilde{P}(x) + Q(x))$. Assuming the discriminator is always trained to optimality, the generator $G_{AB}$ can be viewed as optimizing the following virtual objective:

$$\tilde{L}^{AB}_{OGAN}(G_{AB}) = L^{AB}_{OGAN}(G_{AB}, D^*_{AB}) = \mathbb{E}_{x\sim\tilde{P}}\log\frac{\tilde{P}(x)}{\tilde{P}(x) + Q(x)\, r(x)}.$$

The optimal generator under the new objective satisfies the following property.

Proposition 2. The global minimum of $\tilde{L}^{AB}_{OGAN}(G_{AB})$ is achieved if and only if $\tilde{P}_{Z_1, Z_2}(z_1, z_2) = Q_{Z_1}(z_1) P_{Z_2}(z_2)$.

We defer the proof to Appendix B.4. The proposition states that the new objective $\tilde{L}^{AB}_{OGAN}$ converts the original global minimum $Q_{Z_1}Q_{Z_2}$ of $L^{AB}_{GAN}$ into the desired one, $Q_{Z_1}P_{Z_2}$. The symmetric statement holds for $\tilde{L}^{BA}_{OGAN}(G_{BA})$.

4.1 EXPERIMENTS

4.1.1 SETUP

Datasets. (a) CMNIST: We construct the C(olors)MNIST dataset based on MNIST digits (LeCun & Cortes, 2005). CMNIST has two domains and 60,000 images of size (3, 28, 28). The majority of the images in each domain have a digit color and a background color corresponding to the domain: domain A has red digits on green backgrounds, and domain B has blue digits on brown backgrounds. We define two bias degrees $\lambda_d, \lambda_b$ to control the probability that an image has the domain-associated colors; e.g., for an image in domain A, its digit is set to red with probability $\lambda_d$, and otherwise its digit color is randomly red or blue. The background colors are determined similarly with the parameter $\lambda_b$ (a sketch of this colorization follows below). In our experiments, we set $\lambda_d = 0.9$ and $\lambda_b = 0.8$. Our goal is to transfer the digit color ($Z_1$) while leaving the background color ($Z_2$) invariant.

(b) CelebA-GH: We construct the CelebA-G(ender)H(air) dataset based on the gender and hair color attributes in CelebA (Liu et al., 2015). CelebA-GH consists of 110,919 images resized to (3, 128, 128). In CelebA-GH, domain A is non-blond-haired males and domain B is blond-haired females. Our goal is to transfer the facial characteristics of gender ($Z_1$) while keeping the hair color ($Z_2$) intact.

Table 1: CMNIST

Method                          Z1 acc.   Z2 acc.
Vanilla                         90.2      14.3
$w_x\setminus w_1$ (MLDG)       94.9      91.1
$w_x\setminus w_1$ (TRM)        89.2      96.2
$w_x\setminus w_1$ (Fish)       93.9      98.0
IS (Oracle)                     90.0      100
$w_x\setminus w_1$ (Oracle)     94.9      100

Table 2: CelebA-GH

Method                          Z1 acc.   Z2 acc.   FID
Vanilla                         88.4      16.8      38.5
$w_x\setminus w_1$ (MLDG)       90.9      53.0      46.2
$w_x\setminus w_1$ (TRM)        92.2      56.5      39.8
$w_x\setminus w_1$ (Fish)       88.8      42.9      43.4
IS (Oracle)                     89.1      40.6      44.3
$w_x\setminus w_1$ (Oracle)     93.7      57.4      39.7
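As referenced in (a), here is a hedged sketch of the bias-degree mechanism for CMNIST-style colorization; the RGB values and function names are illustrative assumptions, not the authors' exact recipe.

```python
# Color an MNIST digit/background according to bias degrees lambda_d / lambda_b.
import numpy as np

RED, GREEN, BLUE, BROWN = (1, 0, 0), (0, 1, 0), (0, 0, 1), (0.6, 0.3, 0.0)

def colorize(digit: np.ndarray, domain: int, lambda_d=0.9, lambda_b=0.8,
             rng=np.random.default_rng()) -> np.ndarray:
    """digit: (28, 28) grayscale in [0, 1]; returns a (3, 28, 28) colored image."""
    # Domain-associated colors: A = red digit / green bg, B = blue digit / brown bg.
    dig_color, bg_color = (RED, BLUE)[domain], (GREEN, BROWN)[domain]
    # With probability 1 - lambda, the color is instead drawn uniformly at random.
    if rng.random() > lambda_d:
        dig_color = (RED, BLUE)[rng.integers(2)]
    if rng.random() > lambda_b:
        bg_color = (GREEN, BROWN)[rng.integers(2)]
    mask = digit[None]                                   # (1, 28, 28) digit intensity
    return (mask * np.array(dig_color)[:, None, None]
            + (1 - mask) * np.array(bg_color)[:, None, None]).astype(np.float32)
```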
Models. We compare variants of the orthogonal classifier to the vanilla CycleGAN (Vanilla) and the importance sampling objective (IS). We consider two ways of obtaining the $w_1$ classifier: (a) Oracle: The principal classifier $w_1$ is trained on an oracle dataset where $Z_1$ is the only discriminative direction, e.g., setting the bias degrees to $\lambda_d = 0.9$ and $\lambda_b = 0$ in CMNIST so that only the digit colors vary across domains. (b) Domain generalization: Domain generalization algorithms aim to learn the prediction mechanism that is invariant across environments and thus generalizable to unseen environments (Blanchard et al., 2011; Arjovsky et al., 2019). The variability of environments is considered nuisance variation. We construct environments such that only the prediction mechanism $Z_1 \to Y$ is invariant; we defer details of the environments to Appendix E.1. We use three domain generalization algorithms, Fish (Shi et al., 2021), TRM (Xu & Jaakkola, 2021), and MLDG (Li et al., 2018), to obtain $w_1$. We indicate the approach in parentheses after $w_x\setminus w_1$ and IS; e.g., $w_x\setminus w_1$ (Oracle) is the orthogonal classifier with $w_1$ trained on the oracle dataset.

Metrics. We use three metrics for quantitative evaluation: 1) Z1 accuracy: the success rate of transferring an image's latent $z_1$ from domain A to domain B; 2) Z2 accuracy: the percentage of transferred images whose latent $z_2$ is unchanged; 3) FID score: a standard metric of image quality (Heusel et al., 2017). We only report FID scores on the CelebA-GH dataset since it is not common to compute FID on MNIST images. The Z1 and Z2 accuracies are measured by two oracle classifiers that output an image's latents $z_1$ and $z_2$ (Appendix E.1.3); a sketch of these metrics follows below. For more details of the datasets, models, and training procedure, please refer to Appendix E.

4.1.2 RESULTS

Comparison to IS. In Section 3.2, we showed that IS suffers from high variance when the divergences between the marginal and the label-conditioned latent distributions are large. We provide further empirical evidence that our classifier orthogonalization method is more robust than IS. Fig. 2(a) shows that the test loss of IS increases dramatically with the divergences. Fig. 2(b) shows that the test loss of IS grows rapidly as the learning rate is enlarged, which corroborates the observation that gradient variance is more detrimental to a model's generalization at larger learning rates (Wang et al., 2013). In contrast, the test loss of $w_x\setminus w_1$ remains stable across varying divergences and learning rates. In Fig. 2(c), we visualize the histograms of predicted probabilities of $w_x\setminus w_1$ (light color) and IS (dark color). We observe that the predicted probabilities of $w_x\setminus w_1$ concentrate better around the ground-truth probabilities (red lines).

Main result. As shown in Tables 1 and 2, adding an orthogonal classifier to the vanilla CycleGAN significantly improves its Z2 accuracy (from 14 to 90+ on CMNIST and from 16 to 40+ on CelebA-GH) while incurring a slight loss of image quality. We observe that the oracle version of the orthogonal classifier ($w_x\setminus w_1$ (Oracle)) achieves the best Z2 accuracy on both datasets. In addition, classifier orthogonalization is compatible with domain generalization algorithms when the prediction mechanism $Z_1 \to Y$ is invariant across the collected environments. In Fig. 3, we visualize the transferred samples from domain A. We observe that vanilla CycleGAN changes the background colors/hair colors along with the digit colors/facial characteristics on CMNIST and CelebA-GH. In contrast, orthogonal classifiers better preserve the orthogonal aspects $Z_2$ of the input images.
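As referenced under Metrics, here is a hedged sketch of the Z1/Z2 accuracy computation from Appendix E.1.3, assuming oracle classifiers $w^*_1, w^*_2$ and a transfer model $G$ with the obvious signatures; the names are illustrative.

```python
# Z1 acc: fraction of samples whose oracle-Z1 prediction flips under G.
# Z2 acc: fraction of samples whose oracle-Z2 prediction is preserved under G.
import torch

@torch.no_grad()
def z_accuracies(G, w1_star, w2_star, loader):
    flipped, kept, total = 0, 0, 0
    for x, _ in loader:
        gx = G(x)
        flipped += (w1_star(x).argmax(1) != w1_star(gx).argmax(1)).sum().item()
        kept += (w2_star(x).argmax(1) == w2_star(gx).argmax(1)).sum().item()
        total += x.size(0)
    return flipped / total, kept / total   # (Z1 accuracy, Z2 accuracy)
```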
We also provide visualizations for domain B in Appendix C.

[Figure 3: Style transfer samples on domain A of (a) CMNIST and (b) CelebA-GH. We visualize the inputs (top row) and the corresponding transferred samples of different methods (Vanilla, Oracle).]

5 ORTHOGONAL CLASSIFIER FOR DOMAIN ADAPTATION

In unsupervised domain adaptation, we have two domains, source and target, with data distributions $p_s, p_t$. Each sample comes with three variables: data $X$, task label $Y$, and domain label $U$. We assume $U$ is uniformly distributed in $\{0\ (\text{source}), 1\ (\text{target})\}$. The task label is only accessible in the source domain. Consider learning a model $h = g \circ f$ that is the composite of an encoder $f: \mathcal{X} \to \mathcal{Z}$ and a predictor $g: \mathcal{Z} \to \mathcal{Y}$. A common practice in domain adaptation is to match the two domains' feature distributions $p_s(f(X)), p_t(f(X))$ via domain adversarial training (Ganin & Lempitsky, 2015; Long et al., 2018; Shu et al., 2018). The encoder $f$ is optimized to extract useful features for the predictor while simultaneously bridging the domain gap via the following domain adversarial objective:

$$\min_f \max_D L(f, D) = \mathbb{E}_{x\sim p_s}[\log D(f(x))] + \mathbb{E}_{x\sim p_t}[\log(1 - D(f(x)))] \qquad (4)$$

where the discriminator $D: \mathcal{Z} \to [0, 1]$ distinguishes features from the two domains. The equilibrium of the objective is achieved when the encoder perfectly aligns the two domains, i.e., $p_s(f(x)) = p_t(f(x))$. We highlight that such an equilibrium is not desirable when the target domain has a shifted label distribution (Azizzadenesheli et al., 2019; des Combes et al., 2020). Instead, when label shift occurs, it is preferable to align the label-conditioned feature distributions, i.e., $p_s(f(x)\mid y) = p_t(f(x)\mid y)$ for all $y \in \mathcal{Y}$.

Now we show that our classifier orthogonalization technique can be applied to the discriminator to make it orthogonal to the task label. Specifically, consider the original discriminator as the full classifier $w_x$: for a given $f$, $w_x(x)_0 = \arg\max_D L(f, D) = \frac{p_s(f(x))}{p_s(f(x)) + p_t(f(x))}$. We then construct the principal classifier $w_1$ that discriminates the domain purely via the task label $Y$, i.e., $w_1(x)_0 = \Pr(U = 0 \mid Y = y(x))$. Note that here we assume the task label $Y$ can be determined by the data $X$, i.e., $Y = y(X)$. We focus on the case where $Y$ is discrete, so $w_1$ can be computed directly via frequency counts. We propose to train the encoder $f$ with the orthogonalized discriminator $w_x\setminus w_1$:

$$\min_f L(f, (w_x\setminus w_1)(\cdot)_0) = \mathbb{E}_{x\sim p_s}[\log (w_x\setminus w_1)(x)_0] + \mathbb{E}_{x\sim p_t}[\log(1 - (w_x\setminus w_1)(x)_0)] \qquad (5)$$

Proposition 3. Suppose there exists a random variable $Z_2 = z_2(X)$ orthogonal to $y(X) = Y$ w.r.t. $p(f(X), U)$. Then $f$ achieves the global optimum if and only if it aligns all label-conditioned feature distributions, i.e., $p_s(f(x)\mid y) = p_t(f(x)\mid y)$ for all $y \in \mathcal{Y}$.

Note that in practice we have access to neither the target domain label prior $p_t(Y)$ nor the labels $y(x)$ of target domain data $x \sim p_t$. Thus we use the pseudo-label $\hat{y}$ as a surrogate to construct the principal classifier, where $\hat{y}(x) = \arg\max_y h(x)_y$ is generated by our model $h$. A sketch of the orthogonalized discriminator objective follows below.

5.1 EXPERIMENTS

Models. We take a well-known domain adaptation algorithm, VADA (Shu et al., 2018), as our baseline. VADA is based on domain adversarial training combined with virtual adversarial training and entropy regularization. We show that utilizing the orthogonal classifier can improve its robustness to label shift.
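A minimal sketch of the orthogonalized discriminator term in Eq. (5), assuming a binary domain discriminator on features, pseudo-labels $\hat y$, and class frequency tables for $w_1$; all names are illustrative.

```python
import torch

def w1_source_prob(y_hat, ps_y, pt_y):
    """w_1(x)_0 = Pr(U=0 | Y = y(x)) from class frequencies (U uniform)."""
    return ps_y[y_hat] / (ps_y[y_hat] + pt_y[y_hat])

def orthogonal_disc_prob(d_prob, y_hat, ps_y, pt_y, eps=1e-8):
    """(w_x \ w_1)(x)_0 for the binary domain label, via Eq. (2)."""
    w1 = w1_source_prob(y_hat, ps_y, pt_y)
    r0 = d_prob / (w1 + eps)              # source-class ratio w_x(x)_0 / w_1(x)_0
    r1 = (1 - d_prob) / (1 - w1 + eps)    # target-class ratio w_x(x)_1 / w_1(x)_1
    return r0 / (r0 + r1)                 # the uniform prior over U cancels

def encoder_loss(d_src, yhat_src, d_tgt, yhat_tgt, ps_y, pt_y):
    """Eq. (5): the encoder f is trained to minimize this quantity."""
    p_src = orthogonal_disc_prob(d_src, yhat_src, ps_y, pt_y)
    p_tgt = orthogonal_disc_prob(d_tgt, yhat_tgt, ps_y, pt_y)
    return torch.log(p_src).mean() + torch.log(1 - p_tgt).mean()
```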
We compare it with two improvements: (1) VADA+$w_x\setminus w_1$, our method, which plugs the orthogonal classifier in as the discriminator in VADA; (2) VADA+IW, where we tailor the state-of-the-art method for label-shifted domain adaptation, importance-weighted domain adaptation (des Combes et al., 2020), and apply it to VADA. We also incorporate two recent domain adaptation algorithms for images, CyCADA (Hoffman et al., 2018) and ADDA (Tzeng et al., 2017), into the comparison.

Setup. We focus on visual domain adaptation and directly borrow the datasets, architectures, and domain setups from Shu et al. (2018). To add label shift between domains, we control the label distributions in the two domains. For the source domain, we sub-sample 70% of the data points from the first half of the classes and 30% from the second half; we reverse the sub-sampling ratios in the target domain. The label distribution remains the same across the target domain train and test sets. Please see Appendix E.2 for more details.

Results. Table 3 reports the test accuracy on seven domain adaptation tasks. We observe that VADA+$w_x\setminus w_1$ improves over VADA across all tasks and outperforms VADA+IW on five out of seven tasks. We find that VADA+IW performs worse than VADA on two tasks, MNIST→SVHN and MNIST→MNIST-M. Our hypothesis is that the domain gap between these datasets is large, hindering the estimation of the importance weights; hence VADA+IW is unable to adapt to the label shift appropriately. In addition, the results show that VADA+$w_x\setminus w_1$ outperforms ADDA on six out of seven tasks.

Table 3: Test accuracy on visual domain adaptation benchmarks

Source                       MNIST    SVHN   MNIST  DIGITS  SIGNS  CIFAR  STL
Target                       MNIST-M  MNIST  SVHN   SVHN    GTSRB  STL    CIFAR
Source-Only                  51.8     75.7   34.5   85.0    74.7   68.7   47.7
ADDA                         89.7     78.2   38.4   86.0    90.6   66.8   50.4
CyCADA                       -        82.8   39.6   -       -      -      -
VADA                         77.8     79.0   35.7   90.3    93.6   72.4   53.1
VADA + IW                    71.2     87.1   34.5   90.7    95.4   74.0   53.8
VADA + $w_x\setminus w_1$    79.1     88.0   40.5   90.6    95.2   74.5   54.1

6 ORTHOGONAL CLASSIFIER FOR FAIRNESS

We are given a dataset $D = \{(x_t, y_t, u_t)\}_{t=1}^n$ sampled i.i.d. from the distribution $p_{XYU}$, which contains the observations $x \in \mathcal{X}$, the sensitive attributes $u \in \mathcal{U}$, and the labels $y \in \mathcal{Y}$. Our goal is to learn a classifier that is accurate w.r.t. $y$ and fair w.r.t. $u$. We frame the fairness problem as finding the orthogonal classifier of a totally unfair classifier $w_1$ that uses only the sensitive attributes $u$ for prediction. We can obtain the unfair classifier $w_1$ directly from dataset statistics, i.e., $w_1(x)_y = p(Y = y \mid U = u(x))$. Below we show that the orthogonal classifier of the unfair classifier meets equalized odds, one metric for fairness evaluation.

Proposition 4. If the orthogonal random variable of $U = u(X)$ w.r.t. $p_{XY}$ exists, then the orthogonal classifier $w_x\setminus w_1$ satisfies the criterion of equalized odds.

We emphasize that, unlike existing algorithms for learning fair classifiers, our method does not require additional training: we obtain a fair classifier by orthogonalizing an existing model $w_x$, which is simply the vanilla model trained by ERM on the dataset $D$ (a sketch follows below).

6.1 EXPERIMENTS

Setup. We experiment on the UCI Adult dataset, which has gender as the sensitive attribute, and the UCI German credit dataset, which has age as the sensitive attribute (Zemel et al., 2013; Madras et al., 2018; Song et al., 2019).
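Since $w_1$ here is read directly off dataset statistics, the whole construction reduces to a frequency table plus Eq. (2). A minimal sketch, assuming a trained ERM model $w_x$ with probability outputs; the names and shapes are illustrative.

```python
import torch

def fit_unfair_w1(u: torch.Tensor, y: torch.Tensor, num_u: int, num_y: int):
    """Return a (num_u, num_y) table with w1[u, y] = p(Y = y | U = u)."""
    counts = torch.zeros(num_u, num_y)
    for ui, yi in zip(u.tolist(), y.tolist()):
        counts[ui, yi] += 1
    return counts / counts.sum(dim=1, keepdim=True)

def fair_predict(wx_probs, u, w1_table, p_y, eps=1e-8):
    """(w_x \ w_1)(x): orthogonalize the ERM model's probabilities via Eq. (2)."""
    ratio = p_y * wx_probs / (w1_table[u] + eps)   # Pr(Y=y) w_x(x)_y / w_1(x)_y
    return ratio / ratio.sum(dim=1, keepdim=True)
```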
We compare the orthogonal classifier $w_x\setminus w_1$ to three baselines: LAFTR (Madras et al., 2018), which proposes adversarial objective functions that upper-bound the unfairness metrics; L-MIFR (Song et al., 2019), which uses mutual information objectives to control the expressiveness-fairness trade-off; and ReBias (Bahng et al., 2020), which minimizes the Hilbert-Schmidt Independence Criterion (HSIC) between the model representations and biased representations; as well as the Vanilla classifier ($w_x$) trained by ERM. We employ two fairness metrics, the demographic parity distance ($\Delta_{DP}$) and the equalized odds distance ($\Delta_{EO}$), defined in Madras et al. (2018). We denote the penalty coefficient of the adversarial or de-biasing objective in all baselines as $\gamma$, whose value governs the trade-off between prediction accuracy and fairness. We borrow experiment configurations, such as the CNN architecture, from Song et al. (2019). Please refer to Appendix E.3 for more details.

Results. Tables 4 and 5 report the test accuracy, $\Delta_{DP}$, and $\Delta_{EO}$ on the Adult and German datasets. We observe that the orthogonal classifier decreases the unfairness of the Vanilla model and performs competitively with existing baselines. On the German dataset in particular, compared to LAFTR, our method has the same $\Delta_{EO}$ but better $\Delta_{DP}$ and test accuracy. It is surprising that our method outperforms LAFTR, even though it is not specially designed for fairness. Further, our method has the benefit of being training-free, which allows it to improve the fairness of any off-the-shelf classifier without additional training.

Table 4: Accuracy vs. Fairness (Adult)

Method               Acc.   $\Delta_{DP}$   $\Delta_{EO}$
Vanilla              84.5   0.19            0.20
LAFTR (γ = 0.1)      84.2   0.14            0.09
LAFTR (γ = 0.5)      84.0   0.12            0.07
L-MIFR (γ = 0.05)    81.6   0.04            0.15
L-MIFR (γ = 0.1)     82.0   0.06            0.16
ReBias (γ = 100)     84.3   0.15            0.11
ReBias (γ = 50)      84.4   0.17            0.16
$w_x\setminus w_1$   81.6   0.12            0.12

Table 5: Accuracy vs. Fairness (German)

Method               Acc.   $\Delta_{DP}$   $\Delta_{EO}$
Vanilla              76.0   0.19            0.33
LAFTR (γ = 0.1)      73.0   0.11            0.17
LAFTR (γ = 0.5)      72.7   0.11            0.19
L-MIFR (γ = 0.1)     75.8   0.10            0.21
L-MIFR (γ = 0.05)    75.6   0.08            0.18
ReBias (γ = 100)     73.0   0.07            0.17
ReBias (γ = 50)      75.0   0.10            0.20
$w_x\setminus w_1$   75.4   0.09            0.18

7 RELATED WORKS

Disentangled representation learning. Similar to orthogonal random variables, disentangled representations are also based on specific notions of feature independence. For example, Higgins et al. (2018) defines disentangled representations via equivalence and independence of group transformations, Kim & Mnih (2018) relates disentanglement to distribution factorization, and Shu et al. (2020) characterizes disentanglement through generator consistency and a notion of restrictiveness. Our work differs in two key respects. First, most definitions of disentanglement rely primarily on the bijection between latents and inputs (Shu et al., 2020), absent labels; in contrast, our orthogonal features must be conditionally independent given labels. Second, in our work the orthogonal features remain implicit and are used discriminatively in predictors. Several approaches aim to learn disentangled representations in an unsupervised manner (Chen et al., 2016; Higgins et al., 2017; Chen et al., 2018). However, Locatello et al. (2019) argues that unsupervised disentangled representation learning is impossible without a proper inductive bias.

Model de-biasing. A line of work focuses on preventing models from relying on dataset biases. Bahng et al. (2020) learns de-biased representations by imposing an HSIC penalty with biased representations, and Nam et al.
(2020) trains an unbiased model by up-weighting the samples with high loss under the biased model. Li et al. (2021) de-biases training data through data augmentation. However, these works lack a theoretical definition of the de-biased model in general cases and often require explicit dataset biases.

Density ratio estimation using a classifier. Using a binary classifier to estimate density ratios (Sugiyama et al., 2012) enjoys widespread attention in machine learning. Classifier-based density ratio estimates have been applied to Monte Carlo inference (Grover et al., 2019; Azadi et al., 2019), class imbalance (Byrd & Lipton, 2019), and domain adaptation (Azizzadenesheli et al., 2019).

Learning a set of diverse classifiers. Another line of work related to ours learns a collection of diverse classifiers through penalties on input gradients. Diversity here means that the classifiers rely on different sets of features. To encourage such diversity, Ross et al. (2018; 2020) propose a notion of local independence, which uses the cosine similarity between the input gradients of classifiers as a regularizer, while in Teney et al. (2021) the regularizer pertains to dot products. Ross et al. (2017) sequentially trains multiple classifiers to obtain qualitatively different decision boundaries. We defer further discussion of the advantages of classifier orthogonalization over existing methods to Appendix F.

8 CONCLUSION

We consider finding a discriminative direction that is orthogonal to a given principal classifier. The solution in the linear case is straightforward but does not generalize to the non-linear case. We define and investigate orthogonal random variables, and propose a simple but effective algorithm (classifier orthogonalization) to construct the orthogonal classifier, with both theoretical and empirical support. Empirically, we demonstrate that the orthogonal classifier enables controlled style transfer, improves existing alignment methods for domain adaptation, and yields a lower degree of unfairness as a predictor.

ACKNOWLEDGEMENTS

The work was partially supported by an MIT-DSTA Singapore project and by an MIT-IBM Grand Challenge grant. YX was partially supported by the HDTV Grand Alliance Fellowship. We would like to thank the anonymous reviewers for their valuable feedback.

REFERENCES

Martín Arjovsky, L. Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv, abs/1907.02893, 2019.

Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian J. Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv, abs/1810.06758, 2019.

K. Azizzadenesheli, Anqi Liu, Fanny Yang, and Anima Anandkumar. Regularized learning for domain adaptation under label shifts. arXiv, abs/1903.09734, 2019.

Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In ICML, 2020.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2001.

M. Berger, Bernard Gostiaux, and S. Levy. Differential Geometry: Manifolds, Curves, and Surfaces. 1987.

G. Blanchard, Gyemin Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In NIPS, 2011.

Jonathon Byrd and Zachary Chase Lipton. What is the effect of importance weighting in deep learning? In ICML, 2019.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David Kristjanson Duvenaud.
Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

Rémi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J. Gordon. Domain adaptation with conditional distribution matching and generalized label shift. arXiv, abs/2003.04475, 2020.

Yaroslav Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv, abs/1409.7495, 2015.

Wei Gao and Zhi-Hua Zhou. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 59(7):072104, 2016.

Aditya Grover, Jiaming Song, Alekh Agarwal, Kenneth Tran, Ashish Kapoor, E. Horvitz, and S. Ermon. Bias correction of learned generative models using likelihood-free importance weighting. In DGS@ICLR, 2019.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Irina Higgins, David Amos, David Pfau, Sébastien Racanière, Loïc Matthey, Danilo Jimenez Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv, abs/1812.02230, 2018.

Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.

Y. LeCun and Corinna Cortes. The MNIST database of handwritten digits. 2005.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. arXiv, abs/1710.03463, 2018.

Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Loddon Yuille, and Cihang Xie. Shape-texture debiased neural network training. arXiv, abs/2010.05981, 2021.

Z. Liu, Ping Luo, Xiaogang Wang, and X. Tang. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3730–3738, 2015.

Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv, abs/1811.12359, 2019.

Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Conditional adversarial domain adaptation. In NeurIPS, 2018.

David Madras, Elliot Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In ICML, 2018.

Jun Hyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. arXiv, abs/2007.02561, 2020.

Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. In IJCAI, 2017.

Andrew Slavin Ross, Weiwei Pan, and Finale Doshi-Velez. Learning qualitatively diverse and interpretable rules for classification. arXiv, abs/1806.08716, 2018.

Andrew Slavin Ross, Weiwei Pan, Leo Anthony Celi, and Finale Doshi-Velez.
Ensembles of locally independent prediction models. In AAAI, 2020.

Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. arXiv, abs/2006.07710, 2020.

Yuge Shi, Jeffrey S. Seely, Philip H. S. Torr, N. Siddharth, Awni Y. Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. arXiv, abs/2104.09937, 2021.

Rui Shu, H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. arXiv, abs/1802.08735, 2018.

Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon, and Ben Poole. Weakly supervised disentanglement with guarantees. arXiv, abs/1910.09772, 2020.

Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and S. Ermon. Learning controllable fair representations. In AISTATS, 2019.

Masashi Sugiyama, Taiji Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. 2012.

Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior OOD generalization. arXiv, abs/2105.05612, 2021.

Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2962–2971, 2017.

Chong Wang, X. Chen, Alex Smola, and E. Xing. Variance reduction for stochastic gradient optimization. In NIPS, 2013.

Yilun Xu and T. Jaakkola. Learning representations that support robust transfer of predictors. arXiv, abs/2110.09940, 2021.

R. Zemel, Ledell Yu Wu, Kevin Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In ICML, 2013.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251, 2017.

A MORE PROPERTIES OF ORTHOGONAL RANDOM VARIABLES

In this section we demonstrate the properties of orthogonal random variables. We also discuss the existence and uniqueness of the orthogonal classifier when given the principal classifier. Below we list some useful properties of orthogonal random variables.

Proposition 5. Let $Z_1$ and $Z_2$ be mutually orthogonal w.r.t. $P_{XY}$. Then we have:
(i) Invariance: For any diffeomorphism $g$, $g(Z_1)$ and $g(Z_2)$ are also mutually orthogonal.
(ii) Distribution-free: $Z_1, Z_2$ are mutually orthogonal w.r.t. $P_{X'Y}$ if and only if there exists a diffeomorphism between $X$ and $X'$.
(iii) Existence: Suppose the label-conditioned distributions satisfy $p_y \in C^1$ and $\mathrm{supp}(p_y) = \mathcal{X}$ for all $y \in \mathcal{Y}$. If $z_1 \in C^1$ and $z_1(X)$ is diffeomorphic to a Euclidean space, then for every $i \in \mathcal{Y}$ there exists a mapping $z_{2,i}$ such that $(z_1(X), z_{2,i}(X))$ is a diffeomorphism of $X$ and $z_1(X) \perp z_{2,i}(X) \mid Y = i$. Further, if there exist diffeomorphisms between the $z_{2,i}$'s, then the orthogonal r.v. of $z_1(X)$ exists.

In the following theorem, we justify the validity of classifier orthogonalization when $z_1(x)$ satisfies some regularity conditions.

Theorem 2. Suppose $w_1(x)_i = p(Y = i \mid z_1(x))$. If the label-conditioned distributions $p_y$ and $z_1$ satisfy the conditions in Proposition 5(iii), then $w_x\setminus w_1$ is the unique orthogonal classifier of $w_1$.

Note that our construction of $w_x\setminus w_1$ requires neither the exact expression of $z_1(X)$ or its orthogonal r.v., nor the data generating process.
The theorem states that the orthogonal classifier constructed by classifier orthogonalization is the Bayes classifier for any orthogonal random variable of $z_1(X)$.

B.1 PROOF FOR PROPOSITION 1

Proposition 1. Suppose the orthogonal r.v. of $z_1(X)$ w.r.t. $p_{XY}$ exists and is denoted as $z_2(X)$. Then $z(X) = z_2(X)$ is a maximizer of $I(z(X); Y \mid z_1(X))$ subject to $I(z(X); z_1(X) \mid Y) = 0$.

Proof. For any function $z$, by the chain rule we have $I(z_1(X), z_2(X); Y) = I(z_1(X); Y) + I(z_2(X); Y \mid z_1(X))$ and $I(z_1(X), z(X); Y) = I(z_1(X); Y) + I(z(X); Y \mid z_1(X))$. Further, the definition of the orthogonal r.v. ensures that there exists a diffeomorphism $f$ between $X$ and $(z_1(X), z_2(X))$, which implies $I(z_1(X), z(f(z_1(X), z_2(X))); Y) = I(z_1(X), z(X); Y)$. Hence by the data processing inequality we have:

$$I(z_1(X), z_2(X); Y) \ge I(z_1(X), z(f(z_1(X), z_2(X))); Y) = I(z_1(X), z(X); Y)$$
$$\Rightarrow\; I(z_1(X); Y) + I(z_2(X); Y \mid z_1(X)) \ge I(z_1(X); Y) + I(z(X); Y \mid z_1(X))$$
$$\Rightarrow\; I(z_2(X); Y \mid z_1(X)) \ge I(z(X); Y \mid z_1(X)).$$

The above inequality shows that $z = z_2$ maximizes the mutual information $I(z(X); Y \mid z_1(X))$. Further, by the definition of the orthogonal r.v., $z = z_2$ is independent of $z_1$ conditioned on $Y$, i.e., $z(X) \perp z_1(X) \mid Y$, which is equivalent to $I(z(X); z_1(X) \mid Y) = 0$. ∎

B.2 PROOF FOR THEOREM 1

Theorem 1. Assume $p_Y$ is the uniform distribution, $w_x \in \mathcal{W}$ takes values in $(m, 1-m)^C$ with $m \in (0, \frac{1}{2})$, and $1/p_{X|Y}(x|y) \in (0, \gamma) \subset (0, +\infty)$ holds for all $x \in \mathcal{X}, y \in \mathcal{Y}$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:

$$R(\hat{w}_x \setminus w_1) - R(w_x^* \setminus w_1) \le (1 + \gamma)\left(2\mathfrak{R}_{|D|}(\mathcal{W}) + \sqrt{\frac{2\log(1/\delta)}{|D|}}\right).$$

Proof. Denote the empirical risk of $w$ as
$$\hat{R}(w) = -\frac{1}{|D|}\sum_{(x_i, y_i)\in D}\log w(x_i)_{y_i}$$
and the population risk as
$$R(w) = \mathbb{E}_{p(x,y)}[-\log w(x)_y].$$

Step 1: bounding the excess risk $R(\hat{w}_x) - R(w_x^*)$. By the classical result in Theorem 8 of Bartlett & Mendelson (2001), for any $w \in \mathcal{W}$ and $\delta \in (0, 1)$, since $-\log w(x)_y \in (0, \log\frac{1}{m})$, with probability at least $1 - \delta$,

$$\left|\hat{R}(w) - R(w)\right| = \left|\frac{1}{n}\sum_{i=1}^n -\log w(x_i)_{y_i} - \mathbb{E}_p[-\log w(x)_y]\right| \le 2\mathfrak{R}_n(\mathcal{W}) + \sqrt{\frac{2\log(1/\delta)}{n}}.$$

Since $\hat{w}_x = \arg\inf_{w\in\mathcal{W}}\hat{R}(w)$ and $w_x^* = \arg\inf_{w\in\mathcal{W}}R(w)$, and $|\hat{R}(\hat{w}_x) - R(\hat{w}_x)| \le 2\mathfrak{R}_n(\mathcal{W}) + \sqrt{2\log(1/\delta)/n}$, $|\hat{R}(w_x^*) - R(w_x^*)| \le 2\mathfrak{R}_n(\mathcal{W}) + \sqrt{2\log(1/\delta)/n}$, we have

$$R(\hat{w}_x) - R(w_x^*) \le 2\mathfrak{R}_n(\mathcal{W}) + \sqrt{\frac{2\log(1/\delta)}{n}}. \qquad (6)$$

Step 2: bounding $R(\hat{w}_x\setminus w_1) - R(w_x^*\setminus w_1)$ by $R(\hat{w}_x) - R(w_x^*)$. Since $p_Y$ is uniform, Eq. (2) gives, for any $w$,

$$R(w\setminus w_1) = \mathbb{E}_{p(x,y)}\left[-\log\frac{w(x)_y / w_1(x)_y}{\sum_{y'} w(x)_{y'}/w_1(x)_{y'}}\right] = R(w) + \mathbb{E}_{p(x,y)}\left[\log w_1(x)_y + \log\sum_{y'}\frac{w(x)_{y'}}{w_1(x)_{y'}}\right].$$

Then we have:

$$R(\hat{w}_x\setminus w_1) - R(w_x^*\setminus w_1) \le R(\hat{w}_x) - R(w_x^*) + \left|\mathbb{E}_{p(x,y)}\left[\log\sum_{y'}\frac{\hat{w}_x(x)_{y'}}{w_1(x)_{y'}} - \log\sum_{y'}\frac{w_x^*(x)_{y'}}{w_1(x)_{y'}}\right]\right| = R(\hat{w}_x) - R(w_x^*) + \left|\mathbb{E}\log\frac{\sum_{y'}\hat{w}_x(x)_{y'}/w_1(x)_{y'}}{\sum_{y'} w_x^*(x)_{y'}/w_1(x)_{y'}}\right|.$$

We define the ratio $r(x, y) = \frac{\hat{w}_x(x)_y}{w_x^*(x)_y}$, and let $\underline{y}(x) = \arg\min_y r(x, y)$ and $\bar{y}(x) = \arg\max_y r(x, y)$, so that

$$r(x, \underline{y}(x)) \le \frac{\sum_y \hat{w}_x(x)_y / w_1(x)_y}{\sum_y w_x^*(x)_y / w_1(x)_y} \le r(x, \bar{y}(x)).$$

Assume $\frac{1}{p(y|x)} \le \gamma$ for all $x, y$. For any function $y': \mathcal{X} \to \mathcal{Y}$, define the importance weight $\varphi(x, y) = \frac{1}{p(y|x)}$ if $y = y'(x)$ and $\varphi(x, y) = 0$ otherwise. Then

$$\left|\mathbb{E}_x\log r(x, y'(x))\right| = \left|\mathbb{E}_{x,y}\,\varphi(x, y)\log r(x, y)\right| \le \gamma\left|\mathbb{E}_{x,y}\log r(x, y)\right| = \gamma\left(R(\hat{w}_x) - R(w_x^*)\right).$$

As a result,

$$\left|\mathbb{E}\log\frac{\sum_y \hat{w}_x(x)_y/w_1(x)_y}{\sum_y w_x^*(x)_y/w_1(x)_y}\right| \le \max\left(\left|\mathbb{E}\log r(x, \underline{y}(x))\right|, \left|\mathbb{E}\log r(x, \bar{y}(x))\right|\right) \le \gamma\left(R(\hat{w}_x) - R(w_x^*)\right),$$

and hence

$$R(\hat{w}_x\setminus w_1) - R(w_x^*\setminus w_1) \le (1 + \gamma)\left(R(\hat{w}_x) - R(w_x^*)\right). \qquad (7)$$

Combining Eq. (6) and Eq. (7), we know that with probability $1 - \delta$:

$$R(\hat{w}_x\setminus w_1) - R(w_x^*\setminus w_1) \le (1 + \gamma)\left(2\mathfrak{R}_n(\mathcal{W}) + \sqrt{\frac{2\log(1/\delta)}{n}}\right). \;\;\blacksquare$$

B.3 PROOF FOR PROPOSITION 5 AND THEOREM 2

Proposition 5.
Let $Z_1$ and $Z_2$ be mutually orthogonal w.r.t. $P_{XY}$. Then we have:
(i) Invariance: For any diffeomorphism $g$, $g(Z_1)$ and $g(Z_2)$ are also mutually orthogonal.
(ii) Distribution-free: $Z_1, Z_2$ are mutually orthogonal w.r.t. $P_{X'Y}$ if and only if there exists a diffeomorphism between $X$ and $X'$.
(iii) Existence: Suppose the label-conditioned distributions satisfy $p_y \in C^1$ and $\mathrm{supp}(p_y) = \mathcal{X}$ for all $y \in \mathcal{Y}$. If $z_1 \in C^1$ and $z_1(X)$ is diffeomorphic to a Euclidean space, then for every $i \in \mathcal{Y}$ there exists a mapping $z_{2,i}$ such that $(z_1(X), z_{2,i}(X))$ is a diffeomorphism of $X$ and $z_1(X) \perp z_{2,i}(X) \mid Y = i$. Further, if there exist diffeomorphisms between the $z_{2,i}$'s, then the orthogonal r.v. of $z_1(X)$ exists.

Proof. (i) Since $Z_1$ and $Z_2$ are independent given $Y$, the conditional independence also holds for $g(Z_1)$ and $g(Z_2)$. In addition, we construct the diffeomorphism between $(g(Z_1), g(Z_2))$ and $X$ as $\hat{f}(g(Z_1), g(Z_2)) = f(g^{-1}(g(Z_1)), g^{-1}(g(Z_2))) = X$. Clearly $f(g^{-1}(\cdot), g^{-1}(\cdot))$ is a diffeomorphism.

(ii) Denote the diffeomorphism between $X$ and $X'$ as $\hat{f}$. Then the diffeomorphism between $(Z_1, Z_2)$ and $X'$ is $\hat{f}\circ f$. Thus $Z_1, Z_2$ are mutually orthogonal w.r.t. $X'$.

(iii) We use a constructive proof.

Step 1: Denote $\dim(X) = d$ and $\dim(z_1(X)) = k$. The manifold $[z_1(\mathcal{X}), \mathbb{R}^{d-k}]$ is diffeomorphic to $\mathcal{X}$. We denote the diffeomorphism as $f$ and write $f(x) = [z_1(x), t(x)]$, where $z_1: \mathcal{X} \to z_1(\mathcal{X})$ and $t: \mathcal{X} \to \mathbb{R}^{d-k}$.

Step 2: For $X \mid Y = y$, let $F_j(t(x)_j \mid t(x)_1, \dots, t(x)_{j-1}, z_1(x), y): \mathbb{R} \to [0, 1]$, $j \in \{1, \dots, d-k\}$, be the CDF corresponding to the distribution of $t(X)_j$ given $t(X)_1 = t(x)_1, \dots, t(X)_{j-1} = t(x)_{j-1}, z_1(X) = z_1(x), Y = y$. Define

$$h(t(x); z_1(x))_1 = F_1(t(x)_1 \mid z_1(x), y), \qquad h(t(x); z_1(x))_j = F_j(t(x)_j \mid t(x)_1, \dots, t(x)_{j-1}, z_1(x), y),\; j \in \{2, \dots, d-k\}.$$

The conditions $p \in C^1$ and $p(x) > 0$ for all $x \in \mathcal{X}$ ensure that the PDF of $f(X)$ is also in $C^1$ and takes values in $(0, \infty)$; thus the above CDFs are all diffeomorphisms. By the inverse CDF theorem, for any $z_1(x)$, $h(t(X); z_1(X) = z_1(x))$ is uniformly distributed on $(0, 1)^{d-k}$. Hence the random variable defined by the mapping $h$, i.e., $z_{2,y}(X) = h(t(X); z_1(X))$, is independent of $z_1(X)$ given $Y = y$. In addition, by the construction of $z_{2,y}(x)$, there exists a diffeomorphism $\hat{f}$ between $(z_1(X), t(X))$ and $(z_1(X), z_{2,y}(X))$ such that $\hat{f}(z_1(X), t(X)) = (z_1(X), z_{2,y}(X))$. Then $f^{-1}\circ\hat{f}^{-1}(z_1(X), z_{2,y}(X)) = X$ is also a diffeomorphism.

Step 3: From the condition, we know that for every $y \in \mathcal{Y}$ there exists a diffeomorphism $m_y$ such that $m_y\circ z_{2,y}(x) = z_{2,1}(x)$. For every $y$, $z_{2,1}(X) = m_y\circ z_{2,y}(X) \perp z_1(X) \mid Y = y$, since $z_{2,y}(X) \perp z_1(X) \mid Y = y$. Further, $(z_1, z_{2,1})$ is a diffeomorphism. Hence $z_{2,1}(X)$ is the orthogonal r.v. of $z_1(X)$ w.r.t. $P_{XY}$. ∎

Theorem 2. Suppose $w_1(x)_i = p(Y = i \mid z_1(x))$. If the label-conditioned distributions $p_y$ and $z_1$ satisfy the conditions in Proposition 5(iii), then $w_x\setminus w_1$ is the unique orthogonal classifier of $w_1$.

Proof. By the existence property in Proposition 5, we know that there exists an orthogonal r.v. of $z_1(X)$; denote it as $z_2(X)$. By the proposition, there exists a common diffeomorphism $f$ mapping $(z_1(X\mid y), z_2(X\mid y))$ to $X\mid y$ for all $y \in \mathcal{Y}$. Further, by the change of variables and the definition of the orthogonal r.v., writing $p_{i,1}, p_{i,2}$ for the class-$i$ conditional densities of $Z_1, Z_2$,

$$\frac{p_{i,2}(z_2)}{p_{j,2}(z_2)} = \frac{p_{i,1}(z_1)\, p_{i,2}(z_2)\,\mathrm{vol}\,J_f(z_1, z_2)}{p_{j,1}(z_1)\, p_{j,2}(z_2)\,\mathrm{vol}\,J_f(z_1, z_2)}\Big/\frac{p_{i,1}(z_1)}{p_{j,1}(z_1)} = \frac{p_i(x)}{p_j(x)}\Big/\frac{p_{i,1}(z_1)}{p_{j,1}(z_1)} = \frac{w_x(x)_i}{w_x(x)_j}\Big/\frac{w_1(x)_i}{w_1(x)_j}.$$

Thus, via the bijection between classifiers and density ratios, we know that $(w_x\setminus w_1)(x)_i = p(Y = i \mid z_2(x))$. ∎

B.4 PROOF FOR PROPOSITION 2

Proposition 2. The global minimum of $\tilde{L}^{AB}_{OGAN}(G_{AB})$ is achieved if and only if $\tilde{P}_{Z_1,Z_2}(z_1, z_2) = Q_{Z_1}(z_1)P_{Z_2}(z_2)$.

Proof. By definition, $r(x) = \frac{w_2(x)}{1 - w_2(x)} = \frac{P_{Z_2}(z_2(x))}{Q_{Z_2}(z_2(x))}$.
Thus we can reformulate the criterion $\tilde{L}^{AB}_{OGAN}$ as follows:

$$\tilde{L}^{AB}_{OGAN}(G_{AB}) = \mathbb{E}_{x\sim\tilde{P}}\log\frac{\tilde{P}(x)}{\tilde{P}(x) + Q(x)\, r(x)} = \mathbb{E}_{z_1,z_2\sim\tilde{P}}\log\frac{\tilde{P}(z_1, z_2)}{\tilde{P}(z_1, z_2) + Q(z_1)Q(z_2)\frac{P(z_2)}{Q(z_2)}}$$
$$= \mathbb{E}_{z_1,z_2\sim\tilde{P}}\log\frac{\tilde{P}(z_1, z_2)}{\tilde{P}(z_1, z_2) + Q(z_1)P(z_2)} \ge -\log\mathbb{E}_{z_1,z_2\sim\tilde{P}}\left[\frac{\tilde{P}(z_1, z_2) + Q(z_1)P(z_2)}{\tilde{P}(z_1, z_2)}\right] = -\log 2,$$

where we get the lower bound by Jensen's inequality. The equality holds when $\tilde{P}(z_1, z_2) = Q(z_1)P(z_2)$. ∎

B.5 PROOF FOR PROPOSITION 3

Proposition 3. Suppose there exists a random variable $Z_2 = z_2(X)$ orthogonal to $y(X) = Y$ w.r.t. $p(f(X), U)$. Then $f$ achieves the global optimum if and only if it aligns all label-conditioned feature distributions, i.e., $p_s(f(x)\mid y) = p_t(f(x)\mid y)$ for all $y \in \mathcal{Y}$.

Proof. By definition, the optimal full classifier is $w_x(x)_0 = \frac{p_s(f(x))}{p_s(f(x)) + p_t(f(x))}$ and the principal classifier is $w_1(x)_0 = \frac{p_s(y(x))}{p_s(y(x)) + p_t(y(x))}$. By the definition of the orthogonal r.v., $(Y, Z_2)$ and $f(X)$ are in bijection. This implies that, conditioned on $Y$, the supports of $f(X)\mid Y = y$ are non-overlapping for different $y$ in both domains $p_s$ and $p_t$: if $p_s(f(x)\mid y) > 0$, then $p_s(f(x)\mid y') = 0$ for all $y' \ne y$. Thus the orthogonal classifier $w_x\setminus w_1$ satisfies:

$$(w_x\setminus w_1)(x)_0 = \frac{p_s(f(x))/p_s(y(x))}{p_s(f(x))/p_s(y(x)) + p_t(f(x))/p_t(y(x))}$$
$$= \frac{\sum_{y'} p_s(f(x)\mid y')\,p_s(y')/p_s(y(x))}{\sum_{y'} p_s(f(x)\mid y')\,p_s(y')/p_s(y(x)) + \sum_{y'} p_t(f(x)\mid y')\,p_t(y')/p_t(y(x))}$$
$$= \frac{p_s(f(x)\mid y(x))}{p_s(f(x)\mid y(x)) + p_t(f(x)\mid y(x))}.$$

Thus the objective in Eq. (5) can be reformulated as

$$L(f, (w_x\setminus w_1)(\cdot)_0) = \mathbb{E}_{x\sim p_s}\left[\log\frac{p_s(f(x)\mid y(x))}{p_s(f(x)\mid y(x)) + p_t(f(x)\mid y(x))}\right] + \mathbb{E}_{x\sim p_t}\left[\log\frac{p_t(f(x)\mid y(x))}{p_s(f(x)\mid y(x)) + p_t(f(x)\mid y(x))}\right]$$
$$= -\mathbb{E}_{y\sim p_s}\mathbb{E}_{x\sim p_s(x|y)}\left[\log\frac{p_s(f(x)\mid y) + p_t(f(x)\mid y)}{p_s(f(x)\mid y)}\right] - \mathbb{E}_{y\sim p_t}\mathbb{E}_{x\sim p_t(x|y)}\left[\log\frac{p_s(f(x)\mid y) + p_t(f(x)\mid y)}{p_t(f(x)\mid y)}\right]$$
$$\ge -\mathbb{E}_{y\sim p_s}\left[\log\mathbb{E}_{x\sim p_s(x|y)}\frac{p_s(f(x)\mid y) + p_t(f(x)\mid y)}{p_s(f(x)\mid y)}\right] - \mathbb{E}_{y\sim p_t}\left[\log\mathbb{E}_{x\sim p_t(x|y)}\frac{p_s(f(x)\mid y) + p_t(f(x)\mid y)}{p_t(f(x)\mid y)}\right].$$

The lower bound holds due to Jensen's inequality. The equality is achieved if and only if $\frac{p_s(f(x)\mid y)}{p_t(f(x)\mid y)}$ is invariant to $x$ for each $y$, i.e., $p_s(f(x)\mid y) = p_t(f(x)\mid y)$ for all $y \in \mathcal{Y}$. ∎

B.6 PROOF FOR PROPOSITION 4

Proposition 4. If the orthogonal random variable of $U = u(X)$ w.r.t. $p_{XY}$ exists, then the orthogonal classifier $w_x\setminus w_1$ satisfies the criterion of equalized odds.

Proof. We denote the orthogonal random variable as $z_2(X)$, so that $(w_x\setminus w_1)(x)_y = \Pr(y \mid z_2(x))$. Since $z_2(X) \perp U \mid Y$, we know that $(w_x\setminus w_1)(X) \perp U \mid Y$. Hence the prediction of the classifier $w_x\setminus w_1$ is conditionally independent of the sensitive attribute $U$ given the ground-truth label $Y$, which meets the equalized odds criterion. ∎

C EXTRA SAMPLES

[Figure 4: Transferred samples (Input / Vanilla / Oracle) on (a) CMNIST domain A, (b) CMNIST domain B, (c) CelebA-GH domain A, and (d) CelebA-GH domain B.]

D EXTRA EXPERIMENTS

D.1 STYLE TRANSFER

We show transferred samples for both domains of CMNIST and CelebA-GH in Fig. 4.

D.2 STYLE TRANSFER ON CELEBA

Unlike the split in the CelebA-GH dataset, here domain A is males and domain B is females; together the two domains form the full CelebA (Liu et al., 2015) with 202,599 images. Note that there is a large imbalance of hair colors within the male group, with only around 2% of males having blond hair. Table 6 reports the Z1/Z2 accuracies and FID scores on the full CelebA dataset. We find that our method improves style transfer on the original CelebA dataset, especially the Z2 accuracy.

Table 6: CelebA

Method                          Z1 acc.   Z2 acc.   FID
Vanilla                         75.3      87.0      36.2
$w_x\setminus w_1$ (MLDG)       81.8      93.5      35.2
$w_x\setminus w_1$ (TRM)        76.3      92.5      33.5
$w_x\setminus w_1$ (Fish)       74.2      93.3      33.1
IS (Oracle)                     73.3      88.1      36.5
$w_x\setminus w_1$ (Oracle)     84.0      95.2      38.5
It shows that our method can help style transfer in real-world datasets with multiple varying factors.

E EXTRA EXPERIMENT DETAILS

E.1 STYLE TRANSFER

E.1.1 DATASET DETAILS

We randomly sample two digit colors and two background colors for CMNIST. For CelebA-GH, we extract the two subsets, males with non-blond hair and females with blond hair, using the attributes provided in the CelebA dataset. For all datasets, we use 0.8/0.2 proportions to split the train/test sets.

For the CelebA dataset, following the implementation of Zhu et al. (2017) (https://github.com/junyanz/CycleGAN), we use the 128-layer UNet as the generator and the PatchGAN discriminator. For the CMNIST dataset, we use a 6-layer CNN as the generator and a 5-layer CNN as the discriminator. We adopt Adam with learning rate 2e-4 as the optimizer and batch sizes of 128/32 for CMNIST/CelebA.

E.1.2 DETAILS OF MODELS

We craft datasets for training the oracle $w_1(x) = \Pr(Y \mid z_1(x))$. For CMNIST, the oracle dataset has bias degrees 0.9/0. For CelebA-GH, the oracle dataset has {male, female} as classes, and the hair colors under each class are balanced. For each dataset, we construct two environments to train the domain generalization models. For CMNIST, the two environments are datasets with bias degrees 0.9/0.4 and 0.9/0.8. For CelebA-GH, the two environments consist of different groups of equal size: one is {non-blond-haired males, blond-haired females} and the other is {non-blond-haired males, blond-haired females, non-blond-haired females, blond-haired males}. Note that the facial characteristics of gender ($Z_1$), rather than the hair color, form the invariant prediction mechanism across environments.

E.1.3 EVALUATION OF Z1/Z2 ACCURACY

To evaluate the accuracies on the $Z_1, Z_2$ latents, we train two oracle classifiers $w_1^*(x) = \Pr(Y \mid z_1(x))$ and $w_2^*(x) = \Pr(Y \mid z_2(x))$ on two designed datasets. Specifically, $w_1^*/w_2^*$ are trained on datasets $D_1/D_2$ with $Z_1/Z_2$ as the only discriminative direction. For CMNIST, $D_1$ is the dataset with bias degrees 0.9/0 and $D_2$ is the dataset with bias degrees 0/0.8. For CelebA-GH, $D_1$ is the dataset with {male, female} as classes where the hair colors under each class are balanced, and $D_2$ is the dataset with {blond hair, non-blond hair} as classes where the males/females under each class are balanced. The Z1 accuracy on dataset $D$ for a transfer model $G$ is

$$\text{Z1 accuracy} = \frac{1}{|D|}\sum_{x\in D}\mathbb{I}\left(\arg\max_y w_1^*(x)_y \ne \arg\max_y w_1^*(G(x))_y\right),$$

and the Z2 accuracy is

$$\text{Z2 accuracy} = \frac{1}{|D|}\sum_{x\in D}\mathbb{I}\left(\arg\max_y w_2^*(x)_y = \arg\max_y w_2^*(G(x))_y\right),$$

where $\mathbb{I}$ is the indicator function. The Z1 accuracy measures the success rate of transferring the $Z_1$ aspects, and the Z2 accuracy measures the success rate of keeping the $Z_2$ aspects unchanged.

E.2 DOMAIN ADAPTATION

The full names of the datasets are MNIST, MNIST-M, SVHN, GTSRB, SYN-DIGITS (DIGITS), SYN-SIGNS (SIGNS), CIFAR-10 (CIFAR), and STL-10 (STL). We sub-sample each dataset by the designated imbalance ratio to construct the pairs; a sketch of the sub-sampling follows below.

We use the 18-layer neural network in Shu et al. (2018), pre-processing inputs with instance normalization. Following Shu et al. (2018), the smaller CNN is used for digits (MNIST, MNIST-M, SVHN, DIGITS) and traffic signs (SIGNS, GTSRB), and the larger CNN is used for CIFAR-10 and STL-10. We use the Adam optimizer and the hyper-parameters in Appendix B of Shu et al. (2018), except for MNIST→MNIST-M, where we find that setting $\lambda_s = 1, \lambda_d = 1\mathrm{e}{-1}$ largely improves the VADA performance under label shift. Furthermore, we set $\lambda_d = 1\mathrm{e}{-2}$, i.e., the default value suggested by Shu et al. (2018), to enable adversarial feature alignment.
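As referenced above, a hedged sketch of the 70/30 label-shift sub-sampling; names are illustrative assumptions.

```python
# Keep 70% of the first half of the classes and 30% of the second half
# (reversed ratios for the target domain), as described in Section 5.1.
import numpy as np

def label_shift_subsample(labels: np.ndarray, num_classes: int,
                          reverse: bool = False, seed: int = 0) -> np.ndarray:
    """Return indices of a sub-sample with the 70/30 class-half imbalance."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(num_classes):
        ratio = 0.7 if (c < num_classes // 2) ^ reverse else 0.3
        idx = np.flatnonzero(labels == c)
        keep.append(rng.choice(idx, size=int(ratio * len(idx)), replace=False))
    return np.concatenate(keep)

# source_idx = label_shift_subsample(y_source, 10)                # 70/30 split
# target_idx = label_shift_subsample(y_target, 10, reverse=True)  # 30/70 split
```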
E.3 FAIRNESS

The two datasets are the UCI Adult dataset (https://archive.ics.uci.edu/ml/datasets/adult), which has gender as the sensitive attribute, and the UCI German credit dataset (https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29), which has age as the sensitive attribute.

We borrow the encoder, decoder and the final fc layer from Song et al. (2019). For a fair comparison, we use the same encoder + fc layer as the function family $\mathcal{W}$ of our classifier, since our method does not need a decoder for reconstruction. We use Adam with learning rate 1e-3 as the optimizer, and a batch size of 64.

The original ReBias method trains a set of biased models to learn de-biased representations (Bahng et al., 2020), e.g., by designing CNNs with small receptive fields to capture the texture bias. In the fairness experiments, however, the bias is explicitly given, i.e., the sensitive attribute. For a fair comparison, we therefore use the HSIC between the sensitive attribute and the representations as the de-biasing regularizer.

F LEARNING A DIVERSE SET OF CLASSIFIERS

In this section we discuss the differences and advantages of classifier orthogonalization compared to methods that learn a diverse set of classifiers. We also provide a counter-example showing that orthogonal input gradients do not yield a pair of orthogonal classifiers, and that orthogonal classifiers may not lie in the optimal solution set when minimizing an input gradient penalty.

The goals of learning diverse classifiers include interpretability (Ross et al., 2017), overcoming simplicity bias (Teney et al., 2021), improving ensemble performance (Ross et al., 2020), and recovering confounding decision rules (Ross et al., 2018). These methods face a direct trade-off between diversity and accuracy, controlled by a regularization parameter. In contrast, given the principal classifier, our method directly constructs a classifier that uses only the orthogonal variables for prediction; there is no trade-off to tune in this sense. We also focus on novel applications of controlling the orthogonal directions, such as style transfer, domain adaptation and fairness. Further, our notion of orthogonality is defined via latent orthogonal variables that relate to inputs via a diffeomorphism (Definition 1), as opposed to directly in the input space.

Although learning diverse classifiers through input gradient penalties is efficient, it does not guarantee orthogonality in the sense we define it. We provide an illustrative counter-example in Sec. F.1 below, where the inputs are neither disentangled nor have orthogonal input gradients, yet correspond to a pair of orthogonal classifiers in our sense. We also prove that, given the principal classifier, minimizing the loss with an input gradient penalty does not necessarily lead to an orthogonal classifier.

We note that one clear limitation of our classifier orthogonalization procedure is that we need access to the full classifier. Learning this full classifier can be impacted by simplicity bias (Shah et al., 2020), which could prevent it from relying on all the relevant features. This limitation can be addressed by leveraging previous efforts (Ross et al., 2020; Teney et al., 2021) to mitigate simplicity bias when training the full classifier.

F.1 RELATIONSHIP TO ORTHOGONAL INPUT GRADIENTS

Orthogonality constraints on input gradients demonstrate good performance in learning a set of diverse classifiers (Ross et al., 2017; 2018; 2020; Teney et al., 2021). These methods typically use the dot product of input gradients between pairs of classifiers as the regularizer.
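One common instantiation of this regularizer, and the form used in the analysis below, is the squared cosine similarity between input gradients. A minimal PyTorch sketch, with our own names and assuming binary classifiers that return $\Pr(Y=1\mid x)$:

```python
import torch
import torch.nn.functional as F

def input_gradient_penalty(w1, w, x):
    """E[cos^2(grad_x w1(x), grad_x w(x))] over a batch x of shape (batch, ...):
    the diversity regularizer discussed above. w1 and w map inputs to
    P(Y=1|x) of shape (batch,); w1 is the (frozen) principal classifier."""
    x = x.clone().requires_grad_(True)
    g1 = torch.autograd.grad(w1(x).sum(), x, create_graph=True)[0].flatten(1)
    g = torch.autograd.grad(w(x).sum(), x, create_graph=True)[0].flatten(1)
    return (F.cosine_similarity(g1, g, dim=1) ** 2).mean()

# Training objective L_{w1}(w) = L_c(w) + lambda * penalty, e.g.:
# loss = F.binary_cross_entropy(w(x), y) + lam * input_gradient_penalty(w1, w, x)
```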
We highlight that, in the presence of latent orthogonal random variables, orthogonal-gradient constraints in the input space no longer guarantee learning the orthogonal classifier. We demonstrate this on a simple non-linear example below.

Consider a binary classification problem with the following data generating process. The label $Y$ is uniformly distributed in $\{-1, 1\}$. Conditioned on the label, we have two independent latents $Z_1\in\mathbb{R}^k$ and $Z_2\in\mathbb{R}^m$ such that

$$Z_1\mid Y = y \sim \mathcal{N}(y\mu_1, I_k), \qquad Z_2\mid Y = y \sim \mathcal{N}(y\mu_2, I_m),$$

where $\mu_1\in\mathbb{R}^k$ and $\mu_2\in\mathbb{R}^m$ are the means of the latent variables, and $k > m$. The data $X = (X_1, X_2)$ is generated from the latents via the following diffeomorphism:

$$X_1 = g\big(Z_1 - (Z_2, 0_{k-m})\big), \qquad X_2 = Z_2,$$

where $g$ is an arbitrary non-linear diffeomorphism on $\mathbb{R}^k$ and $(Z_2, 0_{k-m})$ pads $Z_2$ with $k-m$ zeros. We can see that $Z_1, Z_2$ are mutually orthogonal w.r.t. the distribution $p_{XY}$. Now consider the Bayes-optimal classifiers $w_1, w_2$ using variables $Z_1, Z_2$ respectively:

$$w_1(x) = p_{Y\mid Z_1}(1\mid z_1) = \sigma(2\mu_1^T z_1), \qquad w_2(x) = p_{Y\mid Z_2}(1\mid z_2) = \sigma(2\mu_2^T z_2),$$

where $\sigma$ is the sigmoid function. Apparently, they are a pair of orthogonal classifiers. Besides, we denote by $w_3$ the classifier that uses only the first $m$ dimensions of $z_1$ for prediction, i.e., $w_3(x) = p_{Y\mid (Z_1)_m}(1\mid (z_1)_m) = \sigma(2(\mu_1)_m^T (z_1)_m)$.

Given the principal classifier $w_1$, we consider using the loss function in Ross et al. (2018; 2020) to learn the orthogonal classifier of $w_1$, adding an input gradient penalty to the standard classification loss:

$$\mathcal{L}_{w_1}(w) = \mathcal{L}_c(w) + \lambda\,\mathbb{E}\big[\cos^2\!\big(\nabla_x w_1(x), \nabla_x w(x)\big)\big],$$

where $\mathcal{L}_c$ is the expected cross-entropy loss, $\cos$ is the cosine similarity, and $\lambda$ is the hyper-parameter for the gradient penalty term (Ross et al., 2018; 2020). Below we show that (1) the orthogonal classifier $w_2$ does not necessarily satisfy $\mathbb{E}[\cos^2(\nabla_x w_1(x), \nabla_x w_2(x))] = 0$; and (2) the orthogonal classifier $w_2$ is not necessarily the minimizer of $\mathcal{L}_{w_1}(w)$.

Proposition 6. For the above $(Z_1, Z_2)$, the corresponding Bayes classifiers $w_1, w_2$ satisfy $\mathbb{E}[\nabla_x w_1(x)^T\nabla_x w_2(x)] = 0$ if and only if $(\mu_1)_m^T\mu_2 = 0$.

Proof. Let $X_1 = g(Z_1 - (Z_2, 0_{k-m}))$ and $X_2 = Z_2$ be the two components of $X$. Note that $z_1 = g^{-1}(x_1) + (x_2, 0_{k-m})$. By the chain rule, the input gradient of $w_1$ is

$$\nabla_x w_1(x) = \Big(\frac{dz_1}{dx}\Big)^{\!T}\nabla_{z_1}\Pr(Y=1\mid Z_1 = z_1(x)) = \begin{pmatrix}\nabla_{x_1}g^{-1}(x_1)\\ I_{m\times k}\end{pmatrix}_{(k+m)\times k}\frac{e^{-2\mu_1^T z_1(x)}}{\big(1+e^{-2\mu_1^T z_1(x)}\big)^2}\,2\mu_1,$$

where $I_{m\times k}$ is the submatrix of the first $m$ rows of $I_{k\times k}$. Similarly,

$$\nabla_x w_2(x) = \begin{pmatrix}0\\ I_{m\times m}\end{pmatrix}_{(k+m)\times m}\frac{e^{-2\mu_2^T z_2(x)}}{\big(1+e^{-2\mu_2^T z_2(x)}\big)^2}\,2\mu_2.$$

Together, the dot product between the input gradients is

$$\nabla_x w_1(x)^T\nabla_x w_2(x) = \frac{2e^{-2\mu_1^T z_1(x)}\mu_1^T}{\big(1+e^{-2\mu_1^T z_1(x)}\big)^2}\begin{pmatrix}\nabla_{x_1}g^{-1}(x_1)\\ I_{m\times k}\end{pmatrix}^{\!T}\begin{pmatrix}0\\ I_{m\times m}\end{pmatrix}\frac{2e^{-2\mu_2^T z_2(x)}\mu_2}{\big(1+e^{-2\mu_2^T z_2(x)}\big)^2} = \frac{4e^{-2\mu_1^T z_1(x)-2\mu_2^T z_2(x)}}{\big(1+e^{-2\mu_1^T z_1(x)}\big)^2\big(1+e^{-2\mu_2^T z_2(x)}\big)^2}\,(\mu_1)_m^T\mu_2.$$

Since the scalar factor is strictly positive, $\mathbb{E}[\nabla_x w_1(x)^T\nabla_x w_2(x)] = 0$ if and only if $(\mu_1)_m^T\mu_2 = 0$.

Proposition 6 shows that the expected input-gradient product of the two Bayes classifiers of the underlying latents is non-zero unless the linear decision boundaries on $z_1, z_2$ are orthogonal. Hence, given the principal classifier $w_1$, the expected dot product of input gradients between $w_1$ and its orthogonal classifier $w_2$ does not have to be zero.
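Proposition 6 is easy to check numerically. The sketch below is ours, not from the paper: it instantiates the example with a linear diffeomorphism $g$ for concreteness, reusing the matrix and $\mu_1$ from the concrete instance at the end of this section, and verifies with autograd that the input-gradient dot product is a positive multiple of $(\mu_1)_m^T\mu_2$:

```python
import torch

k, m = 3, 2
mu1 = torch.tensor([1., -1., 1.])          # (mu1)_m^T mu2 = 2 != 0
mu2 = torch.tensor([1., -1.])
A = torch.tensor([[2., 3., 3.],
                  [1., 1., 5.],
                  [1., 1., 1.]])           # take g^{-1}(x1) = A x1, a linear diffeomorphism

def z1_of(x):                              # z1 = g^{-1}(x1) + (x2, 0_{k-m})
    x1, x2 = x[:k], x[k:]
    return A @ x1 + torch.cat([x2, torch.zeros(k - m)])

def w1(x):                                 # Bayes classifier based on Z1
    return torch.sigmoid(2 * mu1 @ z1_of(x))

def w2(x):                                 # Bayes classifier based on Z2 (= x2)
    return torch.sigmoid(2 * mu2 @ x[k:])

x = torch.randn(k + m, requires_grad=True)
g1 = torch.autograd.grad(w1(x), x)[0]
g2 = torch.autograd.grad(w2(x), x)[0]
print(g1 @ g2)   # strictly positive at every x, since (mu1)_m^T mu2 = 2 > 0
```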
Proposition 7. The orthogonal classifier $w_2$ does not minimize $\mathcal{L}_{w_1}(w)$ if $(\mu_1)_m = \mu_2$ and $\mu_1^T\nabla_{x_1}g^{-1}(x_1)^T\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}(\mu_1)_m < 0$. In particular, we show that $\mathcal{L}_{w_1}(w_3) < \mathcal{L}_{w_1}(w_2)$ in this case.

Proof. Recall that $w_3(x) = p_{Y\mid (Z_1)_m}(1\mid (z_1)_m)$. The input gradient of $w_3$ is

$$\nabla_x w_3(x) = \Big(\frac{d(z_1)_m}{dx}\Big)^{\!T}\nabla_{(z_1)_m}\Pr\big(Y = 1\mid (Z_1)_m = (z_1(x))_m\big) = \begin{pmatrix}\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}\\ I_{m\times m}\end{pmatrix}_{(k+m)\times m}\frac{e^{-2(\mu_1)_m^T(z_1(x))_m}}{\big(1+e^{-2(\mu_1)_m^T(z_1(x))_m}\big)^2}\,2(\mu_1)_m,$$

where $(\nabla_{x_1}g^{-1}(x_1))_{k\times m}$ denotes the first $m$ columns of $\nabla_{x_1}g^{-1}(x_1)$. When $(\mu_1)_m = \mu_2$, the dot product between $\nabla_x w_1(x)$ and $\nabla_x w_3(x)$ is

$$\nabla_x w_1(x)^T\nabla_x w_3(x) = \frac{2e^{-2\mu_1^T z_1(x)}\mu_1^T}{\big(1+e^{-2\mu_1^T z_1(x)}\big)^2}\begin{pmatrix}\nabla_{x_1}g^{-1}(x_1)\\ I_{m\times k}\end{pmatrix}^{\!T}\begin{pmatrix}\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}\\ I_{m\times m}\end{pmatrix}\frac{2e^{-2(\mu_1)_m^T(z_1(x))_m}(\mu_1)_m}{\big(1+e^{-2(\mu_1)_m^T(z_1(x))_m}\big)^2}$$
$$= \frac{4e^{-2\mu_1^T z_1(x)-2(\mu_1)_m^T(z_1(x))_m}}{\big(1+e^{-2\mu_1^T z_1(x)}\big)^2\big(1+e^{-2(\mu_1)_m^T(z_1(x))_m}\big)^2}\Big(\|\mu_2\|_2^2 + \mu_1^T\nabla_{x_1}g^{-1}(x_1)^T\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}(\mu_1)_m\Big).$$

When $(\mu_1)_m = \mu_2$, we know that $(z_1(x))_m$ and $z_2(x)$ are equally predictive of the label, and thus $\mathcal{L}_c(w_3) = \mathcal{L}_c(w_2)$. Note that if $\mu_1^T\nabla_{x_1}g^{-1}(x_1)^T(\nabla_{x_1}g^{-1}(x_1))_{k\times m}(\mu_1)_m < 0$, we have $\mathbb{E}[\cos^2(\nabla_x w_1(x), \nabla_x w_2(x))] > \mathbb{E}[\cos^2(\nabla_x w_1(x), \nabla_x w_3(x))]$. Since the positive scalar factors cancel in the cosine,

$$\mathbb{E}\big[\cos^2(\nabla_x w_1(x), \nabla_x w_2(x))\big] = \mathbb{E}\left[\frac{\|\mu_2\|_2^4}{\Big\|\begin{pmatrix}\nabla_{x_1}g^{-1}(x_1)\\ I_{m\times k}\end{pmatrix}\mu_1\Big\|_2^2\,\Big\|\begin{pmatrix}0\\ I_{m\times m}\end{pmatrix}\mu_2\Big\|_2^2}\right] > \mathbb{E}\left[\frac{\Big(\|\mu_2\|_2^2 + \mu_1^T\nabla_{x_1}g^{-1}(x_1)^T\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}(\mu_1)_m\Big)^2}{\Big\|\begin{pmatrix}\nabla_{x_1}g^{-1}(x_1)\\ I_{m\times k}\end{pmatrix}\mu_1\Big\|_2^2\,\Big\|\begin{pmatrix}\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}\\ I_{m\times m}\end{pmatrix}\mu_2\Big\|_2^2}\right] = \mathbb{E}\big[\cos^2(\nabla_x w_1(x), \nabla_x w_3(x))\big].$$

The strict inequality holds because the negative cross term shrinks the numerator, and because the denominator grows:

$$\Big\|\begin{pmatrix}\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}\\ I_{m\times m}\end{pmatrix}\mu_2\Big\|_2^2 = \Big\|\big(\nabla_{x_1}g^{-1}(x_1)\big)_{k\times m}\mu_2\Big\|_2^2 + \|\mu_2\|_2^2 \geq \Big\|\begin{pmatrix}0\\ I_{m\times m}\end{pmatrix}\mu_2\Big\|_2^2.$$

To verify that $\mu_1^T\nabla_{x_1}g^{-1}(x_1)^T(\nabla_{x_1}g^{-1}(x_1))_{k\times m}(\mu_1)_m < 0$ does occur, pick $k = 3$, $m = 2$,

$$g^{-1}(x) = \begin{pmatrix}2 & 3 & 3\\ 1 & 1 & 5\\ 1 & 1 & 1\end{pmatrix}x, \qquad \mu_1 = (1, -1, 1)^T.$$

We have $\mu_1^T\nabla_{x_1}g^{-1}(x_1)^T(\nabla_{x_1}g^{-1}(x_1))_{k\times m}(\mu_1)_m = -2$ in this case. Together, we have $\mathcal{L}_{w_1}(w_3) < \mathcal{L}_{w_1}(w_2)$.

Proposition 7 shows that the orthogonal classifier $w_2$ is not the minimizer of the loss with the input gradient penalty. By a similar analysis, the results above also hold for the un-normalized version of the input gradient penalty in Teney et al. (2021).
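A quick numeric check of the concrete instance above (ours; it reads $\nabla_{x_1}g^{-1}$ as the displayed matrix, which is constant since this $g^{-1}$ is linear):

```python
import numpy as np

A = np.array([[2., 3., 3.],
              [1., 1., 5.],
              [1., 1., 1.]])   # the matrix defining g^{-1}(x) = Ax
mu1 = np.array([1., -1., 1.])
m = 2

# mu1^T (grad g^{-1})^T (grad g^{-1})_{k x m} (mu1)_m, with grad g^{-1} read as A
val = mu1 @ A.T @ A[:, :m] @ mu1[:m]
print(val)                     # -2.0, so the condition of Proposition 7 holds
```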