# Oblivious Data for Fairness with Kernels

Journal of Machine Learning Research 22 (2021) 1-36. Submitted 11/20; Revised 5/21; Published 8/21.

Steffen Grünewälder (S.GRUNEWALDER@LANCASTER.AC.UK)
Azadeh Khaleghi (A.KHALEGHI@LANCASTER.AC.UK)
Department of Mathematics and Statistics, Lancaster University, Lancaster, UK

Editor: Massimiliano Pontil

Abstract

We investigate the problem of algorithmic fairness in the case where sensitive and non-sensitive features are available and one aims to generate new, oblivious, features that closely approximate the non-sensitive features, and are only minimally dependent on the sensitive ones. We study this question in the context of kernel methods. We analyze a relaxed version of the Maximum Mean Discrepancy criterion which does not guarantee full independence but makes the optimization problem tractable. We derive a closed-form solution for this relaxed optimization problem and complement the result with a study of the dependencies between the newly generated features and the sensitive ones. Our key ingredient for generating such oblivious features is a Hilbert-space-valued conditional expectation, which needs to be estimated from data. We propose a plug-in approach and demonstrate how the estimation errors can be controlled. While our techniques help reduce the bias, we would like to point out that no post-processing of any dataset could possibly serve as an alternative to well-designed experiments.

Keywords: Algorithmic Fairness, Kernel Methods

1. Introduction

Machine learning algorithms trained on historical data may inherit implicit biases which can in turn lead to potentially unfair outcomes for some individuals or minority groups. For instance, gender bias may be present in a historical dataset on which a model is trained to automate the postgraduate admission process at a university. This may in turn render the algorithm biased, leading it to inadvertently generate unfair decisions. In recent years, a large body of work has been dedicated to systematically addressing this problem, whereby various notions of fairness have been considered, see, e.g. (Calders et al., 2009; Zemel et al., 2013; Louizos et al., 2015; Hardt et al., 2016; Joseph et al., 2016; Kilbertus et al., 2017; Kusner et al., 2017; Calmon et al., 2017; Zafar et al., 2017; Kleinberg et al., 2017; Donini et al., 2018; Madras et al., 2018; Oneto et al., 2020), and references therein. Among the several algorithmic fairness criteria, one important objective is to ensure that a model's prediction is not influenced by the presence of sensitive information in the data. In this paper, we address this objective from the perspective of (fair) representation learning. Thus, a central question which forms the basis of our work is as follows. Can the observed features be replaced by close approximations that are independent of the sensitive ones?

More formally, assume that we have a dataset such that each data-point is a realization of a random variable $(X, S)$, where $S$ and $X$ are in turn vector-valued random variables corresponding to the sensitive and non-sensitive features respectively.
We further allow $X$ and $S$ to be arbitrarily dependent, and ask whether it is possible to generate a new random variable $Z$ which is ideally independent of $S$ and close to $X$ in some meaningful probabilistic sense. As an initial step, we may assume that $X$ is zero-mean, and aim for decorrelation between $Z$ and $S$. This can be achieved by letting $Z = X - \mathbb{E}^S X$, where $\mathbb{E}^S X$ is the conditional expectation of $X$ given $S$. The random variable $Z$ so-defined is not correlated with $S$ and is close to $X$. In particular, it recovers $X$ if $X$ and $S$ are independent. In fact, under mild assumptions, $Z$ gives the best approximation (in the mean-squared sense) of $X$ while being uncorrelated with $S$. Observe that while the distribution of $Z$ differs from that of $X$, this new random variable seems to serve the purpose well. For instance, if $S$ corresponds to a subject's gender and $X$ to a subject's height, then $Z$ corresponds to the height of the subject centered around the average height of the class corresponding to the subject's gender.
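To make the linear construction concrete, the following minimal NumPy sketch (our illustration; the variable names and the toy distribution are ours, not from the paper) computes group-mean-centered features for a binary sensitive attribute, using the mean-restored variant $Z = X - \mathbb{E}^S X + \mathbb{E}X$ that reappears in Section 5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: S is a binary sensitive attribute, X a feature whose mean
# differs between the two groups (e.g. height by gender).
n = 10_000
S = rng.integers(0, 2, size=n)
X = rng.normal(loc=170.0 + 8.0 * S, scale=6.0)

# Plug-in estimate of the conditional expectation E[X | S]: for a binary
# S this is simply the per-group sample mean.
group_mean = np.array([X[S == s].mean() for s in (0, 1)])
E_S_X = group_mean[S]

# Z = X - E[X|S] + E[X]: uncorrelated with S, and Z recovers X (up to
# estimation error) when X and S are independent.
Z = X - E_S_X + X.mean()

print("corr(X, S) =", np.corrcoef(X, S)[0, 1])  # clearly non-zero
print("corr(Z, S) =", np.corrcoef(Z, S)[0, 1])  # approximately 0
```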
The key contributions of this work, briefly summarized below, are theoretical; we also provide an evaluation of the proposed approach through experiments in the context of classification and regression.¹ Before giving an overview of our results, we would also like to point out that while our techniques help reduce the bias, it is important to note that no post-processing of any dataset could possibly serve as an alternative to well-designed experiments.

Contributions. Building upon this intuition, and using results inspired by testing for independence using the Maximum Mean Discrepancy (MMD) criterion (see e.g. (Gretton et al., 2008)), we obtain a related optimization problem in which $X$ and $\mathbb{E}^S X$ are replaced with Hilbert-space-valued random variables and Hilbert-space-valued conditional expectations. While the move to Hilbert spaces does not enforce complete independence between the new features and the sensitive features, it helps to significantly reduce the dependencies between the features. The new features $\mathbf{Z}$ have various useful properties which we explore in this paper (the bold notation $\mathbf{Z}$ is used to highlight that the features attain values in a Hilbert space and not in the space in which our samples lie). They are also easy to generate from samples $(X_1, S_1), \ldots, (X_n, S_n)$. The main challenge in generating the oblivious features $\mathbf{Z}_1, \ldots, \mathbf{Z}_n$ is that we do not have access to the Hilbert-space-valued conditional expectation and need to estimate it from data. Since we are concerned with Reproducing Kernel Hilbert Spaces (RKHSs) here, we use the reproducing property to extend the plug-in approach of Grünewälder (2018) to the RKHS setting and tackle the estimation problem. We further show how estimation errors can be controlled. After obtaining the empirical estimates of the conditional expectations, we generate oblivious features as well as an oblivious kernel matrix to be used as input to any kernel method. This guarantees a significant reduction in the dependence between the predictions and the sensitive features. We cast the objective of finding oblivious features $\mathbf{Z}$ which approximate the original features $X$ well, while maintaining minimal dependence on the sensitive features $S$, as a constrained optimization problem. Making use of Hilbert-space-valued conditional expectations, we provide a closed-form solution to the proposed optimization problem. Specifically, we first prove that our solution satisfies the constraint of the optimization problem at hand, and show via Proposition 4 that it is indeed optimal. Through Proposition 2 we relate the strength of the dependencies between $\mathbf{Z}$ and $S$ to how close $\mathbf{Z}$ lies to the low-dimensional manifold corresponding to the image of the input space under the feature map $\phi$. This result is key in providing some insight into the interplay between probabilistic independence and approximations in the Hilbert space. We extend known estimators for real-valued conditional expectations to estimate those taking values in a Hilbert space, and show via Proposition 5 how to control their estimation errors. This result in itself may be of independent interest in future research concerning Hilbert-space-valued conditional expectations. We provide a method to generate the oblivious features and the oblivious kernel matrix, which can be used instead of the kernel matrix to reduce the dependence of the prediction on the sensitive features; the computational complexity of the approach is $O(n^2)$.

¹ Our implementations are available at https://github.com/azalk/Oblivious.git.

Related Work. Among the vast literature on algorithmic fairness, (Donini et al., 2018; Madras et al., 2018; Oneto et al., 2020), which fit into the larger body of work on fair representation learning, are closest to our approach. Madras et al. (2018) describe a general framework for fair representation learning. Their approach is inspired by generative adversarial networks and is based on a game played between generative models and adversarial evaluations. Depending on which function classes one considers for the generative models and for the adversarial evaluations, one can describe a vast array of approaches. Interestingly, it is possible to interpret our approach in this general context: the encoder $f$ corresponds to a map from $X$ and $S$ to $\mathcal{H}$, where our new features $\mathbf{Z}$ live. We do not have a decoder but compare features directly (one could also take our decoder to be the identity map). Our adversary is different from that used by Madras et al. (2018). In their approach a regressor is inferred which maps the features to the sensitive features, while we compare sensitive features and new features by applying test functions to them. The regression approach performs well in their context because they only consider finitely many sensitive features. In the more general framework considered in the present paper, where the sensitive features are allowed to take on continuous values, this approach would be sub-optimal since it cannot capture all dependencies. Finally, we ignore labels when inferring new features. It is also worth pointing out that our approach is not based on a game played between generative models and an adversary; instead we provide closed-form solutions. On the other hand, while the focus of Donini et al. (2018) is mostly on empirical risk minimization under fairness constraints, the authors briefly discuss representation learning for fairness as well. In particular, Equation (13) in their paper effectively describes a conditional expectation in Hilbert space, though it is not denoted or motivated as such. The conditional expectation is based on the binary features $S$ only, and the construction is applied in the linear kernel context to derive new features.
The paper does not go beyond the linear case for representation learning, but there is a clear link to the more general notions of conditional expectation on which we base our work. We discuss the relation to (Donini et al., 2018) in detail in Section 6.5, where we also show how their approach can be extended beyond binary sensitive features by making use of our conditional expectation estimates. In (Oneto et al., 2020) the demographic parity constraint is considered in a multi-task setting. The sensitive features have to be discrete in this setting, attaining only finitely many values. The tasks share an unknown representation and it is assumed that this representation fulfills the demographic parity constraint. The authors propose an optimization problem that minimizes the sum of a multi-task loss function (for classification and regression) and a metric that quantifies how well demographic parity is achieved. The parameters of this optimization problem are the potential representations and task-specific functions. Both the Sinkhorn divergence and MMD are considered as metrics to quantify demographic parity. The focus is on a case where the potential representations are one-layer networks and the task-specific functions are linear.

Organization. The rest of the paper is organized as follows. In Section 2 we introduce our notation and provide preliminary definitions used in the paper. Our problem formulation and optimization objective are stated in Section 3. As part of the formulation we also define the notion of H-independence between Hilbert-space-valued features and the sensitive features. In Section 4 we study the relation between H-independence and bounds on the dependencies between oblivious and sensitive features. In Section 5 we provide a solution to the optimization objective. In Section 6 we derive an estimator for the conditional expectation and use it to generate oblivious features and the oblivious kernel matrix. We provide some empirical evaluations in Section 7, and end with some concluding remarks and open directions in Section 8.

2. Preliminaries

In this section we introduce some notation and basic definitions. Consider a probability space $(\Omega, \mathcal{A}, P)$. For any $A \in \mathcal{A}$ we let $\chi_A : \Omega \to \{0, 1\}$ be the indicator function such that $\chi_A(\omega) = 1$ if, and only if, $\omega \in A$. Let $\mathcal{X}$ be a measurable space in which a random variable $X : \Omega \to \mathcal{X}$ takes values, and let $\mathcal{S}$ be a measurable space in which the random variable $S : \Omega \to \mathcal{S}$ takes values. We denote by $\sigma(X)$ the $\sigma$-algebra generated by $X$. Let $\mathcal{H}$ be an RKHS composed of functions $h : \mathcal{X} \to \mathbb{R}$ and denote its feature map by $\phi : \mathcal{X} \to \mathcal{H}$, where $\phi(x) = k(x, \cdot)$ for some positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. As follows from the reproducing property of $\mathcal{H}$, we have $\langle \phi(x), h \rangle = h(x)$ for all $h \in \mathcal{H}$. Moreover, observe that $\phi(X)$ is in turn a random variable attaining values in $\mathcal{H}$. In Appendix A we provide some technical details concerning Hilbert-space-valued random variables such as $\phi(X)$.
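As a quick sanity check of the reproducing property $\langle \phi(x), h \rangle = h(x)$, the following sketch uses a kernel with an explicit finite-dimensional feature map; the specific kernel and all names are our choice for illustration and are not taken from the paper:

```python
import numpy as np

# k(x, y) = (1 + x*y)^2 on R has the explicit 3-dimensional feature map
# phi(x) = (1, sqrt(2)*x, x^2), since <phi(x), phi(y)> = (1 + x*y)^2.
def k(x, y):
    return (1.0 + x * y) ** 2

def phi(x):
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

rng = np.random.default_rng(1)
xs = rng.normal(size=5)       # expansion points
alpha = rng.normal(size=5)    # coefficients of h = sum_i alpha_i phi(x_i)
h = sum(a * phi(xi) for a, xi in zip(alpha, xs))

x = 0.7
h_of_x = sum(a * k(xi, x) for a, xi in zip(alpha, xs))  # h(x) via kernel sums
print(np.dot(phi(x), h), h_of_x)  # equal: <phi(x), h> = h(x)
```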
Conditional Expectation. For the random variables $X$ and $S$ defined above, we denote by $\mathbb{E}^S X$ the random variable corresponding to Kolmogorov's conditional expectation of $X$ given $S$, i.e. $\mathbb{E}^S X = \mathbb{E}(X \mid \sigma(S))$, see, e.g. (Shiryaev, 1989). Recall that in the special case where $\mathcal{S} = \{0, 1\}$ we simply have

$$\mathbb{E}^S X = \mathbb{E}(X \mid S = 0)\,\chi\{S = 0\} + \mathbb{E}(X \mid S = 1)\,\chi\{S = 1\},$$

where $\mathbb{E}(X \mid S = i)$ is the familiar conditional expectation of $X$ given the event $\{S = i\}$ for $i = 0, 1$. Thus, in this case, the random variable $\mathbb{E}^S X$ is equal to $\mathbb{E}(X \mid S = 0)$ if $S$ attains value $0$, and is equal to $\mathbb{E}(X \mid S = 1)$ otherwise. Note that this example is for illustration only, and that $X$ and $S$ may be arbitrary random variables: they are not required to be binary or discrete-valued. Unless otherwise stated, in this paper we use Kolmogorov's notion of conditional expectation. We will also be concerned with conditional expectations that attain values in a Hilbert space $\mathcal{H}$, which mostly behave like real-valued conditional expectations (see (Pisier, 2016) and Appendix B for details). Next, we introduce Hilbert-space-valued $L^2$-spaces, which play a prominent role in our results.

Hilbert-space-valued $L^2$-spaces. For a Hilbert space $\mathcal{H}$, we denote by $\mathcal{L}^2(\mathcal{H}) = \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ the $\mathcal{H}$-valued $\mathcal{L}^2$ space. If $\mathcal{H}$ is an RKHS with a bounded and measurable kernel function then $\phi(X)$ is an element of $\mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$. The space $\mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ consists of all (Bochner-)measurable functions $\mathbf{X}$ from $\Omega$ to $\mathcal{H}$ such that $E(\|\mathbf{X}\|^2) < \infty$ (see Appendix A for more details). We call these functions random variables or Hilbert-space-valued random variables and denote them with bold capital letters. As in the scalar case, we have a corresponding space of equivalence classes which we denote by $L^2(\Omega, \mathcal{A}, P; \mathcal{H})$. For $\mathbf{X}, \mathbf{Y} \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ we write $\bar{\mathbf{X}}, \bar{\mathbf{Y}}$ for the corresponding equivalence classes in $L^2(\Omega, \mathcal{A}, P; \mathcal{H})$. The space $L^2(\Omega, \mathcal{A}, P; \mathcal{H})$ is itself a Hilbert space with norm and inner product given by $\|\bar{\mathbf{X}}\|_2^2 = E(\|\mathbf{X}\|^2)$ and $\langle \bar{\mathbf{X}}, \bar{\mathbf{Y}} \rangle_2 = E(\langle \mathbf{X}, \mathbf{Y} \rangle)$, where we use a subscript to distinguish this norm and inner product from the ones of $\mathcal{H}$. The norm and inner product have a corresponding pseudo-norm and bilinear form acting on $\mathcal{L}^2(\mathcal{H})$, and we also denote these by $\|\cdot\|_2$ and $\langle \cdot, \cdot \rangle_2$.

Figure 1: (a) The three main random variables in Problem 1. The non-sensitive features $X$ attain values in $\mathcal{X}$ and are mapped onto the RKHS $\mathcal{H}$ through the feature map $\phi$; the sensitive features $S$ attain values in $\mathcal{S}$, and $\mathbf{Z}$ attains values in $\mathcal{H}$. All three random variables are defined on the same probability space $(\Omega, \mathcal{A}, P)$. (b) The image of $\mathcal{X}$ under $\phi$ is sketched (blue curve). This is a subset of $\mathcal{H}$ whose projection onto the subspace spanned by two orthonormal basis elements $e_1$ and $e_2$ is shown. The set $\phi[\mathcal{X}]$ is a low-dimensional manifold if $\phi$ is continuous. The element $h = E(\phi(X))$ lies in the convex hull of $\phi[\mathcal{X}]$. Intuitively, if $\mathbf{Z}$ attains values mainly in the gray shaded low-dependence region around $\phi[\mathcal{X}]$ then $\mathbf{Z}$ is only weakly dependent on $S$.

3. Problem Formulation

We formulate the problem as follows. Given two random variables $X : \Omega \to \mathcal{X}$ and $S : \Omega \to \mathcal{S}$ corresponding to non-sensitive and sensitive features in a dataset, we wish to devise a random variable $Z : \Omega \to \mathcal{X}$ which is independent of $S$ and closely approximates $X$, in the sense that for all $Z' : \Omega \to \mathcal{X}$ that are independent of $S$ we have

$$\|Z - X\|_2 \le \|Z' - X\|_2. \qquad (1)$$

Dependencies between random variables can be very subtle and difficult to detect. Similarly, completely removing the dependence of $X$ on $S$ without changing $X$ drastically is an intricate task that is rife with difficulties. Thus, we aim for a more tractable objective, described below, which still gives us control over the dependencies. We start with a strategic shift from probabilistic concepts to interactions between functions and random variables. Consider the RKHS $\mathcal{H}$ of functions $h : \mathcal{X} \to \mathbb{R}$ with feature map $\phi$ as introduced in Section 2, and assume that $\mathcal{H}$ is large enough to allow for the approximation of arbitrary indicator functions $\chi\{Z' \in A'\}$, $A'$ a measurable subset of $\mathcal{X}$, in the $L^2$-pseudo-norm, for any $\mathcal{X}$-valued random variable $Z'$.
Observe that if

$$E(h(Z)\,g(S)) = E(h(Z))\,E(g(S)) \qquad (2)$$

for all $h \in \mathcal{H}$, $g \in L^2$, then $Z$ and $S$ are, indeed, independent. This is because $h$ and $g$ can be used to approximate arbitrary indicator functions. In particular, for any measurable subset $A'$ of $\mathcal{X}$ and $B'$ of $\mathcal{S}$ there exist functions $h$ and $g$ such that

$$P(\{Z \in A'\} \cap \{S \in B'\}) \approx E(h(Z)\,g(S)) = E(h(Z))\,E(g(S)) \approx P(Z \in A')\,P(S \in B').$$

This means that the independence constraint of the optimization problem of (1) translates to (2). Note that using RKHS elements as test functions is a common approach for detecting dependencies and is used in the MMD criterion (e.g. (Gretton et al., 2008)).

On the other hand, due to the reproducing property of the kernel of $\mathcal{H}$, we can also rewrite the constraint (2) as

$$E(\langle h, \phi(Z) \rangle\, g(S)) = E\langle h, \phi(Z) \rangle\, E(g(S)). \qquad (3)$$

Observe that $\phi(Z)$ is a random variable that attains values in a low-dimensional manifold; if the kernel function is continuous and $\mathcal{X} = \mathbb{R}^d$, then the image $\phi[\mathcal{X}]$ of $\mathcal{X}$ under $\phi$ is a $d$-dimensional manifold, which we denote in the following by $M$. In Figure 1 this manifold is visualized as the blue curve. Therefore, while Equation (3) is linear in $\phi(Z)$, depending on the shape of the manifold it can lead to an arbitrarily complex optimization problem. We propose to relax (3) by moving away from the manifold, replacing $\phi(Z)$ with a random variable $\mathbf{Z} : \Omega \to \mathcal{H}$ which potentially has all of $\mathcal{H}$ as its range. This simplifies the original optimization problem to one over a vector space under a linear constraint. To formalize the problem, we rely on a notion of H-independence introduced below.

Definition 1 (H-Independence) We say $\mathbf{Z} \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ and $S : \Omega \to \mathcal{S}$ are H-independent if, and only if, for all $h \in \mathcal{H}$ and all bounded measurable $g : \mathcal{S} \to \mathbb{R}$ it holds that

$$E(\langle h, \mathbf{Z} \rangle\, g(S)) = E\langle h, \mathbf{Z} \rangle\, E(g(S)).$$

Thus, instead of solving for $Z : \Omega \to \mathcal{X}$ in (1), we seek a solution to the following optimization problem.

Problem 1 Find $\mathbf{Z} \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ that is H-independent from $S$ (in the sense of Definition 1) and is close to $X$ in the sense that $\|\mathbf{Z} - \phi(X)\|_2 \le \|\mathbf{Z}' - \phi(X)\|_2$ for all $\mathbf{Z}'$ which are also H-independent of $S$.

Observe that the H-independence constraint imposed by Problem 1 ensures that all non-linear predictions based on $\mathbf{Z}$ are uncorrelated with the sensitive features $S$. The setting is summarized in Figure 1(a).

Projection onto $M$. If $\mathbf{Z}$ lies in the image of $\phi$ and $\mathcal{H}$ is a large RKHS, then H-independence also implies complete independence between the estimator $\langle \hat{h}, \mathbf{Z} \rangle$ and $S$. To see this, assume that there exists a random variable $W : \Omega \to \mathcal{X}$ such that $\mathbf{Z} = \phi(W)$ and that the RKHS is characteristic. Since for any $f \in \mathcal{H}$ and bounded measurable $g : \mathcal{S} \to \mathbb{R}$

$$E(f(W)\,g(S)) = E(\langle f, \mathbf{Z} \rangle\, g(S)) = E\langle f, \mathbf{Z} \rangle\, E(g(S)) = E(f(W))\,E(g(S)),$$

we can deduce that $W$ and $S$ are independent. Moreover, since $\mathbf{Z}$ is a function of $W$, it is also independent of $S$. In general, $\mathbf{Z}$ will not be representable as some $\phi(W)$ and there can be dependencies between $\langle \hat{h}, \mathbf{Z} \rangle$ and $S$. However, if $\mathbf{Z}$ attains values close to the manifold $M$, then we can find a random variable $W$ such that $\phi(W)$ is close to $\mathbf{Z}$, and the dependence between $\phi(W)$ and $S$ is controlled by how close $\mathbf{Z}$ is to the manifold. More exactly, we allow $\mathbf{Z}$ to be translated by an $h' \in \mathcal{H}$ before measuring the distance. This is important because the manifold itself can lie away from the origin while the $\mathbf{Z}$ we construct in Section 5 lies around the origin. The distance we consider is

$$d^2(\mathbf{Z} + h', M) := E\Big(\inf_{g \in M} \|\mathbf{Z} + h' - g\|^2\Big),$$

the average Hilbert-space distance between $\mathbf{Z} + h'$ and the manifold.
Observe that the expectation on the right-hand side is well defined when $M$ is compact, since we can then replace $M$ with a countable dense subset of $M$. Showing that a suitable $W$ exists is not trivial; the difficulty is that for values that $\mathbf{Z}$ might attain in $\mathcal{H}$ there can be many points on the manifold closest to that value, and selecting points on the manifold in a way that makes the random variable $W$ well defined needs a result on measurable selections. The following proposition makes use of such a selection and guarantees the existence of a suitable $W$, i.e. it states that there exists a random variable $W$ such that $\phi(W)$ achieves the minimal distance to $\mathbf{Z} + h'$.

Proposition 1 Consider $\mathbf{Z} \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$, assume that the kernel function is continuous and (strictly) positive-definite, and that $\mathcal{X}$ is compact. For any $h' \in \mathcal{H}$ there exists a $\sigma(\mathbf{Z})$-measurable random variable $W$ which attains values in $\mathcal{X}$ such that $\phi(W) \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ and

$$\|\mathbf{Z} + h' - \phi(W)\|_2 = d(\mathbf{Z} + h', M).$$

The proof is provided in Appendix C.1. We will call such a variable $W$ provided by the proposition a projection of $\mathbf{Z}$ on $M$. The variable $W$ can be approximated algorithmically for a given $\mathbf{Z}$ and $h'$ (see Appendix E.3). Furthermore, $\phi(W)$ is a good approximation of $\phi(X)$ whenever $\mathbf{Z}$ is, as

$$\|\phi(W(\omega)) - \phi(X(\omega))\| \le \|\phi(W(\omega)) - \mathbf{Z}(\omega)\| + \|\mathbf{Z}(\omega) - \phi(X(\omega))\| \le 2\,\|\mathbf{Z}(\omega) - \phi(X(\omega))\|,$$

where we used that $\phi(W(\omega))$ is closest to $\mathbf{Z}(\omega)$ on $M = \phi[\mathcal{X}]$. Therefore, $\|\phi(W) - \phi(X)\|_2 \le 2\,\|\mathbf{Z} - \phi(X)\|_2$.

4. Bounding the dependencies

A common approach to quantifying the dependence between random variables is to consider $|P(A \cap B) - P(A)P(B)|$, where $A$ and $B$ run over suitable families of events. In our setting, these families are the $\sigma$-algebras $\sigma(\mathbf{Z})$ (or, alternatively, $\sigma(W)$) and $\sigma(S)$, and the difference between $P(A \cap B)$ and $P(A)P(B)$, $A \in \sigma(\mathbf{Z})$ or $A \in \sigma(W)$, $B \in \sigma(S)$, quantifies the dependence between the random variables $\mathbf{Z}$ and $S$, and $W$ and $S$, respectively. Upper bounds on the absolute difference of these two quantities are related to the notion of $\alpha$-dependence which underlies $\alpha$-mixing. In time-series analysis, mixing conditions like $\alpha$-mixing play a significant role since they provide means to control temporal dependencies (see, e.g., (Bradley, 2007; Doukhan, 1994)). The aim of this section is to show how the notion of H-independence is related to the dependence between the random variables. In particular, Proposition 2 below states a bound on the dependence between $W$ and $S$ in terms of the distance of $\mathbf{Z}$ to the manifold $M$. Furthermore, if $\mathbf{Z}$ and $\phi(W)$ are closely coupled, in the sense that there exists a constant $c$ such that for any event $A_1 \in \sigma(\mathbf{Z})$ there exists an event $A_2 \in \sigma(W)$ fulfilling $P(A_1 \,\triangle\, A_2) \le c$, then the dependence between $\mathbf{Z}$ and $S$ can also be bounded. For the bound to be useful we want a small value of $c$ for which the above holds; e.g. if we let $c = 1$ then the above holds trivially but the bound we provide below becomes vacuous. In this context, observe that $W$, as constructed above, is a function of $\mathbf{Z}$ and we know that $\sigma(W) \subseteq \sigma(\mathbf{Z})$. However, the opposite inclusion is not guaranteed to hold.

Coming back to bounding the dependence between $W$ and $S$: the high-level idea is that H-independence would correspond to normal independence if we had function evaluations $h(\mathbf{Z})$ instead of inner products $\langle h, \mathbf{Z} \rangle$ (given that $\mathcal{H}$ is sufficient to approximate indicator functions). While generally there is no such expression for the inner product, we know that for $\phi(W)$ we actually have the equivalence $\langle h, \phi(W) \rangle = h(W)$ due to the reproducing property of the kernel function.
In contrast to $\mathbf{Z}$, the random variable $\phi(W)$ does not need to be H-independent of $S$. However, if $\mathbf{Z} + h'$ and $\phi(W)$ are not too far from each other in the $\|\cdot\|_2$-norm, then $\phi(W)$ will be approximately H-independent of $S$ and we can say something about the dependence between $W$ and $S$. Therefore, the bound below is stated in terms of $\|\mathbf{Z} + h' - \phi(W)\|_2$, which is equal to the distance between $\mathbf{Z} + h'$ and $M$, and a measure of how well indicator functions can be approximated. More specifically, the bound is controlled by the functional

$$\psi(A') = \inf_{f \in \mathcal{H}} 2\,\|\chi_{A'}(W) - f(W)\|_2 + \|f\|\, d(\mathbf{Z} + h', M), \qquad (4)$$

where $A' \in \{W[C] : C \in \sigma(W)\}$ and $f$ has to balance between approximating the indicator function and keeping $\|f\|\, d(\mathbf{Z} + h', M)$ small. The functional $\psi$ has a natural interpretation as the minimal error that can be achieved in a regularized interpolation problem. If $\mathcal{H}$ lies dense in a certain space, then any relevant indicator can in principle be approximated arbitrarily well. This is not saying that $\psi(A')$ will be small, since the norm of the element that approximates the indicator might be large. But the approximation error, which is $\|\chi_{A'}(W) - f(W)\|_2$, can be made arbitrarily small. With this notation in place, the proposition is as follows.

Proposition 2 Consider a $\mathbf{Z} \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ which is H-independent from $S$, suppose that the kernel function is continuous and (strictly) positive-definite, and that $\mathcal{X}$ is compact. Let $W$ be a projection of $\mathbf{Z}$ on $M$. For any $A \in \sigma(W)$ and $B \in \sigma(S)$, with $A' = W[A]$ being the image of $A$ under $W$, the following holds:

$$|P(A \cap B) - P(A)P(B)| \le \psi(A').$$

Furthermore, for $A \in \sigma(\mathbf{Z})$, if $c > 0$ is such that $\mathcal{B}_A = \{W[C] : C \in \sigma(W),\ P(C \,\triangle\, A) \le c\}$ is non-empty, then for any $B \in \sigma(S)$,

$$|P(A \cap B) - P(A)P(B)| \le 2c + \inf_{D \in \mathcal{B}_A} \psi(D).$$

The proof is provided in Appendix C.2. Intuitively, as visualized in Figure 1, the proposition states that if $\mathbf{Z}$ mostly attains values in the gray area then the dependence between $W$ and $S$ is low and, if $W$ and $\mathbf{Z}$ are strongly coupled, then the dependence between $\mathbf{Z}$ and $S$ is also low.

4.1 Estimating $\psi(A')$

The key quantity in Proposition 2 is $\psi(A')$. To control $\psi(A')$ it is necessary to control how well the RKHS can approximate indicators, and to estimate the distance $d(\mathbf{Z} + h', M)$. The former problem is more difficult and might be approached using the theory of interpolation spaces; we do not try to develop the necessary theory here but only mention a simple result on denseness at the end of this section. On the other hand, the latter problem is easy to deal with: the distance $d(\mathbf{Z} + h', M)$ between $\mathbf{Z} + h'$ and $M$ can be estimated efficiently. In the case where the space $\mathcal{X}$ is compact and $\phi$ is a continuous function, we propose an empirical estimate of $d(\mathbf{Z} + h', M)$ given by

$$d_n(\mathbf{Z} + h', M) := \frac{1}{n} \sum_{i=1}^{n} \min_{g \in M} \|\mathbf{Z}_i + h' - g\|, \qquad (5)$$

where the $\mathbf{Z}_i$, $i \le n$, $n \in \mathbb{N}$, are $n$ independent copies of $\mathbf{Z}$. Note that the compactness of $\mathcal{X}$ together with the continuity of $\phi$ make the min operator in (5) well-defined.
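The minimum over $M$ in (5) can be approximated numerically, e.g. by a grid search over $\mathcal{X}$ when $\mathcal{X}$ is a compact interval (the paper's Appendix E.3 describes the actual algorithm; the sketch below, including the toy coefficient representation of the $\mathbf{Z}_i$, is our hedged illustration). It uses that an element $\sum_l c_l \phi(a_l)$ has squared distance $c^\top K c - 2\sum_l c_l k(a_l, x) + k(x, x)$ to $\phi(x)$:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma*(a_i - b_j)^2), scalar inputs."""
    return np.exp(-gamma * (np.asarray(a)[:, None] - np.asarray(b)[None, :]) ** 2)

def dist_to_manifold(c, anchors, grid, gamma=1.0):
    """Distance from the RKHS element sum_l c[l]*phi(anchors[l]) to the
    manifold {phi(x) : x in X}, approximating the min over M by a grid."""
    K_aa = rbf(anchors, anchors, gamma)
    K_ag = rbf(anchors, grid, gamma)
    # ||sum_l c_l phi(a_l) - phi(x)||^2 = c'K c - 2 c'k(., x) + k(x, x)
    sq = c @ K_aa @ c - 2.0 * (c @ K_ag) + 1.0   # k(x, x) = 1 for the RBF kernel
    return np.sqrt(np.maximum(sq.min(), 0.0))

# d_n(Z + h', M): average distances of n sample representations Z_i + h',
# each assumed here to be given by a (toy) coefficient vector over `anchors`.
rng = np.random.default_rng(2)
anchors = rng.uniform(-1, 1, size=8)
grid = np.linspace(-1.0, 1.0, 201)           # dense grid over compact X
C = rng.normal(scale=0.3, size=(50, 8))      # toy coefficient vectors
d_n = np.mean([dist_to_manifold(c, anchors, grid) for c in C])
print("empirical distance d_n ~", d_n)
```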
Proposition 3 Consider a $\mathbf{Z} \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ which is H-independent from $S$, suppose that the kernel function is continuous and (strictly) positive-definite, and that $\mathcal{X}$ is compact. Let $\rho = \max_{x \in \mathcal{X}} \|\phi(x)\| < \infty$. For any $h' \in \mathcal{H}$ with $\|h'\| \le \rho$ and every $\epsilon > 0$ we have

$$\Pr\big(|d_n(\mathbf{Z} + h', M) - d(\mathbf{Z} + h', M)| \ge \epsilon\big) \le 2\exp\big(-2n\epsilon^2/c_\rho\big)$$

for a constant $c_\rho$ that depends only on $\rho$. The proof is provided in Appendix C.3.

Going back to the approximation error $\|\chi_{A'}(W) - f(W)\|_2$, where $A' \subseteq \mathcal{X}$ is the image under $W$ of some set $C \in \sigma(W)$ and $f \in \mathcal{H}$, we would like to mention the following: let $\nu = P \circ W^{-1}$ be the push-forward measure of $P$ under $W$. If $\mathcal{H}$ lies dense in $L^2(\mathcal{X}, \mathcal{B}, \nu)$, then for any such $A'$ and any $\epsilon > 0$ there exists a function $f$ such that $\|\chi_{A'}(W) - f(W)\|_2 < \epsilon$, i.e. for the measurable set $A'$ there exists a function $f \in \mathcal{H}$ such that

$$\int (\chi_{A'}(W) - f(W))^2 \, dP = \int (\chi_{A'}(x) - f(x))^2 \, d\nu(x) < \epsilon^2,$$

using (Fremlin, 2001, Theorem 235Gb). In many cases the continuous functions $C(\mathcal{X})$ lie dense in $L^2(\mathcal{X}, \mathcal{B}, \nu)$ and a universal RKHS $\mathcal{H}$ is sufficient to approximate such indicators (see (Sriperumbudur et al., 2011)).

5. Best H-independent features

In this section we discuss how to obtain $\mathbf{Z}$ as a closed-form solution to Problem 1. To this end, inspired by the sub-problem in the linear case, we obtain $\mathbf{Z}$ using Hilbert-space-valued conditional expectations. We further show that these features are H-independent of $S$ and that $\mathbf{Z}$ is the best H-independent approximation of $\phi(X)$.

In the linear case discussed in the Introduction, it turned out that $Z = X - \mathbb{E}^S X + \mathbb{E}X$ is a good candidate for the new features $Z$. In the Hilbert-space-valued case a similar result holds. The main difference is that here we have to work with Hilbert-space-valued conditional expectations. For any random variable $\phi(X) \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ and any $\sigma$-subalgebra $\mathcal{B}$ of $\mathcal{A}$, the conditional expectation $\mathbb{E}^{\mathcal{B}}\phi(X)$ is defined and is again an element of $\mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$. We are particularly interested in conditioning with respect to the sensitive random variable $S$. In this case, $\mathcal{B}$ is chosen as $\sigma(S)$, the smallest $\sigma$-subalgebra which makes $S$ measurable, and we denote this conditional expectation by $\mathbb{E}^S\phi(X)$. A natural choice for the new features is

$$\mathbf{Z} = \phi(X) - \mathbb{E}^S\phi(X) + E(\phi(X)). \qquad (6)$$

The expectation $E(\phi(X))$ is to be interpreted as the Bochner integral of $\phi(X)$ with respect to the measure $P$. Importantly, if $S$ and $\phi(X)$ are independent then $\mathbb{E}^S\phi(X) = E(\phi(X))$, so with this choice $\mathbf{Z} = \phi(X)$ and we are back in the standard kernel setting. Also, if $\phi(X) \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ then so is $\mathbf{Z}$.

Figure 2: Data from two different settings. In the left two plots $X = S + U$, where $S$ and $U$ are independent, $S$ is uniformly distributed on $[0, 1]$ and $U$ is uniformly distributed on $[-1/2, 1/2]$; the test function is the quadratic function $h_1(x) = x^2$. The leftmost plot shows $h_1(X)$ against $S$, and the plot to its right shows a centered version of $h_1$, namely $\langle h_1, \mathbf{Z} \rangle$, plotted against $S$. Similarly for the right two plots, with the difference that $S$ is uniformly distributed on $[0, 2\pi]$, $U$ is uniformly distributed on $[0, \pi/2]$ and the test function is $h_2(x) = \sin(x)$. The red curves show the best regression curves, predicting $h_1(X)$ and $h_2(X)$ using $S$.

We can verify that the features $\mathbf{Z}$ are, in fact, H-independent of $S$. In particular, for any $h \in \mathcal{H}$ and $g \in L^2$,

$$E(\langle \phi(X) - \mathbb{E}^S\phi(X), h \rangle\, g(S)) = \big\langle E(\phi(X)\,g(S)) - E\big((\mathbb{E}^S\phi(X))\,g(S)\big), h \big\rangle = \big\langle E(\phi(X)\,g(S)) - E\big(\mathbb{E}^S(\phi(X)\,g(S))\big), h \big\rangle = 0.$$

Since $E(\phi(X))$ is a constant, this implies that

$$E(\langle \mathbf{Z}, h \rangle\, g(S)) = E(h(X))\,E(g(S)).$$

A similar argument shows that $E\langle \mathbf{Z}, h \rangle = E(h(X))$. Thus, $\mathbf{Z}$ is H-independent of $S$.

In Figure 2 the effect of the move from $\phi(X)$ to $\mathbf{Z}$ is visualized. In the figure, $S$ is plotted against $h_1(X)$ and $h_2(X)$ (blue dots), where $h_1$ corresponds to the quadratic function and $h_2$ to the sine function. The dependencies between $h_1(X)$ and $S$, as well as $h_2(X)$ and $S$, are high and there is a clear trend in the data. The two red curves correspond to the best regression functions, using $S$ to predict $h_1(X)$ and $h_2(X)$. The relation between the new features and $S$ is shown in the other two plots (gray dots). In the case of $h_1$, one can observe that the dependence between $\langle h_1, \mathbf{Z} \rangle$ and $S$ is much smaller and, by the design of $\mathbf{Z}$, $\langle h_1, \mathbf{Z} \rangle$ and $S$ are uncorrelated. Similarly for $\langle h_2, \mathbf{Z} \rangle$, where the dependence on $S$ appears even lower and it is difficult to visually verify any remaining dependence between $S$ and $\langle h_2, \mathbf{Z} \rangle$.
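The centering visualized in Figure 2 can be reproduced with a few lines of code; the sketch below (our illustration, not the paper's implementation) uses a simple binned estimate of $E(h_1(X) \mid S)$, anticipating the plug-in estimator of Section 6:

```python
import numpy as np

rng = np.random.default_rng(3)

# Setting of the left panels of Figure 2: X = S + U with S ~ U[0,1],
# U ~ U[-1/2, 1/2], and the test function h1(x) = x^2.
n = 5000
S = rng.uniform(0.0, 1.0, size=n)
U = rng.uniform(-0.5, 0.5, size=n)
X = S + U
h1 = X ** 2

# Plug-in conditional expectation of h1(X) given S, using a dyadic
# partition of [0, 1] into 16 cells.
bins = np.floor(S * 16).astype(int).clip(0, 15)
cond_mean = np.array([h1[bins == b].mean() for b in range(16)])[bins]

centered = h1 - cond_mean + h1.mean()   # <h1, Z> with estimated expectations
print("corr(h1(X), S)  =", np.corrcoef(h1, S)[0, 1])        # strong trend
print("corr(<h1,Z>, S) =", np.corrcoef(centered, S)[0, 1])  # approximately 0
```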
An interesting aspect of this transformation from $X$ to $\mathbf{Z}$ is that $\mathbf{Z}$ is automatically uncorrelated with $S$ for all functions $h$ in the corresponding RKHS, without the need to ever explicitly consider a particular $h$. Besides being H-independent of $S$, these new features $\mathbf{Z}$ also closely approximate our original features $\phi(X)$ if the influence of $S$ is not too strong; i.e. the mean squared distance is

$$E(\|\phi(X) - \mathbf{Z}\|^2) = E(\|\mathbb{E}^S\phi(X) - E(\phi(X))\|^2),$$

which is equal to zero if $X$ is independent of $S$. In fact, $\mathbf{Z}$ is the best approximation of $\phi(X)$ in the mean squared sense under the H-independence constraint. This is essentially a property of the conditional expectation, which corresponds to an orthogonal projection in $L^2(\Omega, \mathcal{A}, P; \mathcal{H})$. We summarize this property in the following result.

Proposition 4 Given $\phi(X), \mathbf{Z}' \in \mathcal{L}^2(\Omega, \mathcal{A}, P; \mathcal{H})$ such that $\mathbf{Z}'$ is H-independent of $S$, then

$$E(\|\phi(X) - \mathbf{Z}\|^2) \le E(\|\phi(X) - \mathbf{Z}'\|^2),$$

where $\mathbf{Z} = \phi(X) - \mathbb{E}^S\phi(X) + E(\phi(X))$. Furthermore, $\mathbf{Z}$ is the unique minimizer (up to almost sure equivalence).

The proof is provided in Appendix C.4.

Change in predictions. When replacing $\phi(X)$ by $\mathbf{Z}$ we lose information (we reduce the influence of the sensitive features). An interesting question to ask is: how much does the reduction in information change our predictions? A simple way to bound the difference in predictions is as follows. Consider any $h \in \mathcal{H}$, for instance corresponding to a regression function; then

$$|h(X) - \langle h, \mathbf{Z} \rangle| \le \|h\|\, \|\phi(X) - \mathbf{Z}\| = \|h\|\, \|\mathbb{E}^S\phi(X) - E(\phi(X))\|,$$

where $\|\mathbb{E}^S\phi(X) - E(\phi(X))\|$ effectively measures the influence of $S$. Hence, the difference in prediction is upper bounded by the norm of the predictor (here $h$) times a quantity that measures the dependence between $S$ and $\phi(X)$.

Example. To demonstrate that the effect of the move from $X$ to $Z$ can be profound, we consider the following fundamental example: suppose that $X$ and $S$ are standard normal random variables with covariance $c \in [-1, 1]$, and consider the linear kernel $k(x, y) = xy$, $x, y \in \mathbb{R}$. In this case $\phi(X) = X$ and $\mathbb{E}^S X = cS$ is also normally distributed (see (Bertsekas and Tsitsiklis, 2002, Sec. 4.7)). Hence, $Z = X - \mathbb{E}^S X + \mathbb{E}(X)$ is normally distributed and $E(ZS) = c - c\,E(S^2) = 0$. This implies that $Z$ and $S$ are, in fact, fully independent, regardless of how large the dependence between the original features $X$ and the sensitive features $S$ may be. In the case where $X$ and $S$ are fully dependent, i.e. $X = aS$ for some $a \in \mathbb{R}$, the features $Z$ are equal to zero and do not approximate $X$. Next, consider a polynomial kernel of second order such that the quadratic function $h(x) = x^2$ lies within the corresponding RKHS. The inner product between this $h$ and $\mathbf{Z}$ is equal to $X^2 - \mathbb{E}^S X^2 + E(X^2)$ and is not independent of $S$. Hence, the kernel function affects the dependence between $\mathbf{Z}$ and $S$. Also, within the same RKHS there lie linear functions, and for any linear function $h$ it holds that $\langle \mathbf{Z}, h \rangle$ is independent of $S$. Therefore, within the same RKHS we can have directions in which $\mathbf{Z}$ is independent of $S$ and directions in which both variables are dependent.
6. Generating oblivious features from data

To be able to generate the features $\mathbf{Z}$ we need to first estimate the conditional expectation $\mathbb{E}^S\phi(X)$ from data. To this end, we devise a plug-in approach. After introducing this approach in Section 6.1, we discuss how the estimation errors of the plug-in estimator can be controlled in Section 6.2. In Section 6.3 we show how the oblivious features can be generated. Finally, in Section 6.4, we demonstrate how the approach can be applied to statistical problems, and we discuss relations to the approach of Donini et al. (2018) in Section 6.5.

6.1 Plug-in estimator

A common method for estimation is the plug-in approach, whereby an unknown probability measure is replaced by the empirical measure. This approach is used in (Grünewälder, 2018) for deriving estimators of conditional expectations. To see how the approach can be generalized to our setting, first observe that we can write

$$\mathbb{E}^S\phi(X) = g \circ S \quad \text{almost surely}, \qquad (7)$$

where $g : \mathcal{S} \to \mathcal{H}$ is a Bochner-measurable function (see Appendix A and Lemma 2 for details). Our aim is to estimate this function $g$ from i.i.d. observations $\{(X_i, S_i)\}_{i \le n}$. For any subset $B$ of the range space $\mathcal{S}$ of the sensitive features, define the empirical measure $P_n(S \in B) = (1/n)\sum_{i=1}^{n} \delta_{S_i}(B)$, where $\delta_{S_i}$ is the Dirac measure with mass one at location $S_i$. We define an estimate of the conditional expectation of $\phi(X)$, given that the sensitive variable falls into a set $B$, by

$$E_n(\phi(X) \mid S \in B) = \frac{1}{n\,P_n(S \in B)} \sum_{i=1}^{n} \phi(X_i)\,\delta_{S_i}(B)$$

when $P_n(S \in B) > 0$, and through $E_n(\phi(X) \mid S \in B) = 0$ otherwise. Observe that for $h \in \mathcal{H}$ we have

$$\Big\langle h,\ \frac{1}{n\,P_n(S \in B)} \sum_{i=1}^{n} \phi(X_i)\,\delta_{S_i}(B) \Big\rangle = \frac{1}{n\,P_n(S \in B)} \sum_{i=1}^{n} h(X_i)\,\delta_{S_i}(B).$$

We can also write this as $\langle h, E_n(\phi(X) \mid S \in B) \rangle = E_n(h(X) \mid S \in B)$. An estimate of the conditional expectation given $S$ is provided by

$$\mathbb{E}^S_n\phi(X) = \sum_{B \in \mathbb{S}} E_n(\phi(X) \mid S \in B)\,\chi\{S \in B\},$$

where $\mathbb{S}$ is a finite partition of the range space $\mathcal{S}$ of $S$. A common choice for $\mathbb{S}$, if $\mathcal{S}$ is the hypercube $[0, 1]^d$, $d \ge 1$, are the dyadic sets. Observe that we can move inner products inside the conditional expectation $\mathbb{E}^S_n\phi(X)$, so that $\langle h, \mathbb{E}^S_n\phi(X) \rangle = \mathbb{E}^S_n h(X)$, where $\mathbb{E}^S_n h(X)$ is the empirical conditional expectation introduced in (Grünewälder, 2018).
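Since every quantity of interest is of the form $E_n(h(X) \mid S \in B)$ for evaluations $h(X_i)$, the plug-in estimator reduces to cell-wise averaging. The following sketch (our illustration; the partition representation and all names are assumptions, not the paper's API) implements this, including the convention that the estimate is $0$ on empty cells:

```python
import numpy as np

def empirical_cond_exp(h_vals, S_train, partition, s_query):
    """Plug-in estimate E_n(h(X) | S in B), evaluated at the cell B of the
    partition that contains s_query. h_vals[i] = h(X_i); `partition` is a
    list of indicator functions B(s) -> bool forming a finite partition."""
    for B in partition:
        if B(s_query):
            mask = np.array([B(s) for s in S_train])
            if mask.any():
                return h_vals[mask].mean()
            return 0.0  # convention: estimate is 0 if the cell is empty
    raise ValueError("s_query not covered by the partition")

# Dyadic partition of S = [0, 1] into 4 cells (last cell closed at 1).
partition = [
    (lambda lo, hi: (lambda s: lo <= s < hi))(i / 4, (i + 1) / 4 + (i == 3))
    for i in range(4)
]

rng = np.random.default_rng(4)
S_train = rng.uniform(0, 1, size=200)
X_train = S_train + rng.normal(scale=0.1, size=200)
h_vals = np.sin(X_train)     # evaluations h(X_i) for a test function h

print(empirical_cond_exp(h_vals, S_train, partition, s_query=0.3))
```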
6.2 Controlling the estimation error

The estimation error when estimating $\mathbb{E}^S\phi(X)$ using $\mathbb{E}^S_n\phi(X)$ is relatively easy to control thanks to the plug-in approach. Essentially, standard results concerning the empirical measure carry over to conditional expectation estimates in the real-valued case (Grünewälder, 2018), and through scalarization we can transfer some of these results straight away to the Hilbert-space-valued case. For instance,

$$\|E_n(\phi(X) \mid S \in B) - E(\phi(X) \mid S \in B)\| = \sup_{\|h\| \le 1} |\langle E_n(\phi(X) \mid S \in B) - E(\phi(X) \mid S \in B), h \rangle| = \sup_{\|h\| \le 1} |E_n(h(X) \mid S \in B) - E(h(X) \mid S \in B)|,$$

and bounds on the latter term are known. Similarly,

$$\|\mathbb{E}^S_n\phi(X) - \mathbb{E}^S\phi(X)\| = \sup_{\|h\| \le 1} |\mathbb{E}^S_n h(X) - \mathbb{E}^S h(X)|. \qquad (8)$$

However, both $\mathbb{E}^S_n\phi(X)$ and $\mathbb{E}^S\phi(X)$ are random variables, and a useful measure of their difference is the $L^2$-pseudo-norm. The $L^2$-pseudo-norm should in this case not be taken with respect to $P$ itself but conditional on the training sample. Hence, for i.i.d. pairs $(X, S), (X_1, S_1), \ldots, (X_n, S_n)$, let $\mathcal{F}_n = \sigma(X_1, S_1, \ldots, X_n, S_n)$ and define the conditional $L^2$-pseudo-norm by

$$\|\mathbb{E}^S_n\phi(X) - \mathbb{E}^S\phi(X)\|_{2,n}^2 = \mathbb{E}^{\mathcal{F}_n}\|\mathbb{E}^S_n\phi(X) - \mathbb{E}^S\phi(X)\|^2.$$

Substituting Equation (8) in shows that this expression is equal to $\mathbb{E}^{\mathcal{F}_n}\big(\sup_{\|h\| \le 1} |\mathbb{E}^S_n h(X) - \mathbb{E}^S h(X)|^2\big)$. The supremum cannot be taken out of the conditional expectation; however, by writing $\mathbb{E}^S_n h(X)$ and $\mathbb{E}^S h(X)$ as simple functions (see Appendix A.1), we can get around this difficulty and control the error in $\|\cdot\|_{2,n}$. We demonstrate this in the following by deriving rates of convergence for two cases: the case where $\mathcal{S}$ is finite, and the case where $\mathcal{S}$ is the unit cube in $\mathbb{R}^d$ for some $d \ge 1$ and $S$ has a density that is bounded away from zero.

To derive these rates we rely, among other things, on the convergence of the empirical process uniformly over families of functions related to the unit ball of $\mathcal{H}$ and partitions of $\mathcal{S}$. For instance, in the case where $\mathcal{S}$ is finite we need to assume that

$$\mathcal{H}_S := \{(h \circ \pi_1)\,\chi(\mathcal{X} \times \{s\}) : h \in \mathcal{H},\ \|h\| \le 1,\ s \in \mathcal{S}\},$$

as a family of real-valued functions on $\mathcal{X} \times \mathcal{S}$, is a $P$-Donsker class. The function $\pi_1 : \mathcal{X} \times \mathcal{S} \to \mathcal{X}$ is here the projection onto the first argument, i.e. $\pi_1(x, s) = x$. For the definition of $P$-Donsker classes see (Dudley, 2014; Giné and Nickl, 2016). There are various ways to verify this condition in concrete settings. For example, if $\mathcal{H}$ is a finite-dimensional RKHS then $\mathcal{H}_S$ is a $P$-Donsker class under a mild measurability assumption. This follows from a few simple arguments: any finite-dimensional space of functions is a VC-subgraph class (Giné and Nickl, 2016, Ex. 3.6.11); this implies directly that $\{(h \circ \pi_1)\,\chi(\mathcal{X} \times \{s\}) : h \in \mathcal{H}, \|h\| \le 1\}$ is a VC-subgraph class for every $s \in \mathcal{S}$. Furthermore, finite unions of VC-subgraph classes are again a VC-subgraph class; under a mild measurability assumption it now follows from (Dudley, 2014, Cor. 6.19) that $\mathcal{H}_S$ is a $P$-Donsker class. There are obviously other ways to prove this statement. In particular, one might use that the unit ball of $\mathcal{H}$ is a universal Donsker class (see (Dudley, 2014; Giné and Nickl, 2016) for details) when the kernel function is continuous and $\mathcal{X}$ is compact (this also holds when $\mathcal{H}$ is infinite-dimensional): due to Marcus (1985), the unit ball of a Hilbert space is a universal Donsker class if $\sup_{x \in \mathcal{X}} |h(x)| \le c\,\|h\|$ for some constant $c$ that does not depend on $h$. If the kernel function is bounded, $c = \sup_{x \in \mathcal{X}} \sqrt{k(x, x)}$ witnesses that this property holds.

Case 1: finitely many sensitive features. Our first proposition states that the estimator converges at the optimal rate $n^{-1/2}$ when $\mathcal{S}$ is finite and $\mathcal{H}_S$ is a $P$-Donsker class.

Proposition 5 Given a finite space $\mathcal{S}$ and a $P$-Donsker class $\mathcal{H}_S$, it holds that $\|\mathbb{E}^S_n\phi(X) - \mathbb{E}^S\phi(X)\|_{2,n} \in O_P(n^{-1/2})$.

The proof is given in Appendix C.5.

Case 2: $[0, 1]^d$-valued sensitive features. We extend Proposition 5 to the case where $S$ is not confined to taking finitely many values. In order to state the result, we introduce the following notation. Set $\mathcal{S} := [0, 1]^d$ for some $d \in \mathbb{N}$ and let $g : \mathcal{S} \to \mathcal{H}$ be such that with probability one $\mathbb{E}^S\phi(X) = g \circ S$ (which is possible by Lemma 2). Consider a discretization of $\mathcal{S}$ into dyadic cubes $\Delta_1, \Delta_2, \ldots, \Delta_{\ell^d}$ of side-length $1/\ell$ for some $\ell \in \mathbb{N}$. Define $C_\ell := \{\mathcal{X} \times \Delta_i : i \le \ell^d\}$ and let

$$\mathcal{H}_C := \{h\,\chi_D : h \in \mathcal{H},\ \|h\| \le 1,\ D \in \textstyle\bigcup_{\ell \in \mathbb{N}} C_\ell\}.$$

Proposition 6 Suppose that the push-forward measure $\mu := P \circ S^{-1}$ has a density $u$ with respect to the Lebesgue measure $\lambda$ on $\mathcal{S}$ with the property that $\inf_{s \in \mathcal{S}} u(s) \ge b$ for some $b > 0$. Assume that $g$ is $L$-Lipschitz continuous and that $\mathcal{H}_C$ is a $P$-Donsker class. We have $\|\mathbb{E}^S_n\phi(X) - \mathbb{E}^S\phi(X)\|_{2,n} \in O_P(n^{-\frac{1}{2(d+1)}})$.

The proof is given in Appendix C.6.

6.3 Generating an oblivious random variable

Given a data-point $(X, S)$ composed of non-sensitive and sensitive features $X$ and $S$ respectively, we can generate an oblivious random variable $\mathbf{Z}$ as

$$\mathbf{Z} := \phi(X) - \mathbb{E}^S_n\phi(X) + E_n(\phi(X)). \qquad (9)$$

Most kernel methods work with the kernel matrix and do not need access to the features themselves. The same holds in our setting. More specifically, we never need to represent $\mathbf{Z}$ explicitly in the Hilbert space but only require inner-product calculations.
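To see explicitly that only kernel evaluations are required, one can expand the inner products. The following is a sketch under the simplifying assumption that the same sample is used for the cell means and for the grand mean (the paper instead estimates these quantities on a held-out half of the data; see below). Writing $B(i)$ for the partition cell containing $S_i$ and $n_B = \#\{l : S_l \in B\}$, set

$$m_B = \frac{1}{n_B} \sum_{l : S_l \in B} \phi(X_l), \qquad m = \frac{1}{n} \sum_{l=1}^{n} \phi(X_l), \qquad \mathbf{Z}_i = \phi(X_i) - m_{B(i)} + m.$$

Then $\langle \mathbf{Z}_i, \mathbf{Z}_j \rangle = \langle \phi(X_i) - m_{B(i)} + m,\ \phi(X_j) - m_{B(j)} + m \rangle$ expands into nine inner products, each of which reduces to sums of kernel evaluations; for instance,

$$\langle \phi(X_i), m_{B(j)} \rangle = \frac{1}{n_{B(j)}} \sum_{l : S_l \in B(j)} k(X_i, X_l), \qquad \langle m_{B(i)}, m \rangle = \frac{1}{n_{B(i)}\, n} \sum_{l : S_l \in B(i)} \sum_{r=1}^{n} k(X_l, X_r).$$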
In order to calculate the empirical estimates of the conditional expectation $\mathbb{E}^S_n\phi(X)$ and of $E_n(\phi(X))$ in (9), we consider a simple approach whereby we split the training set into two subsets of size $n$ and use half of the observations to obtain the empirical estimates of the expectations. The remaining $n$ observations are used to obtain an oblivious predictor; we have two cases, as follows.

Case 1 (M-Oblivious). The standard kernel matrix $K$ is calculated with the remaining $n$ observations and a kernel method is applied to $K$ to obtain a predictor $g$. When applying the predictor to a new unseen data-point $(X, S)$, we first transform $X$ into $\mathbf{Z}$ via (9) and calculate the prediction as $\langle g, \mathbf{Z} \rangle$. As discussed in the Introduction, we conjecture that this approach is suitable in the case where the labels $Y$ are conditionally independent of the sensitive features $S$ given the non-sensitive features $X$, i.e. when $S, X, Y$ form a Markov chain $S \to X \to Y$. As such, we call this approach M-Oblivious.

Case 2 (Oblivious). Instead of calculating the kernel matrix $K$, an oblivious kernel matrix

$$O = \begin{pmatrix} \|\mathbf{Z}_1\|^2 & \cdots & \langle \mathbf{Z}_1, \mathbf{Z}_n \rangle \\ \vdots & \ddots & \vdots \\ \langle \mathbf{Z}_n, \mathbf{Z}_1 \rangle & \cdots & \|\mathbf{Z}_n\|^2 \end{pmatrix} \qquad (10)$$

is calculated by applying Equation (9) to the remaining training samples $(X_i, S_i)$ before taking inner products. The oblivious matrix is then passed to the kernel method to gain a predictor $g$. The matrix is positive semi-definite since $a^\top O a = \|\sum_{i=1}^{n} a_i \mathbf{Z}_i\|^2 \ge 0$ for any $a \in \mathbb{R}^n$. The complexity of computing the matrix is $O(n^2)$ (see Appendix E for details on the algorithm). Prediction for a new unseen data-point $(X, S)$ is done in the same way as in Case 1.

6.4 Oblivious ridge regression

In this section we showcase our approach in the context of kernel ridge regression. We have three relevant random variables, namely the non-sensitive features $X$, the sensitive features $S$ and the labels $Y$, which are real-valued. We assume that we have $2n$ i.i.d. observations $\{(X_i, S_i, Y_i)\}_{i \le 2n}$. We use the observations $n + 1, \ldots, 2n$ to generate the oblivious random variables $\mathbf{Z}_i$ and then use the oblivious data $\{(\mathbf{Z}_i, Y_i)\}_{i \le n}$ for oblivious ridge regression (ORR). The ORR problem has the following form. Given a positive definite kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, a corresponding RKHS $\mathcal{H}$ and oblivious features $\mathbf{Z}_i$, our aim is to find a regression function $h \in \mathcal{H}$ such that the mean squared error between $\langle h, \mathbf{Z} \rangle$ and $Y$ is small. Replacing the mean squared error by the empirical least-squares error and adding a regularization term for $h$ gives us the optimization problem

$$\hat{h} = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{n} (\langle h, \mathbf{Z}_i \rangle - Y_i)^2 + \lambda \|h\|^2, \qquad (11)$$

where $\lambda > 0$ is the regularization parameter. It is easy to see that the setting is not substantially different from standard kernel ridge regression, and one can derive a closed-form solution for $\hat{h}$. More specifically, we have a representer theorem in this setting which tells us that the minimizer lies in the span of $\mathbf{Z}_1, \ldots, \mathbf{Z}_n$. One can then solve the optimization problem in the same way as for standard kernel ridge regression; see Appendix D for details. The solution to the optimization problem is $\hat{h} = \sum_{i=1}^{n} \alpha_i \mathbf{Z}_i$, where $\alpha = (O + \lambda I)^{-1} y$ and the vector $y$ is given by $(Y_1, \ldots, Y_n)^\top$. Predicting $Y$ for a new observation $(X, S)$ is achieved by first generating the oblivious features $\mathbf{Z}$ (see Appendix E.2) and then evaluating $\langle \mathbf{Z}, \hat{h} \rangle = \sum_{i=1}^{n} \alpha_i \langle \mathbf{Z}, \mathbf{Z}_i \rangle$.
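A compact end-to-end sketch of ORR follows (our illustration, with the simplifying assumption that the conditional and unconditional means are estimated on the same sample rather than on a held-out half, and with hypothetical helper names). It builds the oblivious Gram matrix as $O = CKC^\top$ for a centering matrix $C$ and solves $\alpha = (O + \lambda I)^{-1} y$:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d)

def oblivious_gram(K, cells):
    """Oblivious kernel matrix O with O[i, j] = <Z_i, Z_j>, where
    Z_i = phi(X_i) - m_{B(i)} + m and cells[i] is the index of the
    partition cell of S containing S_i (same-sample simplification)."""
    n = K.shape[0]
    P = np.zeros((n, n))                       # P[i, l] = weight of phi(X_l)
    for c in np.unique(cells):
        idx = np.where(cells == c)[0]
        P[np.ix_(idx, idx)] = 1.0 / len(idx)   # cell means m_{B(i)}
    C = np.eye(n) - P + np.ones((n, n)) / n    # Z_i = sum_l C[i, l] phi(X_l)
    return C @ K @ C.T                         # positive semi-definite

rng = np.random.default_rng(5)
n = 400
S = rng.uniform(-5, 5, size=n)
X = (0.5 * rng.uniform(-5, 5, size=n) + 0.5 * S)[:, None]
y = X[:, 0] ** 2 + rng.normal(scale=np.sqrt(0.1), size=n)

cells = np.digitize(S, np.linspace(-5, 5, 17)[1:-1])  # 16 dyadic cells
K = rbf_gram(X, X, gamma=1.0)
O = oblivious_gram(K, cells)
lam = 1.0                                      # illustrative choice
alpha = np.linalg.solve(O + lam * np.eye(n), y)  # ORR coefficients
print("training MSE:", np.mean((O @ alpha - y) ** 2))
```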
6.5 Comparison to (Donini, Oneto, Ben-David, Shawe-Taylor, and Pontil, 2018)

Our focus in this paper is on generating features that are less dependent on the sensitive features than the original non-sensitive features. However, the conditional expectation $\mathbb{E}^S\phi(X)$, which is at the heart of our approach, also features prominently in methods that add constraints to SVM classifiers. In particular, in (Donini et al., 2018) a constraint is used to achieve approximately equal opportunity in classification where the sensitive feature is binary. While their approach does not make explicit use of conditional expectations, one can recognize that the key object in their approach (Eq. (13) in (Donini et al., 2018)) is, in fact, closely related to our conditional expectation when used in the case where $S$ can attain only two values (say $\mathcal{S} = \{0, 1\}$). In detail, their optimization problem is constrained by enforcing, for a given $\epsilon > 0$, that the solution $h^* \in \mathcal{H}$ fulfills

$$|E_n(h^*(X) \mid S = 0) - E_n(h^*(X) \mid S = 1)| \le \epsilon. \qquad (12)$$

Considering $\mathbf{Z} = \phi(X) - \mathbb{E}^S\phi(X) + E(\phi(X))$, we can observe right away that in this setting, for all $h \in \mathcal{H}$,

$$E(\langle h, \mathbf{Z} \rangle \mid S = 0) = E(\langle h, \mathbf{Z} \rangle \mid S = 1).$$

To see this, observe that $\mathbb{E}^S\langle h, \mathbf{Z} \rangle$ is almost surely equal to $E(h(X))$. In other words,

$$E(\langle h, \mathbf{Z} \rangle \mid S = 0)\,\chi\{S = 0\} + E(\langle h, \mathbf{Z} \rangle \mid S = 1)\,\chi\{S = 1\} = \mathbb{E}^S\langle h, \mathbf{Z} \rangle$$

is almost surely constant. Unless $P(S = 0) = 0$ or $P(S = 1) = 0$, this implies that $E(\langle h, \mathbf{Z} \rangle \mid S = 0) = E(\langle h, \mathbf{Z} \rangle \mid S = 1)$. Hence, for the max-margin classifier $h^*$ and $\mathbf{Z}$ it holds that $E(\langle h^*, \mathbf{Z} \rangle \mid S = 0) = E(\langle h^*, \mathbf{Z} \rangle \mid S = 1)$, and on the population level our new features $\mathbf{Z}$ guarantee that constraint (12) is automatically fulfilled.

Figure 3: Binary classification error vs. $\tilde\beta$-dependence between prediction and sensitive features for three different methods: classical Linear SVM, Linear FERM, and Oblivious SVM. In Figure 3a the error is calculated with respect to the observed labels, which are intrinsically biased, and in Figure 3b the error is calculated with respect to the true fair classification rule.

7. Empirical evaluation

In this section we report our experimental results for classification and regression. Our objective in the classification experiment is to point out an important property of supervised learning problems where sensitive features affect both the non-sensitive features and the labels: the estimation error with respect to the observed labels can be misleading as a quality measure. The aim is much rather to predict values in an unbiased fashion. The first experiment highlights this difference by considering a synthetic data set for which we know the unbiased labels (though the true unbiased labels are not available to the methods). We measure the dependencies between the predicted values and the sensitive features, and compare against a standard SVM and against FERM. The second set of experiments aims to investigate how dependencies between sensitive and non-sensitive features affect ORR and M-ORR. We investigate this relationship by considering a family of synthetic problems for which we can adjust the dependency between the features using a parameter $\gamma$. In this set of experiments we are also concerned with clarifying the relationship between ORR and M-ORR, where the latter is the M-Oblivious version of KRR (see Section 6.3). Our implementation can be found at the following repository: https://github.com/azalk/Oblivious.git.

7.1 Binary Classification

We carried out an experiment to mimic a scenario where a class of students should normally receive grades between 0 and 5, and anyone with a grade above a fixed threshold $\theta = 2$ should pass. Half of the class, representing a "minority group", are disadvantaged in that their grades are almost systematically reduced, while the other half receive a boost on average.
More specifically, let the sensitive feature $S$ be a $\{0, 1\}$-valued Bernoulli random variable with parameter $0.5$, and let $X_0$ be distributed according to a truncated normal distribution with support $[1, 4]$. Let the non-sensitive feature $X$, representing a student's grade, be given by

$$X := (X_0 - B)\,\chi\{S = 0\} + (X_0 + B)\,\chi\{S = 1\},$$

where $B$ is a Bernoulli random variable with parameter $0.9$, independent of $X_0$ and of $S$. The label $Y$ is defined as a noisy decision influenced by the student's original grade $X_0$ prior to the $S$-based modification. More formally, let $U$ be a random variable independent of $X_0$ and of $S$, and uniformly distributed on $[0, 1]$. Let $Y_0 := \chi\{U \le X_0\}$ and define $Y := Y_0\,\chi\{X + S \ge \theta\}$.

Classification Error. In a typical classification problem the labels $Y$ depend on both $X$ and $S$, so when we remove the bias it is not clear what we should compare against when calculating the classification performance. Observe that our experimental construction here allows access to the true ground-truth labels

$$Y^* := \chi\{X_0 \ge \theta\}. \qquad (13)$$

Therefore, we are able to calculate the true (unbiased) errors as well. However, this is not always the case in practice. In fact, we argue that the question of how to evaluate fair classification performance is an important open problem which has yet to be addressed.

Measure of Dependence. Let $\mathcal{F}_n := \sigma(X_1, \ldots, X_n, S_1, \ldots, S_n)$, $n \in \mathbb{N}$, be the $\sigma$-algebra generated by the training samples. In this experiment, we measure the dependence between the predicted labels $\hat{Y}$ produced by any algorithm and the sensitive features $S$ as

$$\tilde\beta(\hat{Y}, S) := \frac{1}{2} \sum_{y \in \{0,1\}} \sum_{s \in \{0,1\}} E\,\big|P(\hat{Y} = y, S = s \mid \mathcal{F}_n) - P(\hat{Y} = y \mid \mathcal{F}_n)\,P(S = s)\big|, \qquad (14)$$

which is closely related to the $\beta$-dependence (see, e.g., (Bradley, 2007, vol. I, p. 67)) between their respective $\sigma$-algebras. We obtain an empirical estimate of $\tilde\beta(\hat{Y}, S)$ by simply replacing the probabilities in (14) with the corresponding empirical frequencies.

Synthetic Experiment. We generated $n = 1000$ training and test samples as described above, and the errors reported for each experiment are averaged over 10 repetitions. Figure 3 shows binary classification error vs. dependence between prediction and sensitive features for three different methods: classical Linear SVM, Linear FERM, and Oblivious SVM. In Figure 3a the error is calculated with respect to the observed labels, which are intrinsically biased, and in Figure 3b the error is calculated with respect to the true fair classification rule $Y^*$ given by (13). As can be seen in the plots, the true classification error of Oblivious SVM is smaller than that of the other two methods. Moreover, in both plots the $\tilde\beta$-dependence between the predicted labels produced by Oblivious SVM and the sensitive feature is close to 0 and is much smaller than that of the other two methods.
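Replacing the probabilities in (14) with empirical frequencies yields a simple estimator; the sketch below is a minimal illustration for binary $\hat{Y}$ and $S$, and all names are ours:

```python
import numpy as np

def beta_dependence(y_hat, s):
    """Empirical estimate of the beta-type dependence of Eq. (14) between
    binary predictions y_hat and a binary sensitive feature s, with all
    probabilities replaced by empirical frequencies."""
    y_hat, s = np.asarray(y_hat), np.asarray(s)
    total = 0.0
    for y in (0, 1):
        for v in (0, 1):
            p_joint = np.mean((y_hat == y) & (s == v))
            p_y, p_s = np.mean(y_hat == y), np.mean(s == v)
            total += abs(p_joint - p_y * p_s)
    return 0.5 * total

rng = np.random.default_rng(6)
s = rng.integers(0, 2, size=1000)
biased = (rng.random(1000) < 0.3 + 0.4 * s).astype(int)   # depends on s
fair = rng.integers(0, 2, size=1000)                       # independent of s
print(beta_dependence(biased, s), beta_dependence(fair, s))
```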
Real-World Experiment. We evaluated the performance of our method on the so-called Adult dataset, a benchmark dataset publicly available in the UCI Machine Learning Repository (Dua and Graff, 2017). The dataset consists of more than 40K data-points, each composed of 14 features including an individual's age, education, marital status, gender, occupation, and capital gain/loss. The objective is to predict whether an individual's income exceeds \$50K per year. Table 1 outlines the result of our comparison against standard SVM and the FERM algorithm of Donini et al. (2018), with gender marked as a sensitive feature. We used the repository provided by Donini et al. (2018) to extract the data and run the experiments corresponding to both standard SVM and FERM. The "smaller" option was used in data extraction, giving a total of 1628 training and 12661 testing data-points respectively. Both linear and Gaussian (RBF) kernels were considered. The Gaussian kernel was parametrized by $\gamma > 0$, i.e. $k(x, x') = e^{-\gamma\|x - x'\|^2}$. The hyperparameters were selected using 5-fold cross-validation: the SVM regularization parameter $C$ was varied over $2^{-4}, 2^{-3}, \ldots, 2^{4}$, and $\gamma$ was selected from $\{0.001, 0.01, 0.1, 1\}$. As can be observed in the table, our Oblivious (linear and non-linear) SVM reduces the $\tilde\beta$-dependence between the label and the sensitive feature as desired, while incurring a minimal increase in Mean Prediction Error. On the other hand, FERM achieves the lowest DEO (see Donini et al. (2018)) for the linear kernel, while in the case of the RBF kernel our model provides the highest reduction in DEO.

| Algorithm | Mean Prediction Error | DEO | $\tilde\beta$-Dependence | Relative $\tilde\beta$-Dependence |
|---|---|---|---|---|
| SVM (Linear) | 0.20 | 0.05 | 0.06 | 100% |
| FERM (Linear) | 0.20 | 0.03 | 0.07 | 116% |
| Oblivious (Linear) | 0.21 | 0.05 | 0.02 | 33% |
| SVM (RBF) | 0.17 | 0.22 | 0.14 | 100% |
| FERM (RBF) | 0.17 | 0.12 | 0.16 | 114% |
| Oblivious (RBF) | 0.19 | 0.03 | 0.06 | 43% |

Table 1: Comparison against standard SVM and FERM of Donini et al. (2018) on the Adult benchmark dataset (Dua and Graff, 2017), with gender marked as the sensitive feature. Our algorithm reduces the $\tilde\beta$-dependence between the label and the sensitive feature by 67% in the case of linear SVM and by 57% for non-linear SVM, while only incurring a minimal increase in Mean Prediction Error in both cases. FERM achieves the lowest DEO for the linear kernel, while our model provides the highest DEO reduction in the case of the RBF kernel.

7.2 Ridge-Regression

In this section we compare ORR with KRR and with the Markov version of ORR, M-ORR, which applies the KRR solution to oblivious test features $\mathbf{Z}$. We use an RBF kernel with $\sigma = 1$. We are particularly interested in how the dependence of $X$ on $S$ affects the performance, and in a comparison of ORR to M-ORR. We use synthetic data in order to be able to control the dependence between $S$ and $X$. The basic data-generating process is as follows. Sensitive features $S$ and non-sensitive features $U$ are sampled independently from a uniform distribution with support $[-5, 5]$. The features $X$ are a convex combination of these two, of the form $X = \gamma U + (1 - \gamma) S$, $\gamma \in [0, 1]$. We consider two ways to generate the response variable $Y$. In Experiment 1, the response variable is $Y = X^2 + \epsilon$, where $\epsilon$ is normally distributed with variance $0.1$ and is independent of $U$ and $S$. In this case $S \to X \to Y$ forms a Markov chain and we expect M-ORR to do well. In Experiment 2, the variable $S$ influences $Y$ also directly, and not only through $X$, i.e. $Y = X^2 + S^2 + \epsilon$. We use $S^2$ here instead of $S$ because $S^2$ is not a zero-mean random variable and cannot simply be subsumed into the noise term. Figure 4 shows the results of these experiments.

Figure 4: Plots 4a and 4b correspond to Ridge Regression Experiments 1 and 2 respectively. In both plots, the performance of three estimators (KRR, ORR and M-ORR) is plotted against $\gamma$, where $\gamma$ controls the dependence of $X$ on $S$. The case $\gamma = 0$ corresponds to the highest dependence, while $\gamma = 1$ corresponds to the case in which $X$ is independent of $S$.
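The data-generating process of the two experiments is easy to reproduce; a minimal sketch (our illustration, names ours):

```python
import numpy as np

def make_data(n, gamma, experiment=1, rng=None):
    """Synthetic data of Section 7.2: X = gamma*U + (1-gamma)*S with
    S, U ~ U[-5, 5] independent. Experiment 1: Y = X^2 + eps (Markov
    chain S -> X -> Y); Experiment 2: Y = X^2 + S^2 + eps (S also
    influences Y directly)."""
    rng = rng or np.random.default_rng()
    S = rng.uniform(-5, 5, size=n)
    U = rng.uniform(-5, 5, size=n)
    X = gamma * U + (1 - gamma) * S
    eps = rng.normal(scale=np.sqrt(0.1), size=n)   # noise with variance 0.1
    Y = X**2 + eps if experiment == 1 else X**2 + S**2 + eps
    return X, S, Y

X, S, Y = make_data(500, gamma=0.5, experiment=2, rng=np.random.default_rng(7))
print(X[:3], Y[:3])
```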
In these experiments, $\gamma$ varies over $0, 0.1, 0.2, \ldots, 1$. For each value of $\gamma$ we generate 500 data points for ORR and M-ORR to infer the conditional expectations, and a further 500 data points are used by all three methods to calculate the ridge regression solution. For simplicity, we fixed a partition for the conditional expectation: the set $\mathcal{S} = [-5, 5]$ is split into a dyadic partition consisting of 16 sets. Each method uses a validation set of 100 data points (which are different from the 500 training data points) to select the regularization parameter $\lambda$ from $2^{-5}, 2^{-4}, \ldots, 2^{5}$. A test set of size 100 is used to calculate the mean squared error (MSE). For each $\gamma$ the experiment is repeated 20 times. Figure 4 reports the average MSE and the standard deviation of the MSE over these 20 experiments.

We make the following observations from Figure 4a. KRR is the best estimator, as it uses the features $X$ directly and not the new features $\mathbf{Z}$. As $\gamma \to 1$, both the ORR and M-ORR estimators approach the KRR estimator, since the effect of $S$ on $X$ vanishes. Both estimators do not quite reach the performance of the KRR estimator. This is due to the additional uncertainty introduced by estimating the conditional expectations. By definition, the ORR estimator will achieve the best fit of the training data given the new features $\mathbf{Z}$. We can observe that the M-ORR estimator performs as well as the ORR estimator, even though the M-ORR estimator uses the KRR solution and applies it to $\mathbf{Z}$. This is due to the fact that $S \to X \to Y$ forms a Markov chain. Finally, when $\gamma = 0$, both the M-ORR and ORR estimators achieve an MSE that is very close to the best MSE that can be achieved by a regressor that generates values which are independent of $S$: assume that some new features $Z$ are given which are a function of $X$ and are independent of $S$. When $\gamma = 0$, this random variable $Z$ can only be independent of $S$ if $Z$ is a constant. However, if $Z$ is a constant then the ridge regressor using $Z$ is also a constant $c$, and the MSE

$$E(Y - c)^2 = E(S^4) - 2c\,E(S^2) + c^2 + 0.1$$

is minimized for $c = E(S^2)$. The minimal value is approximately 56.2, which is very close to the values of the ORR and M-ORR estimators.

Figure 4b shares a few characteristics with Figure 4a, as follows. For $\gamma = 0$, both M-ORR and ORR attain an MSE that is close to the best possible (in the above sense), which is approximately equal to 224.8. As before, KRR is the overall best estimator and ORR is the best estimator among those using the features $\mathbf{Z}$. Furthermore, as $\gamma \to 1$, both estimators become close to the KRR solution. A crucial difference in this experiment is that $S, X, Y$ no longer forms a Markov chain, and the performance of M-ORR is worse than that of ORR for values of $\gamma$ between 0.2 and 0.8. The performance of M-ORR and ORR is essentially the same for $\gamma = 0$ and $\gamma = 1$. This is not surprising, given that when $\gamma = 0$ we have $Y = 2S^2 + \epsilon$ and we are back in the Markov chain setting, while when $\gamma = 1$ the feature $X$ is already independent of $S$.

8. Discussion

We have introduced a novel approach to derive oblivious features which approximate non-sensitive features well while maintaining only minimal dependence on sensitive features. We make use of Hilbert-space-valued conditional expectations and estimates thereof; our plug-in estimators in this case can be of independent interest in future research, and in turn open grounds for interesting questions involving their guarantees.
8. Discussion

We have introduced a novel approach to derive oblivious features which approximate the non-sensitive features well while maintaining only minimal dependence on the sensitive features. We make use of Hilbert-space-valued conditional expectations and estimates thereof; our plug-in estimators can be of independent interest in future research, and in turn open grounds for interesting questions involving their guarantees. The application of our approach to kernel methods is facilitated by an oblivious kernel matrix which we have derived to be used in place of the original kernel matrix. We characterize the dependencies between the oblivious and the sensitive features in terms of how close the sensitive features are to the low-dimensional manifold φ[X]. One may wonder if this relation can be exploited to further reduce dependencies, and potentially achieve complete independence. Another important question concerns the interplay between the estimation errors introduced by estimating conditional expectations and the estimation errors introduced by kernel methods which are applied to the oblivious data.

Appendix A. Probability in Hilbert spaces: elementary results

In this section we summarize a few elementary results, used in the paper, concerning random variables that attain values in a separable Hilbert space H.

A.1 Measurable functions

There are three natural definitions of what it means for a function f : Ω → H to be measurable. Denote the measure space in the following by (Ω, A), with the understanding that these definitions apply, in particular, to Ω = Rᵈ and A the corresponding Borel σ-algebra.

1. f is Bochner-measurable iff f is the point-wise limit of a sequence of simple functions, where S : Ω → H is a simple function if it can be written as S(ω) = Σ_{i=1}^{n} h_i χ_{A_i}(ω) for some n ∈ N, A₁, . . . , A_n ∈ A and h₁, . . . , h_n ∈ H.

2. f is strongly-measurable iff f⁻¹[B] ∈ A for every Borel-measurable subset B of H. The topology used here is the norm-topology.

3. f is weakly-measurable iff for every element h ∈ H the function ⟨h, f⟩ : Ω → R is measurable in the usual sense (using the Borel σ-algebra on R).

All three definitions of measurability are equivalent in our setting. We call a function f : Ω → H a random variable if it is measurable in this sense. The main example in our paper is f = φ(X). This is a well-defined random variable whenever X : Ω → Rᵈ is a random variable and φ : Rᵈ → H is Bochner-measurable. In particular, when the kernel function is continuous, φ is Bochner-measurable.

Appendix B. Hilbert space-valued conditional expectations

B.1 Basic properties

We recall a few important properties of Hilbert space-valued conditional expectations. These often follow from properties of real-valued conditional expectations through scalarization (Pisier, 2016). In the following, let X, Z ∈ L²(Ω, A, P; H) and let B be some σ-subalgebra of A. Due to Pisier (2016, Eq. (1.7)), for any f ∈ H,

⟨f, E_B X⟩ = E_B⟨f, X⟩ (a.s.) (15)

and the right-hand side is just the usual real-valued conditional expectation. It is also worth highlighting that the same holds for the Bochner-integral E(X), i.e. for any f ∈ H, ⟨f, E(X)⟩ = E⟨f, X⟩. This can be used to derive properties of E_B X. For instance, since E(E_B⟨f, X⟩) = E⟨f, X⟩ is a property of real-valued conditional expectations, we find right away that

⟨f, E(X)⟩ = E⟨f, X⟩ = E(E_B⟨f, X⟩) = E⟨f, E_B X⟩ = ⟨f, E(E_B X)⟩.

Because E(X) and E(E_B X) are elements of H and ⟨f, E(X) − E(E_B X)⟩ = 0 for all f ∈ H, it follows that E(X) = E(E_B X).
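As an aside (ours, not part of the paper), the scalarization identity (15) is easy to check numerically in the finite-dimensional case H = Rᵈ with B = σ(S) for a binary S, where conditional expectations reduce to group means; NumPy is assumed:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100_000, 3
    S = rng.integers(0, 2, size=n)                  # B = sigma(S) with S in {0, 1}
    X = rng.normal(size=(n, d)) + 2.0 * S[:, None]  # H-valued random variable, H = R^d

    # E_B X: the conditional expectation given S, realized as group means
    m0, m1 = X[S == 0].mean(axis=0), X[S == 1].mean(axis=0)
    EbX = np.where(S[:, None] == 1, m1, m0)

    f = rng.normal(size=d)                          # an arbitrary element of H
    lhs = EbX @ f                                   # <f, E_B X>, computed pointwise
    fX = X @ f                                      # the scalar variable <f, X>
    rhs = np.where(S == 1, fX[S == 1].mean(), fX[S == 0].mean())  # E_B <f, X>
    print(np.abs(lhs - rhs).max())                  # ~1e-15: the two sides coincide

In this discrete setting the identity is nothing more than linearity of the inner product; the measure-theoretic content of (15) is that the same statement holds for general B and infinite-dimensional H.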
Another result we need is that if Z is B-measurable then E_B⟨X, Z⟩ = ⟨E_B X, Z⟩ (a.s.). Showing this needs a bit more work. Since Z ∈ L²(Ω, B, P; H), there exist B-measurable simple functions U_n such that U_n converges point-wise to Z, lim_n ‖U_n − Z‖₂ = 0, and the sequence fulfills ‖U_n − Z‖ ≤ 3‖Z‖ for all n ∈ N (Pisier, 2016, Prop. 1.2). Consider some n and write U_n = Σ_{i=1}^{m} h_i χ_{A_i} for a suitable m ∈ N, h_i ∈ H and A_i ∈ B; then

E_B⟨X, U_n⟩ = Σ_{i=1}^{m} E_B(⟨X, h_i⟩ χ_{A_i}) = Σ_{i=1}^{m} (E_B⟨X, h_i⟩) χ_{A_i} = Σ_{i=1}^{m} ⟨E_B X, h_i⟩ χ_{A_i} = ⟨E_B X, U_n⟩ (a.s.),

because each χ_{A_i} is B-measurable. For the right-hand side, point-wise convergence of U_n to Z tells us that for all ω ∈ Ω we have lim_n ‖U_n(ω) − Z(ω)‖ = 0. Because E_B X ∈ L²(Ω, A, P; H), we also know that ‖E_B X‖ is finite almost surely. Therefore, for ω in the corresponding co-negligible set,

lim_n |⟨(E_B X)(ω), U_n(ω)⟩ − ⟨(E_B X)(ω), Z(ω)⟩| ≤ lim_n ‖(E_B X)(ω)‖ ‖U_n(ω) − Z(ω)‖ = 0,

and lim_n ⟨E_B X, U_n⟩ = ⟨E_B X, Z⟩ almost surely. By the same argument it follows that lim_n ⟨X, U_n⟩ = ⟨X, Z⟩ almost surely. Let h_n = ⟨X, U_n⟩ and h = ⟨X, Z⟩; then |h_n − h| ≤ ‖X‖ ‖U_n − Z‖ ≤ 3‖X‖ ‖Z‖. Furthermore, |h_n| ≤ |h| + 3‖X‖‖Z‖ ≤ 4‖X‖‖Z‖ ≤ 4(‖X‖² + ‖Z‖²). The right-hand side lies in L¹ and dominates h_n. Using Shiryaev (1989, II.7, Theorem 2(a)), we conclude that lim_n E_B⟨X, U_n⟩ = E_B⟨X, Z⟩ (a.s.), and the result follows. The operator E_B is also idempotent and self-adjoint, i.e. E_B X = E_B(E_B X) (a.s.) and ⟨X, E_B Z⟩₂ = ⟨E_B X, Z⟩₂.

B.2 Representation of conditional expectations

A well-known result in probability theory states that a conditional expectation E_S X of a real-valued random variable X given another real-valued random variable S can be written as g(S) for some suitable measurable function g : R → R. This result generalizes to our setting. Here, we include the generalized result together with a short proof for reference.

Lemma 1 Consider a probability space (Ω, A, P), and let H be a separable Hilbert space. Let S : Ω → Rᵈ be a random variable and suppose that η : Ω → H is a σ(S)-measurable function. Then there exists a Bochner-measurable function g : Rᵈ → H such that η = g ∘ S almost surely.

Proof We first show the statement for simple functions; observing that any arbitrary Bochner-measurable function can be written as the point-wise limit of a sequence of simple functions, we then extend the result to arbitrary η.

First, assume that η := h χ_A for some h ∈ H and A ∈ σ(S). Since S is measurable with respect to B(Rᵈ), there exists some B ∈ B(Rᵈ) such that {ω : S(ω) ∈ B} = A. Define g : Rᵈ → H as g := h χ_B, where χ_B now denotes the indicator function of B on Rᵈ. We obtain η(ω) = h χ_A(ω) = h χ_B(S(ω)), so that η = g ∘ S. Next, let η := Σ_{i=1}^{m} h_i χ_{A_i} for some m ∈ N, h₁, . . . , h_m ∈ H and A₁, . . . , A_m ∈ σ(S). As above, by measurability of S, there exist B₁, . . . , B_m ∈ B(Rᵈ) such that A_i = S⁻¹[B_i], i ∈ 1, . . . , m. It follows that η(ω) = Σ_{i=1}^{m} h_i χ_{A_i}(ω) = Σ_{i=1}^{m} h_i χ_{B_i}(S(ω)) for all ω ∈ Ω; hence, η = g ∘ S for g = Σ_{i=1}^{m} h_i χ_{B_i}. Observe that in both cases g is trivially Bochner-measurable by construction, since it is a simple function.

Now let η : Ω → H be an arbitrary Bochner-measurable function that is also measurable with respect to σ(S). There exists a sequence of simple functions η_n, n ∈ N, such that for every ω ∈ Ω we have η(ω) = lim_n η_n(ω). Since each η_n is a simple function, by our argument above there exists a sequence of Bochner-measurable functions g_n : Rᵈ → H such that η_n = g_n ∘ S, where for each n ∈ N the function g_n is simple, of the form g_n = Σ_{i=1}^{m_n} h_{i,n} χ_{B_{i,n}} for some m_n ∈ N, elements h_{1,n}, . . . , h_{m_n,n} ∈ H and Borel sets B_{1,n}, . . . , B_{m_n,n} ∈ B(Rᵈ). Denote by B := {S(ω) : ω ∈ Ω} ⊆ Rᵈ the image of S, and observe that for each x ∈ B the limit lim_n g_n(x) exists. To see this, note that by construction, for each x ∈ B we have x = S(ω) for some ω ∈ Ω; thus

lim_n g_n(x) = lim_n g_n(S(ω)) = lim_n η_n(ω) = η(ω).

Moreover, we have P(S⁻¹[B]) = P({ω ∈ Ω : S(ω) ∈ B}) = P(Ω) = 1.
Define g : Rᵈ → H as

g(x) := lim_n g_n(x) if x ∈ B, and g(x) := 0 if x ∉ B. (16)

Thus, with probability 1 we have η(ω) = lim_n η_n(ω) = lim_n g_n(S(ω)) = g(S(ω)), so that η = g ∘ S almost surely. On the other hand, since by definition g is the point-wise limit of a sequence of simple functions g_n, it is Bochner-measurable (see Property 1 in Section A.1), and the result follows.

Lemma 2 Consider a separable Hilbert space H, a probability space (Ω, A, P), a Bochner-integrable random variable φ(X) : Ω → H and a random variable S : Ω → Rᵈ. Then there exists a Bochner-measurable function g : Rᵈ → H such that E_S φ(X) = g ∘ S almost surely.

Proof Observing that E_S φ(X) is a σ(S)-measurable function from Ω to H, the result readily follows from Lemma 1.

Appendix C. Proofs

C.1 Proof of Proposition 1

Proof Let M := φ[X] denote the manifold corresponding to the image of X under φ, equipped with the subspace topology and the corresponding Borel σ-algebra B(M). Define the metric projection map π : H → M as the multi-valued function

π(g) = { h ∈ M : ‖h − g‖ = min_{h′∈M} ‖h′ − g‖ }. (17)

Note that the min operator in Equation (17) is well-defined since by definition h = φ(x) for some x ∈ X, the space X is compact and φ is a continuous function. Observe that π is not a function but a multi-valued function, which assigns to each element g ∈ H a subset of M; see, e.g., (Beer, 1993, Section 6.1) for more on this notion.

π maps to non-empty compact subsets of M. For each g ∈ H, set f_g(h) = ‖h − g‖ with h ∈ M, and note that f_g is a continuous function from M to R. Let m(g) := min_{h∈M} ‖h − g‖, which, by the above argument, is well-defined, and observe that, since {m(g)} is a closed subset of R, the set π(g) = f_g⁻¹[{m(g)}] is a closed subset of M. Since M is compact as the continuous image of the compact space X, it follows that π(g) is compact.

π is upper-semicontinuous. As follows from the standard definition, see, e.g., (Beer, 1993, Definition 6.2.4 and Theorem 6.2.5), the multi-valued function π is said to be upper-semicontinuous at a point g₀ ∈ H if for any open subset V of M such that π(g₀) ⊆ V it holds that π(g) ⊆ V for each g in some neighbourhood of g₀ (upper-semicontinuity is also referred to as upper-hemicontinuity for multi-valued functions in the literature). To show the upper-semicontinuity of π we proceed as follows. Take g₀ ∈ H and let V be an open subset of M such that π(g₀) ⊆ V. Denote M̃ := M \ V. Note that M̃ is compact since it is a closed subset of M, which is in turn compact. Therefore, in much the same way as for M, the min operator is well-defined for M̃, i.e. the minimum m̃(g₀) := min_{h∈M̃} ‖g₀ − h‖ exists. Moreover, since π(g₀) ⊆ V and V ∩ M̃ = ∅, it holds that m̃(g₀) > m(g₀). Therefore, there exists some δ > 0 such that m̃(g₀) ≥ δ + m(g₀). Consider an open ball B_{g₀}(δ/3) of radius δ/3 around g₀. For every g ∈ B_{g₀}(δ/3) and all h ∈ π(g₀) we have

‖g − h‖ ≤ ‖g₀ − g‖ + ‖g₀ − h‖ ≤ δ/3 + m(g₀). (18)

On the other hand, we have

min_{h∈M̃} ‖g − h‖ ≥ min_{h∈M̃} ‖g₀ − h‖ − ‖g − g₀‖ ≥ m(g₀) + 2δ/3. (19)

This implies that π(g) ∩ M̃ = ∅, because there are already better candidates (closer to g) in π(g₀), which is in turn contained in V and thus does not intersect M̃. Hence, it must hold that π(g) ⊆ V. Finally, since the choice of g ∈ B_{g₀}(δ/3) is arbitrary, it follows that π(g) ⊆ V for all g ∈ B_{g₀}(δ/3), and π is upper-semicontinuous.

φ is a homeomorphism. To see this, note that φ is bijective and continuous since the kernel is positive definite and continuous: it is by definition surjective, and it is injective since φ(x) = φ(y) for x ≠ y would imply that a₁²‖φ(x)‖² − 2a₁a₂⟨φ(x), φ(y)⟩ + a₂²‖φ(y)‖² = 0 when a₁ = a₂ = 1, contradicting positive definiteness. The statement now follows from (Engelking, 1989, Theorem 3.1.13), since X is compact and M is a Hausdorff space.
Measurable selection. Since π is upper-semicontinuous and maps to compact sets, it is usco-compact (Fremlin, 2001, Definition 422A). This implies that π is measurable as a function from H to the compact subsets of M, where the latter is equipped with the Vietoris topology and the corresponding Borel σ-algebra (Fremlin, 2001, Proposition 5A4Db). Furthermore, there exists a Borel-measurable function f from the compact non-empty subsets of M to M such that f(K) ∈ K for every compact non-empty subset K of M. Define W′ = f(π(Z + h′)); then W = φ⁻¹(W′) is the continuous image of the measurable function W′, and W has the stated properties.

C.2 Proof of Proposition 2

Proof (a) Let W be the random variable provided by Proposition 1 and let W′ = φ(W) − h′. Then ‖Z − W′‖₂ = d(Z + h′, M). Observe that two applications of the Cauchy-Schwarz inequality yield

E(|⟨f, Z⟩ − f(W) + ⟨f, h′⟩| χ_B) = E|⟨f, Z − φ(W) + h′⟩ χ_B| ≤ ‖f‖ E(χ_B ‖Z − W′‖) ≤ √P(B) ‖f‖ ‖Z − W′‖₂

for all f ∈ H. Similarly, for any f ∈ H it holds that E|⟨f, Z − φ(W) + h′⟩| ≤ ‖f‖ E‖Z − W′‖ ≤ ‖f‖ ‖Z − W′‖₂. Noting that Z is H-independent of S, we find that for any f ∈ H and B ∈ σ(S),

|E(f(W) χ_B) − E f(W) P(B)| = |E((f(W) − ⟨f, h′⟩) χ_B) − E(f(W) − ⟨f, h′⟩) P(B)|
≤ |E(⟨f, Z⟩ χ_B) − E⟨f, Z⟩ P(B)| + (1 + √P(B)) ‖f‖ ‖Z − W′‖₂
≤ 2‖f‖ d(Z + h′, M).

(b) For C ∈ σ(W) let D be the image of C under W, i.e. D = W[C] ⊆ X. For f ∈ H let ξ_C(f) = ‖χ_D(W) − f(W)‖₂. Now, for any B ∈ σ(S),

|P(C ∩ B) − E(f(W) χ_B)| ≤ P(B)^(1/2) (E(χ_D(W) − f(W))²)^(1/2) ≤ ξ_C(f).

Moreover, we have |P(C) − E f(W)| ≤ ξ_C(f). Hence, for any f ∈ H it holds that

|P(C ∩ B) − P(C)P(B)| ≤ 2ξ_C(f) + |E(f(W) χ_B) − E f(W) P(B)| ≤ 2(ξ_C(f) + ‖f‖ d(Z + h′, M)).

This proves the first part of the proposition.

(c) For the second part: by assumption, for A ∈ σ(Z) there exists a C ∈ σ(W) such that P(A △ C) ≤ c. For any such C we have |P(C) − P(A)| ≤ P(C △ A) ≤ c and |P(C ∩ B) − P(A ∩ B)| ≤ P((C △ A) ∩ B) ≤ c. Hence,

|P(A ∩ B) − P(A)P(B)| ≤ 2c + 2(ξ_C(f) + ‖f‖ d(Z + h′, M))

for all f ∈ H. Taking the infimum over f and C proves the second part of the proposition.

C.3 Proof of Proposition 3

First note that since φ is continuous and X is compact, it follows that ρ is finite and

‖Z‖ = ‖φ(X) − E_S φ(X) + Eφ(X)‖ ≤ ‖φ(X)‖ + ‖E_S φ(X)‖ + ‖Eφ(X)‖ ≤ ‖φ(X)‖ + E_S‖φ(X)‖ + E‖φ(X)‖, (20)

where (20) follows from (Diestel and Uhl, 1977, Theorem II.4) and (Pisier, 2016, Proposition 1.12). Let Z_i, i ≤ n, be n independent copies of Z and define Y_i := min_{h∈M} ‖Z_i + h′ − h‖, i ≤ n, and Y := min_{h∈M} ‖Z + h′ − h‖. Applying Hoeffding's inequality to the bounded random variables Y₁, . . . , Y_n yields the stated concentration bound, and the result follows.

C.4 Proof of Proposition 4

Proof (a) We first show that

⟨E_S φ(X), Z′⟩₂ = ⟨E(φ(X)), Z′⟩₂. (21)

E_S φ(X) is an element of L²(Ω, σ(S), P; H), and there exists a sequence of σ(S)-measurable simple functions {U_n}_{n∈N} such that lim_n ‖U_n − E_S φ(X)‖₂ = 0. In particular,

lim_n ⟨U_n, Z′⟩₂ = ⟨E_S φ(X), Z′⟩₂. (22)

Furthermore,

‖E(U_n) − E(φ(X))‖ ≤ E‖U_n − E_S φ(X)‖ ≤ ‖U_n − E_S φ(X)‖₂ (23)

goes to zero in n. Consider any U_n = Σ_{i=1}^{m} h_i χ_{A_i}, where h_i ∈ H, A_i ∈ σ(S) and m ∈ N, and observe that

⟨U_n, Z′⟩₂ = Σ_{i=1}^{m} E⟨h_i χ_{A_i}, Z′⟩ = Σ_{i=1}^{m} E(⟨h_i, Z′⟩ χ_{A_i}) = Σ_{i=1}^{m} E⟨h_i, Z′⟩ E(χ_{A_i}),

using the assumption on Z′. The assumption can be applied because χ_{A_i} is σ(S)-measurable and, hence, can be written as a function of S (Shiryaev, 1989, II.4, Theorem 3). Furthermore,

Σ_{i=1}^{m} E⟨h_i, Z′⟩ E(χ_{A_i}) = E⟨Σ_{i=1}^{m} h_i E(χ_{A_i}), Z′⟩ = E⟨E(U_n), Z′⟩,

and ⟨U_n, Z′⟩₂ = ⟨E(U_n), Z′⟩₂. Equation (21) now follows from Equations (22) and (23) together with E(E_S φ(X)) = E(φ(X)).
Combining ⟨E_S φ(X), Z′⟩₂ = ⟨E(φ(X)), Z′⟩₂ and ⟨φ(X), E_S φ(X)⟩₂ = ‖E_S φ(X)‖₂², it readily follows that

‖φ(X) − Z′‖₂² = ‖φ(X) − Z‖₂² + 2⟨E_S φ(X) − E(φ(X)), φ(X) − E_S φ(X) + E(φ(X)) − Z′⟩₂ + ‖Z − Z′‖₂² = ‖φ(X) − Z‖₂² + ‖Z − Z′‖₂².

Hence, Z is a minimizer, and it is almost surely unique because ‖Z − Z′‖₂² is zero only if Z = Z′ almost surely.

C.5 Proof of Proposition 5

Proof (a) In the following, let s₁, . . . , s_l be the values S can attain. Furthermore, let f_i = E_n(φ(X)|S = s_i) − E(φ(X)|S = s_i), and let F = σ(X₁, S₁, . . . , X_n, S_n). Each f_i is F-measurable. Observe that for i ≠ j,

E_F(⟨f_i χ{S = s_i}, f_j χ{S = s_j}⟩) = E_F(⟨f_i, f_j⟩ χ{S = s_i, S = s_j}) = ⟨f_i, f_j⟩ E_F(χ{S = s_i, S = s_j}) = ⟨f_i, f_j⟩ P(S = s_i, S = s_j),

since f_i, f_j are F-measurable and S is independent of F; for i ≠ j this probability is zero, so the cross terms vanish. Hence,

E_F(‖E_S^n φ(X) − E_S φ(X)‖²) = E_F ‖Σ_{i=1}^{l} f_i χ{S = s_i}‖² = Σ_{i=1}^{l} E_F(‖f_i χ{S = s_i}‖²) = Σ_{i=1}^{l} E_F(‖f_i‖² χ{S = s_i}) = Σ_{i=1}^{l} ‖f_i‖² P(S = s_i) = Σ_{i=1}^{l} P(S = s_i) sup_{‖h‖≤1} |E_n(h(X)|S = s_i) − E(h(X)|S = s_i)|².

(b) For each i, either P(S = s_i) = 0 or

sup_{‖h‖≤1} |E_n(h(X)|S = s_i) − E(h(X)|S = s_i)|² ∈ O_P(n⁻¹)

by an argument similar to that given for (Grünewälder, 2018, Proposition 3.1). Since there are only l terms in the sum, this result carries over to the whole sum.

C.6 Proof of Proposition 6

Proof Recall the notation D_ℓ := {Δ_i : i ∈ 1, . . . , ℓᵈ}, ℓ ∈ N, where Δ₁, . . . , Δ_{ℓᵈ} are the dyadic cubes of side-length 1/ℓ discretizing S. Let G := σ({S⁻¹[Δ] : Δ ∈ D_ℓ}) and choose a Bochner-measurable g : S → H according to Lemma 2 such that g(S) = E_S φ(X) (a.s.). Since G ⊆ σ(S) we have

E_G φ(X) = E_G(E_S φ(X)) = E_G(g(S)) almost surely. (24)

In the following, we write g∘S instead of g(S) for readability. With probability one it holds that

E_F ‖g∘S − E_G(g∘S)‖² = Σ_{Δ∈D_ℓ} E_F(‖g∘S − E_G(g∘S)‖² χ{S∈Δ}) = Σ_{Δ∈D_ℓ} E_F ‖(g∘S − E_G(g∘S)) χ{S∈Δ}‖² = Σ_{Δ∈D_ℓ} E_F ‖(g∘S − Σ_{Δ′∈D_ℓ} E(g∘S | S∈Δ′) χ{S∈Δ′}) χ{S∈Δ}‖² = Σ_{Δ∈D_ℓ} E_F(‖g∘S − E(g∘S | S∈Δ)‖² χ{S∈Δ}).

By (Diestel and Uhl, 1977, II, Corollary 8), for any Δ ∈ D_ℓ the conditional expectation of g given Δ lies in the closed convex hull of g[Δ] := {g(s) : s ∈ Δ}, that is,

(1/µ(Δ)) ∫_Δ g dµ ∈ cch(g[Δ]),

where µ denotes the distribution of S. This means that for every ϵ > 0 there exist k ∈ N, points s₁, . . . , s_k ∈ Δ and weights α₁, . . . , α_k > 0 with Σ_{j=1}^{k} α_j = 1 such that

‖(1/µ(Δ)) ∫_Δ g dµ − Σ_{j=1}^{k} α_j g(s_j)‖ ≤ ϵ.

Let D := S⁻¹[Δ]. We obtain

(1/µ(Δ)) ∫_Δ g dµ = (1/P(D)) ∫_D (g∘S) dP = E(g∘S | S∈Δ).

Since g is assumed to be L-Lipschitz-continuous, for all Δ ∈ D_ℓ and s ∈ Δ we have

‖g(s) − Σ_{j=1}^{k} α_j g(s_j)‖² = ‖Σ_{j=1}^{k} α_j (g(s) − g(s_j))‖² ≤ (Σ_{j=1}^{k} α_j ‖g(s) − g(s_j)‖)² ≤ (Σ_{j=1}^{k} α_j sup_{s′∈Δ} ‖g(s′) − g(s_j)‖)² ≤ L² (Σ_{j=1}^{k} α_j sup_{s′∈Δ} ‖s′ − s_j‖)² ≤ d L² ℓ⁻².

Moreover, noting that χ{·} = χ²{·}, we obtain

‖(g∘S − Σ_{j=1}^{k} α_j g(s_j)) χ{S∈Δ}‖ ≤ √d L ℓ⁻¹.

It follows that

‖(g∘S − E(g∘S|S∈Δ)) χ{S∈Δ}‖² = ‖g∘S − E(g∘S|S∈Δ)‖² χ{S∈Δ} ≤ (‖g∘S − Σ_{j=1}^{k} α_j g(s_j)‖ + ‖Σ_{j=1}^{k} α_j g(s_j) − E(g∘S|S∈Δ)‖)² χ{S∈Δ} ≤ (‖g∘S − Σ_{j=1}^{k} α_j g(s_j)‖ + ϵ)² χ{S∈Δ}.

Since this holds for every ϵ > 0, we have

‖(g∘S − E(g∘S|S∈Δ)) χ{S∈Δ}‖² ≤ d L² ℓ⁻².

Observe that for Δ ≠ Δ′ with Δ, Δ′ ∈ D_ℓ,

E_F⟨(g∘S − E(g∘S|S∈Δ)) χ{S∈Δ}, (g∘S − E(g∘S|S∈Δ′)) χ{S∈Δ′}⟩ = 0,

so that

E_F ‖g∘S − E_G(g∘S)‖² = Σ_{Δ∈D_ℓ} E_F(‖g∘S − E(g∘S|S∈Δ)‖² χ{S∈Δ}) = Σ_{Δ∈D_ℓ} E_F(‖g∘S − E(g∘S|S∈Δ)‖² χ²{S∈Δ}) = Σ_{Δ∈D_ℓ} E_F(‖(g∘S − E(g∘S|S∈Δ)) χ{S∈Δ}‖² χ{S∈Δ}).

In particular,

E_F ‖g∘S − E_G φ(X)‖² ≤ d L² ℓ⁻². (25)

On the other hand, in much the same way as in the proof of Proposition 5, we have

E_F(‖E_S^n φ(X) − E_G φ(X)‖²) = Σ_{Δ∈D_ℓ} P(S∈Δ) sup_{‖h‖≤1} |E_n(h(X)|S∈Δ) − E(h(X)|S∈Δ)|².

Let U := (X, S) and define the push-forward measure ν := P∘U⁻¹ of P onto X × S under U. Set ν_n := (1/n) Σ_{i=1}^{n} δ_{(X_i, S_i)}, where δ_{(X,S)} denotes the measure with point mass at (X, S). Define the projection map π : X × S → X which maps a tuple (x, s) ∈ X × S to its first element, so that π((x, s)) = x.
For each h ∈ H with ‖h‖ ≤ 1 and every Δ ∈ D_ℓ we obtain

|E_n(h(X)|S∈Δ) − E(h(X)|S∈Δ)|² = |E_n(h(π(U))|U ∈ X×Δ) − E(h(π(U))|U ∈ X×Δ)|²,

and the conditional expectations on the right-hand side are ratios of integrals of the form ν_n(X×Δ)⁻¹ ∫_{X×Δ} h∘π dν_n and ν(X×Δ)⁻¹ ∫_{X×Δ} h∘π dν. For each ℓ ∈ N define C_ℓ := {X×Δ : Δ ∈ D_ℓ}. By assumption, H_C = {h χ_D : h ∈ H, ‖h‖ ≤ 1, D ∈ ∪_{ℓ∈N} C_ℓ} is P-Donsker, and for D ∈ C_ℓ with D = X×Δ_i, i ≤ ℓᵈ, we have ν(D) = P(S⁻¹[Δ_i]) ≥ b ℓ⁻ᵈ. For a given α ∈ (0, 1/2), let ℓ be n^(α/d), so that ℓ⁻ᵈ = n⁻ᵅ. Similarly to (Grünewälder, 2018, Proposition 3.2), it follows that there exists a constant M such that for all n ≥ 1 and corresponding ℓ,

sup_{‖h‖≤1} sup_{C∈C_ℓ} |∫_C h∘π dν_n − ∫_C h∘π dν| ≤ 2M ℓᵈ n^(−1/2) / b,

and therefore

E_F(‖E_S^n φ(X) − E_G φ(X)‖²) ≤ 4M² ℓ^(2d) / (n b²). (26)

Using Equations (25) and (26) as well as the Cauchy-Schwarz inequality for conditional expectations, we obtain

E_F(‖E_S^n φ(X) − E_S φ(X)‖²) ≤ E_F(‖E_S^n φ(X) − E_G φ(X)‖ + ‖E_G φ(X) − E_S φ(X)‖)² ≤ 4M² ℓ^(2d)/(n b²) + 4M √d L ℓ^(d−1) n^(−1/2)/b + d L² ℓ⁻².

Because ℓ = n^(α/d), the upper bound becomes

4M² n^(2α−1)/b² + 4M √d L n^(α(1−1/d)−1/2)/b + d L² n^(−2α/d).

We claim that the rate of convergence in n is optimized by α∗ = d/(2(d+1)): for α ≥ α∗ we have 2α − 1 ≥ α(1 − 1/d) − 1/2 ≥ −2α/d, and the dominant term 2α − 1 is minimized at α∗. On the other hand, for α ≤ α∗ we have 2α − 1 ≤ −2α/d and α(1 − 1/d) − 1/2 ≤ −2α/d; in this case the dominant term is also minimized at α∗. Therefore, we must set ℓ∗ = n^(α∗/d) = n^(1/(2(d+1))).

Appendix D. Solution to the oblivious kernel ridge regression optimization problem

Define z_i := (⟨Z₁, Z_i⟩, . . . , ⟨Z_n, Z_i⟩)ᵀ, i ∈ 1..n, and observe that O is the matrix with columns z₁, . . . , z_n, i.e. O_{ij} = ⟨Z_i, Z_j⟩. Let f̂ be the minimizer of the regularized least-squares error as given by (11). By the representer theorem there exist scalars α₁, . . . , α_n such that f̂ = Σ_{j=1}^{n} α_j Z_j. It follows that ⟨f̂, Z_i⟩ = Σ_{j=1}^{n} α_j ⟨Z_j, Z_i⟩, so that

Σ_{i=1}^{n} (⟨f̂, Z_i⟩ − Y_i)² + λ‖f̂‖² = (Oα − y)ᵀ(Oα − y) + λ αᵀOα, (27)

where α := (α₁, . . . , α_n)ᵀ and y := (Y₁, . . . , Y_n)ᵀ. Noting that f̂ is the minimizer, and thus setting the gradient of (27) with respect to α to zero, we obtain

∇_α [(Oα − y)ᵀ(Oα − y) + λ αᵀOα] = 2O(Oα − y) + 2λOα = 0.

Solving for α, and noting that O is symmetric (and assumed invertible), we obtain

α = (OO + λO)⁻¹ O y = (O(O + λI))⁻¹ O y = (O + λI)⁻¹ O⁻¹ O y = (O + λI)⁻¹ y.

Appendix E. Algorithms

We discuss three algorithms in this section: an algorithm to calculate the oblivious kernel matrix (Section E.1), an algorithm to calculate ⟨Z, Z_i⟩, which is needed for prediction (Section E.2), and an algorithm to calculate W, the projection of Z_i onto M, which also allows us to estimate the distance between Z and M (Section E.3).

E.1 Calculating the oblivious kernel matrix

We start by deriving the algorithm for calculating the oblivious matrix. The resulting algorithm is summarized in Algorithm 1 below. Throughout we assume that A₁, . . . , A_ℓ is a partition of S and that 2n samples (X_i, S_i) are available. The algorithm splits the data into two parts of size n and uses the samples n+1, . . . , 2n to estimate the conditional expectation. The remaining n samples are then used to generate the features Z_i, i = 1, . . . , n. The features Z_i are never stored explicitly; the only object that is stored is the oblivious matrix O.

Algorithm 1 Generating the oblivious kernel matrix; the sum over an empty index set is treated as 0
Input: data (x₁, s₁), . . . , (x_{2n}, s_{2n}), disjoint sets A₁, . . . , A_ℓ which cover S
set M = Σ_{i=n+1}^{2n} Σ_{j=n+1}^{2n} k(x_i, x_j)/n²
set I_i = ∅, i ∈ 1, . . . , ℓ
for i = n+1 to 2n do
    find the index u such that s_i ∈ A_u
    update I_u ← I_u ∪ {i}
end for
for i = 1 to n do
    set ρ_i = Σ_{u=n+1}^{2n} k(x_i, x_u)/n
    for a = 1 to ℓ do
        set ξ_{i,a} = Σ_{u∈I_a} k(x_i, x_u)/|I_a|
    end for
end for
for a = 1 to ℓ do
    set τ_a = Σ_{u∈I_a} Σ_{v=n+1}^{2n} k(x_u, x_v)/(n|I_a|)
    for b = 1 to ℓ do
        set o_{a,b} = Σ_{u∈I_a} Σ_{v∈I_b} k(x_u, x_v)/(|I_a||I_b|)
    end for
end for
for i = 1 to n do
    for j = i to n do
        set a such that s_j ∈ A_a
        set b such that s_i ∈ A_b
        set O_{i,j} = k(x_i, x_j) − ξ_{i,a} − ξ_{j,b} + o_{a,b} + M + ρ_i + ρ_j − τ_a − τ_b
        set O_{j,i} = O_{i,j}
    end for
end for
Return: O
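For concreteness, the following is a vectorized NumPy transcription of Algorithm 1, together with the ridge solve α = (O + λI)⁻¹y from Appendix D; this is a sketch under our own naming conventions, and it assumes that every partition cell receives at least one estimation sample:

    import numpy as np

    def oblivious_kernel_matrix(K, cell, l):
        """K: (2n, 2n) kernel matrix, K[i, j] = k(x_i, x_j);
        cell: (2n,) ints in {0, ..., l-1}, cell[i] = a iff s_i lies in A_a.
        Samples n..2n-1 estimate the conditional means, samples 0..n-1
        become oblivious features; returns the (n, n) matrix O."""
        m = K.shape[0]
        n = m // 2
        est = np.arange(n, m)                          # estimation half
        I = [est[cell[est] == a] for a in range(l)]    # indices per cell
        M = K[np.ix_(est, est)].sum() / n**2           # <mean, mean>
        rho = K[:n, est].sum(axis=1) / n               # rho_i = <phi(x_i), mean>
        xi = np.stack([K[:n, I[a]].mean(axis=1) for a in range(l)], axis=1)
        tau = np.array([K[np.ix_(I[a], est)].sum() / (n * len(I[a])) for a in range(l)])
        o = np.array([[K[np.ix_(I[a], I[b])].mean() for b in range(l)] for a in range(l)])
        a = cell[:n]                                   # cell of each feature sample
        T = xi[:, a]                                   # T[i, j] = <phi(x_i), m_{cell(j)}>
        return (K[:n, :n] - T - T.T + o[np.ix_(a, a)] + M
                + rho[:, None] + rho[None, :] - tau[a][:, None] - tau[a][None, :])

    def oblivious_ridge_coefficients(O, y, lam):
        """Appendix D closed form: alpha = (O + lam*I)^{-1} y."""
        return np.linalg.solve(O + lam * np.eye(len(y)), y)

The sanity check for the signs is the single-cell case l = 1: there ξ_{i,1} = ρ_i, o_{1,1} = τ_1 = M, and the expression collapses to O = K[:n, :n], as it must, since then Z_i = φ(x_i).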
To calculate the oblivious matrix we only need kernel evaluations. To see this, consider any i ≤ n; then

Z_i = φ(X_i) − E_n^{S_i} φ(X) + E_n(φ(X)) = φ(X_i) − Σ_{u=1}^{ℓ} E_n(φ(X)|S ∈ A_u) χ{S_i ∈ A_u} + E_n(φ(X)).

For u = 1, . . . , ℓ let n_u = Σ_{v=n+1}^{2n} χ{S_v ∈ A_u} be the number of samples with indices within n+1, . . . , 2n that fall into the set A_u. The estimate of the elementary conditional expectation is

E_n(φ(X)|S ∈ A_u) = (1/n_u) Σ_{v=n+1}^{2n} φ(X_v) χ{S_v ∈ A_u},

which attains values in H. Now consider the inner product between Z_i and Z_j, i, j ≤ n:

⟨Z_i, Z_j⟩ = ⟨φ(X_i), φ(X_j)⟩ − ⟨φ(X_i), E_n^{S_j}φ(X)⟩ − ⟨E_n^{S_i}φ(X), φ(X_j)⟩ + ⟨E_n^{S_i}φ(X), E_n^{S_j}φ(X)⟩ + ⟨φ(X_i), E_n(φ(X))⟩ + ⟨E_n(φ(X)), φ(X_j)⟩ − ⟨E_n^{S_i}φ(X), E_n(φ(X))⟩ − ⟨E_n(φ(X)), E_n^{S_j}φ(X)⟩ + ⟨E_n(φ(X)), E_n(φ(X))⟩.

This reduces to calculations involving only the kernel function and no other functions from H. In detail,

⟨φ(X_i), φ(X_j)⟩ = k(X_i, X_j),
⟨φ(X_i), E_n^{S_j}φ(X)⟩ = Σ_{u=1}^{ℓ} ⟨φ(X_i), E_n(φ(X)|S ∈ A_u)⟩ χ{S_j ∈ A_u},
⟨φ(X_i), E_n(φ(X)|S ∈ A_u)⟩ = (1/n_u) Σ_{v=n+1}^{2n} ⟨φ(X_i), φ(X_v)⟩ χ{S_v ∈ A_u} = (1/n_u) Σ_{v=n+1}^{2n} k(X_i, X_v) χ{S_v ∈ A_u}.

The inner product ⟨E_n^{S_i}φ(X), φ(X_j)⟩ can be calculated in the same way. Furthermore,

⟨E_n^{S_i}φ(X), E_n^{S_j}φ(X)⟩ = Σ_{u=1}^{ℓ} Σ_{v=1}^{ℓ} ⟨E_n(φ(X)|S ∈ A_u), E_n(φ(X)|S ∈ A_v)⟩ χ{S_i ∈ A_u, S_j ∈ A_v},
⟨E_n(φ(X)|S ∈ A_u), E_n(φ(X)|S ∈ A_v)⟩ = (1/(n_u n_v)) Σ_{l=n+1}^{2n} Σ_{m=n+1}^{2n} ⟨φ(X_l), φ(X_m)⟩ χ{S_l ∈ A_u, S_m ∈ A_v} = (1/(n_u n_v)) Σ_{l=n+1}^{2n} Σ_{m=n+1}^{2n} k(X_l, X_m) χ{S_l ∈ A_u, S_m ∈ A_v}.

The terms involving E_n(φ(X)) = (1/n) Σ_{i=n+1}^{2n} φ(X_i) reduce in a similar way to kernel evaluations. Combining these calculations leads to Algorithm 1.

E.2 Prediction based on oblivious features

To predict labels for new observations (X, S) in a regression or classification setting, we need to transform (X, S) into an oblivious feature Z. The approach is the same as for the training data. In particular, the conditional expectation estimates E_n^{S_j}φ(X) are needed to transform (X, S) into Z. For kernel methods, Z itself is never calculated explicitly, but it appears in algorithms in the form of inner products ⟨Z, Z_i⟩, where i ≤ n and the Z_i are the oblivious features corresponding to the training set. These inner products can be calculated in exactly the same way as the inner products ⟨Z_i, Z_j⟩ in Section E.1.

E.3 Projecting the oblivious features onto the manifold

The quadratic distance between Z (or, more precisely, Z(ω)) and M in H is equal to

inf_{x∈X} ‖Z − φ(x)‖² = ‖Z‖² + inf_{x∈X} (k(x, x) − 2⟨Z, φ(x)⟩).

The constant ‖Z‖² is of no relevance, and we are looking for a minimum (when this is well-defined) of the function f(x) = k(x, x) − 2⟨Z, φ(x)⟩ over X. Using the conditional expectation E_S^n φ(X) and Z = φ(X) − E_S^n φ(X) + E(φ(X)), we can rewrite f(x) as

f(x) = k(x, x) − 2(k(X, x) − E_S^n k(X, x) + E(k(X, x))).

The function f is (α, L)-Hölder-continuous whenever k(x, ·) is (α, L′)-Hölder-continuous for all x ∈ X, with L = 8L′, since

|f(x) − f(y)| ≤ |k(x, x) − k(y, y)| + 2(|k(X, x) − k(X, y)| + E_S^n |k(X, y) − k(X, x)| + E|k(X, x) − k(X, y)|) ≤ |k(x, x) − k(x, y)| + |k(x, y) − k(y, y)| + 6L′‖y − x‖^α ≤ 8L′‖y − x‖^α.

This property of f is useful because various kernel functions are Hölder-continuous and efficient algorithms are available to optimize Hölder-continuous functions. In particular, there exist classical global optimization algorithms (Vanderbei, 1997) and bandit algorithms (Munos, 2014) for this task. The projection of Z onto M can also be used directly to approximate d_n(Z + h′, M) and, by applying Proposition 3, to estimate d(Z + h′, M).
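As an illustration of this projection step, the following brute-force sketch (ours) evaluates f on a finite grid of candidate points, with a coarse grid search standing in for the Hölder global optimizers cited above; the partition-based conditional expectation estimate of Section E.1 is used, and the cell of the new sensitive value is assumed to be non-empty:

    import numpy as np

    def project_onto_manifold(x_new, s_new, X_est, S_est, cell_of, k, grid):
        """Approximate argmin_x f(x) = k(x, x) - 2<Z, phi(x)> for the oblivious
        feature Z built from the test point (x_new, s_new).
        X_est, S_est : estimation sample (X_{n+1}, ..., X_{2n}) and its S values
        cell_of      : function mapping a sensitive value to its partition cell
        k            : kernel function on X; grid : candidate points to search."""
        cells = np.array([cell_of(s) for s in S_est])
        in_cell = cells == cell_of(s_new)          # samples sharing the cell of s_new

        def inner_Z_phi(x):
            kx = np.array([k(xi, x) for xi in X_est])
            # <Z, phi(x)> = k(x_new, x) - E_n(k(X, x) | S in cell) + E_n(k(X, x))
            return k(x_new, x) - kx[in_cell].mean() + kx.mean()

        f_vals = np.array([k(x, x) - 2.0 * inner_Z_phi(x) for x in grid])
        return grid[int(np.argmin(f_vals))]        # approximate projection W

The minimal value of f found this way, plus ‖Z‖², approximates the squared distance between Z and M, which is the quantity needed to invoke Proposition 3.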
References

G. Beer. Topologies on Closed and Closed Convex Sets, volume 268. Springer Science & Business Media, 1993.

D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. Athena Scientific, 1st edition, 2002.

R.V. Bradley. Introduction to Strong Mixing Conditions, Vols. 1, 2 and 3. Kendrick Press, 2007.

T. Calders, F. Kamiran, and M. Pechenizkiy. Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, 2009.

F. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and K. R. Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, 2017.

J. Diestel and J.J. Uhl. Vector Measures. American Mathematical Society, 1977.

M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, 2018.

P. Doukhan. Mixing: Properties and Examples. Springer Lecture Notes, 1994.

D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

R.M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 2nd edition, 2014.

R. Engelking. General Topology. Heldermann Verlag Berlin, 1989.

D.H. Fremlin. Measure Theory. Torres Fremlin, 2001.

E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016.

A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A.J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 2008.

S. Grünewälder. Plug-in estimators for conditional expectations and probabilities. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.

M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.

M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, 2016.

N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, 2017.

J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In Christos H. Papadimitriou, editor, 8th Innovations in Theoretical Computer Science Conference, volume 67. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017.

M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, 2017.

C. Louizos, K. Swersky, Y. Li, M. Welling, and R. S. Zemel. The variational fair autoencoder. In International Conference on Learning Representations, 2015.

D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, 2018.
D.J. Marcus. Relationships between Donsker classes and Sobolev spaces. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1985.

R. Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 2014.

L. Oneto, M. Donini, G. Luise, C. Ciliberto, A. Maurer, and M. Pontil. Exploiting MMD and Sinkhorn divergences for fair and transferable representation learning. In Advances in Neural Information Processing Systems, 2020.

G. Pisier. Martingales in Banach Spaces. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2016.

A. Shiryaev. Probability. Graduate Texts in Mathematics. Springer, 2nd edition, 1989.

B. K. Sriperumbudur, K. Fukumizu, and G.R.G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 2011.

R.J. Vanderbei. Extension of Piyavskii's algorithm to continuous global optimization. Technical report, Princeton University, 1997.

M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K.P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, 2017.

R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.