Learning with Explanation Constraints

Rattana Pukdee (Carnegie Mellon University) rpukdee@cs.cmu.edu
Dylan Sam (Carnegie Mellon University) dylansam@andrew.cmu.edu
J. Zico Kolter (Carnegie Mellon University, Bosch Center for AI) zkolter@cs.cmu.edu
Maria-Florina Balcan (Carnegie Mellon University) ninamf@cs.cmu.edu
Pradeep Ravikumar (Carnegie Mellon University) pkr@cs.cmu.edu

Equal contribution. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

As larger deep learning models are hard to interpret, there has been a recent focus on generating explanations of these black-box models. In contrast, we may have apriori explanations of how models should behave. In this paper, we formalize this notion as learning from explanation constraints and provide a learning theoretic framework to analyze how such explanations can improve the learning of our models. One may naturally ask: when would these explanations be helpful? Our first key contribution addresses this question via a class of models that satisfies these explanation constraints in expectation over new data. We provide a characterization of the benefits of these models (in terms of the reduction of their Rademacher complexities) for a canonical class of explanations given by gradient information in the settings of both linear models and two layer neural networks. In addition, we provide an algorithmic solution for our framework, via a variational approximation that achieves better performance and satisfies these constraints more frequently, when compared to simpler augmented Lagrangian methods to incorporate these explanations. We demonstrate the benefits of our approach over a large array of synthetic and real-world experiments.

1 Introduction

There has been considerable recent focus on generating explanations of complex black-box models so that humans may better understand their decisions. These can take the form of feature importance [31, 35], counterfactuals [31, 35], influential training samples [18, 43], etc. But what if humans were able to provide explanations for how these models should behave? We are interested in the question of how to learn models given such apriori explanations. A recent line of work incorporates explanations as a regularizer, penalizing models that do not exhibit apriori given explanations [33, 32, 15, 36]. For example, Rieger et al. [32] penalize the feature importance of spurious patches on a skin-cancer classification task. These methods lead to models that inherently satisfy desirable properties and, thus, are more trustworthy. In addition, some of these empirical results suggest that constraining models via explanations also leads to higher accuracy and robustness to changing test environments. However, there is no theoretical analysis to explain this phenomenon.

We note that such explanations can arise from domain experts and domain knowledge, but also from other large teacher models that might have been developed for related tasks. An attractive facet of the latter is that we can automatically generate model-based explanations given unlabeled data points. For instance, we can use segmentation models to select the background pixels of images solely on

Figure 1: A restricted hypothesis class Hϕ,α (left).
Our algorithmic solution to solve a proposed variational objective in Section 5 (right). unlabeled data, which we can use in our model training. We thus view incorporating explanation constraints from such teacher models as a form of knowledge distillation into our student models [13]. In this paper, we provide an analytical framework for learning from explanations to reason when and how explanations can improve the model performance. We first provide a mathematical framework for model constraints given explanations. Casting explanations as functionals g that take in a model h and input x (as is standard in explainable AI), we can represent domain knowledge of how models should behave as constraints on the values of such explanations. We can leverage these to then solve a constrained ERM problem where we additionally constrain the model to satisfy these explanation constraints. Since the explanations and constraints are provided on randomly sampled inputs, these constraints are random. Nevertheless, via standard statistical learning theoretic arguments [38], any model that satisfies the set of explanation constraints on the finite sample can be shown to satisfy the constraints in expectation up to some slack with high probability. In our work, we term a model that satisfies explanations constraints in expectation, an CE model (see Definition 1). Then, we can capture the benefit of learning with explanation constraints by analyzing the generalization capabilities of this class of CE models (Theorem 3.2). This analysis builds off of a learning theoretic framework for semi-supervised learning of Balcan and Blum [1, 2]. We remark that if the explanation constraints are arbitrary, it is not possible to reason if a model satisfies the constraints in expectation based on a finite sample. We provide a detailed discussion on when this argument is possible in Appendix B,D. In addition, we note that our work also has a connection with classical approaches in stochastic programming [16, 4] and is worth investigating this relationship further. Another key contribution of our work is concretely analyzing this framework for a canonical class of explanation constraints given by gradient information for linear models (Theorem 4.1) and two layer neural networks (Theorem 4.2). We focus on gradient constraints as we can represent many different notions of explanations, such as feature importance and ignoring spurious features. These corollaries clearly illustrate that restricting the hypothesis class via explanation constraints can lead to fewer required labeled data. Our results also provide a quantitative measure of the benefits of the explanation constraints in terms of the number of labeled data. We also discuss when learning these explanation constraints makes sense or is possible (i.e., with a finite generalization bound). We note that our framework allows for the explanations to be noisy, and not fully satisfied by even the Bayes optimal classifier. Why then would incorporating explanation constraints help? As our analysis shows, this is by reducing the estimation error (variance) by constraining the hypothesis class, at the expense of approximation error (bias). We defer the question of how to explicitly denoise noisy explanations to future work. Now that we have provided a learning theoretic framework for these explanation constraints, we next consider the algorithmic question: how do we solve for these explanation-constrained models? 
In general, these constraints are not necessarily well-behaved and are difficult to optimize. One can use augmented Lagrangian approaches [33, 7], or simply regularized versions of our constrained problems [32] (which however do not in general solve the constrained problems for non-convex parameterizations but is more computationally tractable). We draw from seminal work in posterior regularization [9], which has also been studied in the capacity of model distillation [14], to provide a variational objective. Our objective is composed of two terms; supervised empirical risk and the discrepancy between the current model and the class of CE models. The optimal solution of our objective is also the optimal solution of the constrained problem which is consistent with our theoretical analysis. Our objective naturally incorporates unlabeled data and provides a simple way to control the trade-off between explanation constraints and the supervised loss (Section 5). We propose a tractable algorithm that iteratively trains a model on the supervised data, and then approximately projects this learnt model onto the class of CE models. Finally, we provide an extensive array of experiments that capture the benefits of learning from explanation constraints. These experiments also demonstrate that the variational approach improves over simpler augmented Lagrangian approaches and can lead to models that indeed satisfy explanations more frequently. 2 Related Work Explainable AI. Recent advances in deep learning have led to models that achieve high performance but which are also highly complex [20, 11]. Understanding these complex models is crucial for safe and reliable deployments of these systems in the real-world. One approach to improve our understanding of a model is through explanations. This can take many forms such as feature importance [31, 35, 23, 37], high level concepts [17, 44], counterfactual examples [39, 12, 25], robustness of gradients [41], or influential training samples [18, 43]. In contrast to generating post-hoc explanations of a given model, we aim to learn models given apriori explanations. There has been some recent work along such lines. Koh et al. [19], Zarlenga et al. [45] incorporates explanations within the model architecture by requiring a conceptual bottleneck layer. Ross et al. [33], Rieger et al. [32], Ismail et al. [15], Stacey et al. [36] use explanations to modify the learning procedure for any class of models: they incorporate explanations as a regularizer, penalizing models that do not exhibit apriori given explanations; Ross et al. [33] penalize input gradients, while Rieger et al. [32] penalize a Contextual Decomposition score [26]. Some of these suggest that constraining models via explanations leads to higher accuracies and more robustness to spurious correlation, but do not provide analytical guarantees. On the theoretical front, Li et al. [22] show that models that are easier to explain locally also generalize well. However, Bilodeau et al. [3] show that common feature attribution methods without additional assumptions on the learning algorithm or data distribution do no better than random guessing at inferring counterfactual model behavior. Learning Theory. Our contribution is to provide an analytical framework for learning from explanations that quantify the benefits of explanation constraints. Our analysis is closely related to the framework of learning with side information. 
Balcan and Blum [2] shows how unlabeled data can help in semi-supervised learning through a notion of compatibility between the data and the target model. This work studies classical notions of side information (e.g., margin, smoothness, and cotraining). Subsequent papers have adapted this learning theoretic framework to study the benefits of representation learning [10] and transformation invariance [34]. On the contrary, our paper focuses on the more recent notion of explanations. Rather than focus on the benefits of unlabeled data, we characterize the quality of different explanations. We highlight that constraints here are stochastic, as they depend on data points which differs from deterministic constraints that have been considered in existing literature, such as constraints on the norm of weights (i.e., L2 regularization). Self-Training. Our work can also be connected to the self-training literature [5, 42, 40, 8], where we could view our variational objective as comprising a regularized (potentially simpler) teacher model that encodes these explanation constraints into a student model. Our variational objective (where we use simpler teacher models) is also related to distillation, which has also been studied in terms of gradients [6]. 3 Learning from Explanation Constraints Let X be the instance space and Y be the label space. We focus on binary classification where Y = { 1, 1}, but which can be naturally generalized. Let D be the joint data distribution over (X, Y) and DX the marginal distribution over X. For any classifier h : X Y, we are interested in its classification error err(h) := Pr(x,y) D(h(x) = y), though one could also use other losses to define classification error. Our goal is to learn a classifier with small error from a family of functions H. In this work, we use the words model and classifier interchangeably. Now, we formalize local explanations as functionals that take in a model and a test input, and output a vector: Definition 1 (Explanations). Given an instance space X, model hypothesis class H, and an explanation functional g : H X Rr, we say g(h, x) is an explanation of h on point x induced by g. For simplicity, we consider the setting when g takes a single data point and model as input, but this can be naturally extended to multiple data points and models. We can combine these explanations with prior knowledge on how explanations should look like at sample points in term of constraints. Definition 2 (Explanation Constraint Set). For any instance space X, hypothesis class H, an explanation functional g : H X Rr, and a family of constraint sets {C(x) Rr | x X}, we say that h H satisfies the explanation constraints with respect to C iff: g(h, x) C(x), x X. In our definition, C(x) represents values that we believe our explanations should take at a point x. For example, an input gradient of a feature 1 must be larger than feature 2 can be represented by g(h, x) = xh(x) and C(x) = {(x1, . . . , xd) Rd | x1 > x2}. In practice, human annotators will be able to provide the constraint set C(x ) for a random sample k data points SE = {x 1, . . . , x k} drawn i.i.d. from DX . We then say that any h H SE-satisfies the explanation constraints with respect to C iff g(h, x) C(x), x SE. We note that the constraints depends on random samples x i and therefore are random. To tackle this challenge, we can draw from the standard learning theoretic arguments to reason about probably approximately satisfying the constraints in expectation. 
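To make Definitions 1 and 2 concrete, below is a minimal sketch of the gradient-based explanation functional and constraint set from the example above, checked on a random sample SE. The finite-difference gradient estimator and all names are illustrative choices for this sketch, not part of our formal setup.

```python
import numpy as np

# Explanation functional g(h, x): the input gradient of h at x,
# estimated by central finite differences so that it works for any callable h.
def input_gradient(h, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (h(x + e) - h(x - e)) / (2 * eps)
    return grad

# Constraint set C(x) from the example above: "feature 1 matters more than feature 2",
# i.e. C(x) = {v in R^d : v_1 > v_2}.
def in_constraint_set(v):
    return v[0] > v[1]

# A hypothesis: a fixed linear model h(x) = <w, x>, whose input gradient is w.
w = np.array([2.0, 0.5, -1.0])
h = lambda x: float(w @ x)

# S_E: k sampled points at which annotators provide the constraint.
rng = np.random.default_rng(0)
S_E = rng.normal(size=(5, 3))

# h "S_E-satisfies" the constraints iff g(h, x) lies in C(x) for every x in S_E.
satisfies = all(in_constraint_set(input_gradient(h, x)) for x in S_E)
print("S_E-satisfies the gradient constraint:", satisfies)  # True here, since w_1 > w_2
```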
Before doing so, we first consider the notion of explanation surrogate losses, which will allow us to generalize the setup above to a form that is amenable to practical estimators. Definition 3. (Explanation surrogate loss) An explanation surrogate loss ϕ : H X R quantifies how well a model h satisfies the explanation constraint g(h, x) C(x). For any h H, x X: 1. ϕ(h, x) 0. 2. If g(h, x) C(x) then ϕ(h, x) = 0. For example, we could define ϕ(h, x) = 1{g(h, x) C(x)}. Given such a surrogate loss, we can substitute the explanation constraint that g(h, x) C(x) with the surrogate ϕ(h, x) 0. We now have the machinery to formalize how to reason about the random explanation constraints given a random set of inputs. First, denote the expected explanation loss as ϕ(h, D) := Ex D[ϕ(h, x)]. We are interested in models that satisfy the explanation constraints up to some slack τ (i.e. approximately) in expectation. We define a learnability condition of this explanation surrogate loss as EPAC (Explanation Probably Approximately Correct ) learnability. Definition 4 (EPAC learnability). For any δ (0, 1), τ > 0, the sample complexity of (δ, τ) - EPAC learning of H with respect to a surrogate loss ϕ, denoted m(τ, δ; H, ϕ) is defined as the smallest m N for which there exists a learning rule A such that every data distribution DX over X, with probability at least 1 δ over S Dm, ϕ(A(S), D) inf h H ϕ(h, D) + τ. If no such m exists, define m(τ, δ; H, ϕ) = . We say that H is EPAC learnable in the agnostic setting with respect to a surrogate loss ϕ if δ (0, 1), τ > 0, m(τ, δ; H, ϕ) is finite. Furthermore, for a constant τ, we denote any model h H with τ-Approximately Correct Explanation where ϕ(h, D) τ, with a τ - CE models. We define the class of τ - CE models as Hϕ,D,τ = {h H : ϕ(h, D) τ}. (1) We simply use Hϕ,τ to denote this class of CE models. From natural statistical learning theoretic arguments, a model that satisfies the random constraints in SE might also be a CE model. Proposition 3.1. Suppose a model h SE-satisfies the explanation constraints then ϕ(h, DX ) 2Rk(G) + with probability at least 1 δ, when k = |SE| and G = {ϕ(h, ) | h H}. We use Rk( ) to denote Rademacher complexity; please see Appendix A where we review this and related concepts. Note that even when h satisfies the constraints exactly on S, we can only guarantee a bound on the expected surrogate loss ϕ(h, DX ).We can achieve a bound similar to that in Proposition 3.1 via a single and simpler constraint on the empirical expectation ϕ(h, SE) = 1 |SE| P x SE ϕ(h, x). We can then extend the above proposition to show that if ϕ(h, SE) τ, then ϕ(h, DX ) τ + 2Rk(G) + q 2k , with probability at least 1 δ. Another advantage of such a constraint is that the explanation constraints could be noisy, or it may be difficult to satisfy them exactly, so τ also serves as a slack. The class G contains all surrogate losses of any h H. Depending on the explanation constraints, G can be extremely large. We remark that the surrogate loss ϕ allows us to reason about satisfying an explanation constraint on a new data point and in expectation. However, for many constraints, ϕ does not have a closed-form or is unknown on an unseen data point. The question of which types of explanation constraints are generalizable may be of independent interest, and we further discuss this in Appendix B and provide further examples of learnable constraints in Appendix D. EPAC-ERM Objective. 
Let us next discuss combining the two sources of information: the explanation constraints that we set up in the previous section, together with the usual set of labeled training samples S = {(x1, y1), . . . , (xn, yn)} drawn i.i.d. from D that informs the empirical risk. Combining these, we get what we call the EPAC-ERM objective:

$$\min_{h \in H} \; \frac{1}{n}\sum_{i=1}^{n} \ell(h, x_i, y_i) \quad \text{s.t.} \quad \frac{1}{k}\sum_{i=1}^{k} \phi(h, x'_i) \le \tau. \qquad (2)$$

We provide a learnability condition for a model that achieves both low average error and low surrogate loss in Appendix F.

3.1 Generalization Bound

We assume that we are in a doubly agnostic setting. Firstly, we are agnostic in the usual sense that there need be no classifier in the hypothesis class H that perfectly labels (x, y); instead, we hope to achieve the best error rate in the hypothesis class, attained by h* = arg min_{h ∈ H} err_D(h). Secondly, we are also agnostic with respect to the explanations: the optimal classifier h* may not satisfy the explanation constraints exactly, so that it incurs nonzero surrogate explanation loss ϕ(h*, D) > 0.

Theorem 3.2 (Generalization Bound for Agnostic Setting). Consider a hypothesis class H, distribution D, and explanation loss ϕ. Let S = {(x1, y1), . . . , (xn, yn)} be drawn i.i.d. from D and SE = {x'1, . . . , x'k} be drawn i.i.d. from DX. With probability at least 1 − δ, for ĥ ∈ H that minimizes the empirical risk err_S(h) subject to ϕ(h, SE) ≤ τ, we have

$$\mathrm{err}_D(\hat{h}) \le \mathrm{err}_D(h^*_{\tau-\varepsilon_k}) + 2R_n(H_{\phi,\tau+\varepsilon_k}) + 2\sqrt{\frac{\log(4/\delta)}{2n}}, \qquad \varepsilon_k = 2R_k(G) + \sqrt{\frac{\log(4/\delta)}{2k}},$$

where G = {ϕ(h, ·) | h ∈ H} and h*_τ = arg min_{h ∈ H_{ϕ,τ}} err_D(h).

Proof. The proof largely follows the arguments in Balcan and Blum [2], but we use Rademacher complexity-based deviation bounds instead of VC-entropy. We defer the full proof to Appendix E.

Our bound suggests that these constraints help with our learning by shrinking the hypothesis class H to H_{ϕ,τ+εk}, reducing the required sample complexity. However, there is also a trade-off between this reduction and accuracy. In our bound, we compare against the best classifier h*_{τ−εk} ∈ H_{ϕ,τ−εk} instead of h*. Since we may have ϕ(h*, D) > 0, if τ is too small, we may reduce H to a hypothesis class that does not contain any good classifiers. Recall that the generalization bound for standard supervised learning in the absence of explanation constraints is given by

$$\mathrm{err}_D(\hat{h}) \le \mathrm{err}_D(h^*) + 2R_n(H) + 2\sqrt{\frac{\log(2/\delta)}{2n}}.$$

The difference between this upper bound and the upper bound in Theorem 3.2 can be seen as a possible notion of the goodness of an explanation constraint. We further discuss this in Appendix C.

Figure 2: Visualization of the piecewise constant function ∇x h(x) − ∇x h*(x) when h is a two layer NN with 1 node. Background colors represent regions with non-zero values.

4 Gradient Explanations for Particular Hypothesis Classes

In this section, we further quantify the usefulness of explanation constraints on different concrete examples and characterize the Rademacher complexity of the restricted hypothesis classes. In particular, we consider an explanation constraint on the input gradient. For example, we may want our model's gradient to be close to that of some h* ∈ H. This translates to g(h, x) = ∇x h(x) and C(x) = {v ∈ R^d | ‖v − ∇x h*(x)‖ ≤ τ} for some τ > 0.

4.1 Gradient Explanations for Linear Models

We now consider the case of a uniform distribution on a sphere, and we use the symmetry of this distribution to derive an upper bound on the Rademacher complexity (full proof in Appendix H).

Theorem 4.1 (Rademacher complexity of linear models with a gradient constraint, uniform distribution on a sphere).
Let DX be a uniform distribution on a unit sphere in Rd, let H = {h : x 7 wh, x | wh Rd, ||wh||2 B} be a class of linear models with weights bounded by a constant B. Let ϕ(h, x) = θ(wh, wh ) be a surrogate loss where θ(u, v) is an angle between u, v. We have Rn(Hϕ,τ) B n sin(τ) p + 1 p where p = erf and erf(x) = 2 π R x 0 e t2dt is the standard error function. The standard upper bound on the Rademacher complexity of linear models is B n. Our bound has a nice interpretation; we shrink our bound by a factor of ( 1 p 2 + sin(τ)p). We remark that d increases, we observe that p 1, so the term sin(τ)p dominates this factor. As a consequence, we get that our bound is now scaled by sin(τ) τ and the the Rademacher complexity scales down by a factor of τ. This implies that given n labeled data, to achieve a fast rate O( 1 n), we need τ to be as good as O( 1 n). 4.2 Gradient Explanations for Two Layer Neural Networks Theorem 4.2 (Rademacher complexity of two layer neural networks (m hidden nodes) with a gradient constraint). Let X be an instance space and DX be a distribution over X with a large enough support. Let H = {h : x 7 Pm j=1 wjσ(u j x)|wj R, uj Rd, Pm j=1 |wj| B, uj 2 = 1} be a class of two layer neural networks with a Re LU activation function and bounded weight. Assume that there exists some constant C > 0 such that Ex DX [ x 2 2] C2. Consider an explanation loss given by ϕ(h, x) = xh(x) xh (x) 2 + 1{ xh(x) xh (x) > τ} for some τ > 0. Then, we have that Rn(Hϕ,τ) 3τm C n . Proof. (Sketch) The key ingredient is to identify the impact of the gradient constraint and the form of class Hϕ,τ. We provide an idea when we have m = 1 node. We write h(x) = wσ(u x) and h (x) = w σ(u x). Note that xh(x) xh (x) = wu1{u x > 0} w u 1{(u ) x > 0} is a piecewise constant function (Figure 2). Assume that the probability mass of each region is nonnegative, our gradient constraint implies that the norm of each region cannot be larger than τ. 1. If u, u have different directions, we have 4 regions in xh(x) xh (x) and can conclude that |w| < τ, |w | < τ. 2. If u = u have the same direction, we only have 2 regions in xh(x) xh (x) and can conclude that wu w u = |w w | < τ. The gradient constraint enforces a model to have the same node boundary (u = u ) with a small weight difference |w w | < τ or that node would have a small weight |w| < τ. This finding allows us to determine the restricted class Hϕ,τ, and we can use this to bound the Rademacher complexity accordingly. For full details, see Appendix I. We compare this with the standard Rademacher complexity of a two layer neural network [24], Rn(H) 2BC n . We can do better than this standard bound if τ < 2B 3m. One interpretation for this is that we have a budget at most τ to change the weight of each node and for total m nodes, we can change the weight by at most τm. We compare this to B which is an upper bound on the total weight Pm j=1 |wj| B. Therefore, we can do better than a standard bound when we can change the weight by at most two thirds of the average weight 2B 3m for each node. We would like to point out that our bound does not depend on the distribution D because we choose a specific explanation loss that guarantees that the gradient constraint holds almost everywhere. Extending to a weaker loss such as ϕ(h, x) = xh(x) xh (x) is a future research direction. In contrast, our result for linear models uses a weaker explanation loss and depends on D (Theorem H.1). 
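To illustrate the explanation loss used in Theorem 4.2, the sketch below evaluates ϕ(h, x) for two layer ReLU networks using the piecewise constant form of the input gradient from the proof sketch above. The variable names and the random instantiation are purely illustrative.

```python
import numpy as np

# For h(x) = sum_j w_j * relu(u_j^T x), the input gradient has the closed form
#   grad_x h(x) = sum_j w_j * u_j * 1{u_j^T x > 0}   (almost everywhere).
def input_gradient(w, U, x):
    # w: (m,) output weights, U: (m, d) unit-norm first-layer weights, x: (d,)
    active = (U @ x > 0).astype(float)   # 1{u_j^T x > 0}
    return (w * active) @ U              # sum_j w_j * 1{...} * u_j

def explanation_loss(w, U, w_star, U_star, x, tau):
    # phi(h, x) = ||grad h(x) - grad h*(x)||_2 + 1{ ||grad h(x) - grad h*(x)||_2 > tau }
    gap = np.linalg.norm(input_gradient(w, U, x) - input_gradient(w_star, U_star, x))
    return gap + float(gap > tau)

# Tiny illustration (all numbers are made up): a student sharing the teacher's node
# boundaries (U = U*) with a small weight difference, as in the proof sketch.
rng = np.random.default_rng(1)
d, m, tau = 4, 3, 0.5
U_star = rng.normal(size=(m, d))
U_star /= np.linalg.norm(U_star, axis=1, keepdims=True)
w_star = rng.normal(size=m)
U, w = U_star.copy(), w_star + 0.01 * rng.normal(size=m)

x = rng.normal(size=d)
print(explanation_loss(w, U, w_star, U_star, x, tau))  # small: constraint nearly satisfied
```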
We also assume that there exists x with positive probability density in every partition created by ∇x h(x). This is not a strong assumption, and it holds for any distribution whose support is all of R^d, e.g., Gaussian distributions.

5 Algorithms for Learning from Explanation Constraints

Although we have analyzed learning with explanation constraints, algorithms to solve this constrained optimization problem are non-trivial. In this setting, we assume that we have access to n labeled data {(x_i, y_i)}_{i=1}^n, m unlabeled data {x_{n+i}}_{i=1}^m, and k data with explanations {(x_{n+m+i}, ϕ(·, x_{n+m+i}))}_{i=1}^k. We argue that in many cases, the n labeled data are the most expensive to annotate. The k data points with explanations also have non-trivial cost; they require an expert to provide the annotated explanation or to provide a surrogate loss ϕ. If the surrogate loss is specified, then we can evaluate it on any unlabeled data; otherwise, these data points with explanations could be expensive. On the other hand, the m data points can be obtained cheaply as they are completely unlabeled. We now consider existing approaches to incorporate this explanation information.

EPAC-ERM: Recall our EPAC-ERM objective from (2):

$$\min_{h \in H} \; \frac{1}{n}\sum_{i=1}^{n} 1\{h(x_i) \neq y_i\} \quad \text{s.t.} \quad \frac{1}{k}\sum_{j=n+m+1}^{n+m+k} \phi(h, x_j) \le \tau$$

for some constant τ. This constraint in general requires more complex optimization techniques (e.g., running multiple iterations and comparing values of τ) to solve algorithmically. We could also consider the case where τ = 0, which would require the hypothesis to satisfy the explanation constraints exactly; however, this is in general too strong a constraint with noisy explanations.

Augmented Lagrangian objectives:

$$\min_{h \in H} \; \frac{1}{n}\sum_{i=1}^{n} 1\{h(x_i) \neq y_i\} + \frac{\lambda}{k}\sum_{j=n+m+1}^{n+m+k} \phi(h, x_j)$$

As is done in prior work [32], we can consider an augmented Lagrangian objective. A crucial caveat with this approach is that the explanation surrogate loss is in general a much more complicated functional of the hypothesis than the empirical risk. For instance, it might involve the gradient of the hypothesis when we use gradient-based explanations. Computing the gradients of such a surrogate loss can be more expensive than computing the gradients of the empirical risk. For instance, in our experiments, computing the gradients of the surrogate loss that involves input gradients is 2.5 times slower than that of the empirical risk. With the above objective, however, we need to compute the same number of gradients for both the explanation surrogate loss and the empirical risk. These computational difficulties have arguably made incorporating explanation constraints less popular than it could be.

5.1 Variational Method

To alleviate these computational difficulties, we propose a new variational objective

$$\min_{h \in H} \; (1 - \lambda)\, \mathbb{E}_{(x,y)\sim D}[\ell(h(x), y)] + \lambda \inf_{f \in H_{\phi,\tau}} \mathbb{E}_{x \sim D_X}[\ell(h(x), f(x))],$$

where ℓ is some loss function and τ ≥ 0 is some threshold. The first term is the standard expected risk of h, while the second term can be viewed as a projection distance between h and the class of τ-CE models. It can be seen that the optimal solution of EPAC-ERM is also an optimal solution of our proposed variational objective. The advantage of this formulation, however, is that it decouples the standard expected risk component from the surrogate risk component. This allows us to solve this objective with the following iterative technique, drawing inspiration from prior work in posterior regularization [9, 14]. More specifically, let h_t be the learned model at time t. At each timestep t:
1. We project h_t onto the class of τ-CE models:

$$f_{t+1,\phi} = \underset{h \in H}{\operatorname{argmin}} \; \frac{1}{m}\sum_{i=n+1}^{n+m} \ell(h(x_i), h_t(x_i)) + \lambda \max\!\Big(0, \; \frac{1}{k}\sum_{i=n+m+1}^{n+m+k} \phi(h, x_i) - \tau\Big).$$

The first term is the difference between h_t and f on unlabeled data. The second term penalizes the average surrogate loss, which we want to be smaller than τ. Here, λ is a regularization hyperparameter.

2. We calculate h_{t+1} that minimizes the empirical risk on labeled data and matches the pseudolabels from f_{t+1,ϕ}:

$$h_{t+1} = \underset{h \in H}{\operatorname{argmin}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i) + \frac{1}{m}\sum_{i=n+1}^{n+m} \ell(h(x_i), f_{t+1,\phi}(x_i)).$$

Here, the discrepancy between h and f_{t+1,ϕ} is evaluated on the unlabeled data {x_j}_{j=n+1}^{n+m}.

The advantage of this decoupling is that we can use a different number of gradient steps and different learning rates for the projection step, which involves the complicated surrogate loss, than for the empirical risk minimization step. Secondly, we can simplify the projection iterate computation by replacing H_{ϕ,τ} with a simpler class of teacher models F_{ϕ,τ} for greater efficiency. Thus, the decoupled approach to solving the EPAC-ERM objective is in general more computationally convenient. We initialize this procedure with some model h_0. We remark that one could see this as constraint-regularized self-training, where h_t is a student model and f_t is a teacher model. At each timestep, we project the student model onto the closest teacher model that satisfies the constraint. The next student model then learns from both labeled data and pseudolabels from the teacher model. In standard self-training, we do not have any constraint and we have f_t = h_t.

6 Experiments

We provide both synthetic and real-world experiments to support our theoretical results and clearly illustrate interesting tradeoffs of incorporating explanations. In our experiments, we compare our method against 3 baselines: (1) a standard supervised learning approach, (2) a simple Lagrangian-regularized method (that directly penalizes the surrogate loss ϕ), and (3) self-training, which propagates the predictions of (1) and matches them on unlabeled data. We remark that (2) captures the essence of the method in Ross et al. [33], except there is no ℓ2 regularization term. Our experiments demonstrate that the proposed variational approach is preferable to simple Lagrangian methods and other supervised methods in many cases. In particular, the variational approach leads to higher accuracy in limited labeled data settings. In addition, our method leads to models that satisfy the explanation constraints much more frequently than other baselines. We also compare to a Lagrangian-regularized + self-training baseline (first, we use the model from baseline (2) to generate pseudolabels for unlabeled data and then train a new model on both labeled and unlabeled data) in Appendix L.

Figure 3: Comparison of MSE on regressing a linear model. Results are averaged over 5 seeds. m = 1000, k = 20.

Figure 4: Comparison of MSE on regressing a two layer neural network (left) and ℓ2 distance over input gradients as we vary the amount of labeled data n (right). Left is task performance and right is explanation constraint satisfaction. Results are averaged over 5 seeds. m = 1000, k = 20.
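To make the comparison above concrete, the sketch below outlines one way the iterative procedure of Section 5.1 can be implemented with a gradient-matching surrogate loss and squared-error training losses. The function and variable names, hyperparameters, and the use of PyTorch are illustrative choices rather than a description of our exact experimental setup.

```python
import torch
import torch.nn as nn

def surrogate_loss(model, x, target_grad):
    """phi(h, x): squared L2 gap between the model's input gradient and the annotated one."""
    x = x.clone().requires_grad_(True)
    out = model(x).sum()
    grad_x = torch.autograd.grad(out, x, create_graph=True)[0]
    return ((grad_x - target_grad) ** 2).sum(dim=1).mean()

def variational_train(student, make_teacher, x_lab, y_lab, x_unlab, x_expl, g_expl,
                      lam=1.0, rounds=2, steps=200, lr=1e-2):
    mse = nn.MSELoss()
    for _ in range(rounds):
        # Step 1: approximately project the current student onto the tau-CE models:
        # a fresh teacher matches the student on unlabeled data while being penalized
        # for violating the explanation constraints at the k annotated points.
        teacher = make_teacher()
        opt = torch.optim.Adam(teacher.parameters(), lr=lr)
        with torch.no_grad():
            pseudo = student(x_unlab)
        for _ in range(steps):
            opt.zero_grad()
            loss = mse(teacher(x_unlab), pseudo) + lam * surrogate_loss(teacher, x_expl, g_expl)
            loss.backward()
            opt.step()
        # Step 2: retrain the student on labeled data plus the teacher's pseudolabels.
        with torch.no_grad():
            pseudo = teacher(x_unlab)
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = mse(student(x_lab), y_lab) + mse(student(x_unlab), pseudo)
            loss.backward()
            opt.step()
    return student
```

In this schematic, the Lagrangian baseline would instead minimize the labeled loss plus λ times the surrogate loss in a single training run, without the projection and pseudolabeling steps.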
We remark that this Lagrangian-regularized + self-training baseline is not a standard method in practice and does not fit nicely into a theoretical framework, although it seems to be the most natural approach to using unlabeled data in this procedure. More extensive ablations are deferred to Appendix N, and code to replicate our experiments will be released with the full paper.

6.1 Regression Task with Exact Gradient Information

In our synthetic experiments, we focus on a regression task where we try to learn some model contained in our hypothesis class. Our data is given by X = R^d, and we try to learn a target function h*: X → R. Our data distribution is given by X ∼ N(0, σ²I), where I is a d × d identity matrix. We generate h* by randomly initializing a model in the specific hypothesis class H. We assume that we have n labeled data, m unlabeled data, and k data with explanations. We first present a synthetic experiment for learning with a perfect explanation, meaning that ϕ(h*, S) = 0. We consider the case where we have the exact gradient of h*. Here, let H be the class of linear models and note that the exact gradient gives us the slope of the linear model, so we only need to learn the bias term. Incorporating these explanations indeed helps, as both methods that include explanation constraints (Lagrangian and ours) perform much better (Figure 3). We also demonstrate incorporating this information for two layer neural networks. We observe a clear difference between the simpler Lagrangian approach and our variational objective (Figure 4, left). Our method is clearly the best in the setting with limited labeled data and matches the performance of the strong self-training baseline with sufficient labeled data. We note that this is somewhat expected, as these constraints primarily help in the setting with limited labeled data; with enough labeled data, standard PAC bounds suffice for strong performance. We also analyze how strongly the approaches enforce these explanation constraints on new data points that are seen at test time (Figure 4, right) for two layer NNs. We observe that our variational objective leads to input gradients that more closely match the ground-truth target network's input gradients. This demonstrates that, in the case of two layer NNs with gradient explanations, our approach best achieves both good performance and constraint satisfaction. Standard self-training achieves similar performance in terms of MSE but has no notion of satisfying the explanation constraints. The Lagrangian method does not achieve the same level of satisfying these explanations, as it is unable to generalize and satisfy these constraints on new data.

Figure 5: Comparison of accuracy on the YouTube (left) and the Yelp (right) datasets. Here, we let m = 500, k = 150, T = 2, τ = 0.0. Results are averaged over 40 seeds.

6.2 Tasks with Imperfect Explanations

Assuming access to perfect explanations may be unrealistic in practice, so we present experiments where our explanations are imperfect. We present classification tasks (Figure 5) from a weak supervision benchmark [46]. In this setting, we obtain explanations through the approximate gradients of a single weak labeler, as is done in [?].
More explicitly, weak labelers are heuristics designed by domain experts; one example is functions that check for the presence of particular words in a sentence (e.g., checking for the word delicious in a Yelp comment, which would indicate positive sentiment). We can then access gradient information from such weak labelers, which gives us a notion of feature importance about particular features in our data. We note that these examples of gradient information are rather easy to obtain, as we only need domain experts to specify simple heuristic functions for a particular task. Once given these functions, we can apply them easily over unlabeled data without requiring any example-level annotations. We observe that our variational objective achieves better performance than all other baseline approaches on the majority of settings defined by the number of labeled data. We remark that the explanation in this dataset is a noisy gradient explanation along two feature dimensions, yet this still improves upon methods that do not incorporate this explanation constraint. Indeed, our method outperforms the Lagrangian approach, showing the benefits of iterative rounds of self-training over the unlabeled data. In addition to our real-world experiments, we present synthetic experiments with noisy gradients in Appendix K.1. 7 Discussion Our work proposes a new learning theoretic framework that provides insight into how apriori explanations of desired model behavior can benefit the standard machine learning pipeline. The statistical benefits of explanations arise from constraining the hypothesis class: explanation samples serve to better estimate the population explanation constraint, which constrains the hypothesis class. This is to be contrasted with the statistical benefit of labeled samples, which serve to get a better estimate of the population risk. We provide instantiations of our analysis for the canonical class of gradient explanations, which captures many explanations in terms of feature importance. It would be of interest to provide corollaries for other types of explanations in future work. As mentioned before, the generality of our framework has larger implications towards incorporating constraints that are not considered as standard explanations. For example, this work can be leveraged to incorporate more general notions of side information and inductive biases. We also discuss the societal impacts of our approach in Appendix O. As a whole, our paper supports using further information (e.g., explanation constraints) in the standard learning setting. 8 Acknowledgements This work was supported in part by DARPA under cooperative agreement HR00112020003, FA875023-2-1015, ONR grant N00014-23-1-2368, NSF grant IIS-1909816, a Bloomberg Data Science Ph D fellowship and funding from Bosch Center for Artificial Intelligence and the ARCS Foundation. [1] M.-F. Balcan and A. Blum. A pac-style model for learning from labeled and unlabeled data. In International Conference on Computational Learning Theory, pages 111 126. Springer, 2005. 2 [2] M.-F. Balcan and A. Blum. A discriminative model for semi-supervised learning. Journal of the ACM (JACM), 57(3):1 46, 2010. 2, 3, 5, 20 [3] B. Bilodeau, N. Jaques, P. W. Koh, and B. Kim. Impossibility theorems for feature attribution. ar Xiv preprint ar Xiv:2212.11870, 2022. 3 [4] J. R. Birge and F. Louveaux. Introduction to stochastic programming. Springer Science & Business Media, 2011. 2 [5] O. Chapelle, B. Scholkopf, and A. Zien. 
Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks, 20(3):542 542, 2009. 3 [6] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. Advances in neural information processing systems, 30, 2017. 3 [7] F. Fioretto, P. Van Hentenryck, T. W. Mak, C. Tran, F. Baldo, and M. Lombardi. Lagrangian duality for constrained deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 118 135. Springer, 2021. 2 [8] S. Frei, D. Zou, Z. Chen, and Q. Gu. Self-training converts weak learners to strong learners in mixture models. In International Conference on Artificial Intelligence and Statistics, pages 8003 8021. PMLR, 2022. 3 [9] K. Ganchev, J. Grac a, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11:2001 2049, 2010. 2, 8 [10] S. Garg and Y. Liang. Functional regularization for representation learning: A unified theoretical perspective. Advances in Neural Information Processing Systems, 33:17187 17199, 2020. 3 [11] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016. 3 [12] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. In International Conference on Machine Learning, pages 2376 2384. PMLR, 2019. 3 [13] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. 2 [14] Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2410 2420, 2016. 2, 8 [15] A. A. Ismail, H. Corrada Bravo, and S. Feizi. Improving deep learning interpretability by saliency guided training. Advances in Neural Information Processing Systems, 34:26726 26739, 2021. 1, 3 [16] P. Kall, S. W. Wallace, and P. Kall. Stochastic programming, volume 6. Springer, 1994. 2 [17] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668 2677. PMLR, 2018. 3 [18] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885 1894. PMLR, 2017. 1, 3 [19] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338 5348. PMLR, 2020. 3 [20] Y. Le Cun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436 444, 2015. 3 [21] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991. 15 [22] J. Li, V. Nagarajan, G. Plumb, and A. Talwalkar. A learning theoretic perspective on local explainability. In International Conference on Learning Representations, 2020. 3 [23] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017. 3 [24] T. Ma. Lecture notes from machine learning theory, 2022. URL http://web.stanford.edu/ class/stats214/. 7, 14, 26 [25] R. K. Mothilal, A. Sharma, and C. Tan. Explaining machine learning classifiers through diverse counterfactual explanations. 
In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 607 617, 2020. 3 [26] W. J. Murdoch, P. J. Liu, and B. Yu. Beyond word importance: Contextual decomposition to extract interactions from lstms. In International Conference on Learning Representations, 2018. 3 [27] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. Advances in neural information processing systems, 26, 2013. 17 [28] R. Pukdee, D. Sam, P. K. Ravikumar, and N. Balcan. Label propagation with weak supervision. In The Eleventh International Conference on Learning Representations, 2022. 17 [29] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. R e. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, page 269. NIH Public Access, 2017. 17 [30] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. R e. Data programming: Creating large training sets, quickly. Advances in neural information processing systems, 29, 2016. 17, 35 [31] M. T. Ribeiro, S. Singh, and C. Guestrin. why should i trust you? explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135 1144, 2016. 1, 3 [32] L. Rieger, C. Singh, W. Murdoch, and B. Yu. Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. In International conference on machine learning, pages 8116 8126. PMLR, 2020. 1, 2, 3, 7 [33] A. S. Ross, M. C. Hughes, and F. Doshi-Velez. Right for the right reasons: training differentiable models by constraining their explanations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2662 2670, 2017. 1, 2, 3, 8, 16 [34] H. Shao, O. Montasser, and A. Blum. A theory of PAC learnability under transformation invariances. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=l1Wlf Na Rk Kw. 3 [35] D. Smilkov, N. Thorat, B. Kim, F. Vi egas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. ICML Workshop on Visualization for Deep Learning, 2017, 2017. 1, 3 [36] J. Stacey, Y. Belinkov, and M. Rei. Supervising model attention with human explanations for robust natural language inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11349 11357, 2022. 1, 3 [37] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In International conference on machine learning, pages 3319 3328. PMLR, 2017. 3 [38] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984. [39] S. Wachter, B. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv. JL & Tech., 31:841, 2017. 3 [40] C. Wei, K. Shen, Y. Chen, and T. Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations, 2020. 3 [41] M. R. Wicker, J. Heo, L. Costabello, and A. Weller. Robust explanation constraints for neural networks. In The Eleventh International Conference on Learning Representations, 2022. 3 [42] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687 10698, 2020. 3 [43] C.-K. Yeh, J. Kim, I. E.-H. Yen, and P. K. Ravikumar. Representer point selection for explaining deep neural networks. Advances in neural information processing systems, 31, 2018. 1, 3 [44] C.-K. Yeh, B. Kim, S. Arik, C.-L. Li, T. Pfister, and P. Ravikumar. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33:20554 20565, 2020. 3 [45] M. E. Zarlenga, P. Barbiero, G. Ciravegna, G. Marra, F. Giannini, M. Diligenti, F. Precioso, S. Melacci, A. Weller, P. Lio, et al. Concept embedding models. In Neur IPS 2022-36th Conference on Neural Information Processing Systems, 2022. 3 [46] J. Zhang, Y. Yu, Y. Li, Y. Wang, Y. Yang, M. Yang, and A. Ratner. Wrench: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. 10, 38 A Uniform Convergence via Rademacher Complexity A standard tool for providing performance guarantees of supervised learning problems is a generalization bound via uniform convergence. We will first define the Rademacher complexity and its corresponding generalization bound. Definition 5. Let F be a family of functions mapping X R. Let S = {x1, . . . , xm} be a set of examples drawn i.i.d. from a distribution DX . Then, the empirical Rademacher complexity of F is defined as RS(F) = E σ i=1 σif(xi) where σ1, . . . , σm are independent random variables uniformly chosen from { 1, 1}. Definition 6. Let F be a family of functions mapping X R. Then, the Rademacher complexity of F is defined as Rn(F) = E S Dn X [RS(F)] . The Rademacher complexity is the expectation of the empirical Rademacher complexity, over n samples drawn i.i.d. from the distribution DX . Theorem A.1 (Rademacher-based uniform convergence). Let DX be a distribution over X, and F a family of functions mapping X [0, 1]. Let S = {x1, . . . , xn} be a set of samples drawn i.i.d. from DX , then with probability at least 1 δ over our draw S, |ED[f(x)] ˆES[f(x)]| 2Rn(F) + This holds for every function f F, and ˆES[f(x)] is expectation over a uniform distribution over S. This bound on the empirical Rademacher complexity leads to the standard generalization bound for supervised learning. Theorem A.2. For a binary classification setting when y { 1} with a zero-one loss, for H {h : X { 1, 1}} be a family of binary classifiers, let S = {(x1, y1), . . . , (xn, yn)} is drawn i.i.d. from D then with probability at least 1 δ, we have |err D(h) d err S(h)| Rn(H) + for every h H when err D(h) = Pr (x,y) D(h(x) = y) c err S(h) = 1 i=1 1[h(xi) = yi] is the empirical error on S. For a linear model with a bounded weights in ℓ2 norm, the Rademacher complexity is O( 1 n). We refer to the proof from Ma [24] for this result. Theorem A.3 (Rademacher complexity of a linear model ([24])). Let X be an instance space in Rd, let DX be a distribution on X, let H = {h : x wh, x | wh Rd, ||wh||2 B} be a class of linear model with weights bounded by some constant B > 0 in ℓ2 norm. Assume that there exists a constant C > 0 such that Ex DX [||x||2 2] C2. For any S = {x1, . . . , xn} is drawn i.i.d. from DX , we have i=1 ||xi||2 2 Rn(H) BC n . Many of our proofs require the usage of Talgrand s lemma, which we now present. Lemma A.4 (Talgrand s Lemma [21]). Let ϕ : R R be a k-Lipschitz function. 
Then for a hypothesis class H = {h : Rd R}, we have that RS(ϕ H) k Rs(H) where ϕ H = {f : z 7 ϕ(h(z))|h H}. B Generalizable Constraints We know that constraints C(x) capture human knowledge about how explanations at a point x should behave. For any constraints C(x) that are known apriori for all x X, we can evaluate whether a model satisfies the constraints at a point x X. This motivates us to discuss the ability of models to generalize from any finite samples SE to satisfy these constraints over X with high probability. Having access to C(x) is equivalent to knowing how models should behave over all possible data points in terms of explanations, which may be too strong of an assumption. Nevertheless, many forms of human knowledge can be represented by a closed-form function C(x). For example, 1. An explanation has to take value in a fixed range can be represented by C(x) = Πr i=1[ai, bi], x X. 2. An explanation has to stay in a ball around x can be represented by C(x) = {u Rd | ||u x||2 r}. 3. An explanation has to stay in a rectangle around x 3 can be represented by C(x) = {u Rd | xi 3 + bi, i = 1, . . . , d}. ball around fixed interval rectangle around Figure 6: Illustration of examples of explanation constraints, given from some learnable class C(x). In this case, there always exists a surrogate loss that represents the explanation constraints C(x); for example, we can set ϕ(h, x) = 1{g(h, x) C(x)}. On the other hand, directly specifying explanation constraints through a surrogate loss would also imply that C(x) is known apriori for all x X. The task of generalization to satisfy the constraint on unseen data is well-defined in this setting. Furthermore, if a surrogate loss ϕ is specified, then we can evaluate ϕ(h, x) on any unlabeled data point without the need for human annotators which is a desirable property. On the other hand, we usually do not have knowledge over all data points x X; rather, we may only know these explanation constraints over a random sample of k data points SE = {x 1, . . . , x k}. If we do not know the constraint set C(x), it is unclear what satisfying the constraint at an unseen data point x means. Indeed, without additional assumptions, it may not make sense to think about generalization. For example, if there is no relationship between C(x) for different values of x, then it is not possible to infer about C(x) from C(x i) for i = 1, . . . , k. In this case, we could define ϕ(h, x) = 1{g(h, x) C(x)}1{x SE}, where we are only interested in satisfying these explanation constraints over the finite sample SE. For other data points, we have ϕ(h, x) = 0. This guarantees that any model with low empirical explanation loss would also achieve loss expected explanation loss, although this does not have any particular implication on any notion of generalization to new constraints. Regardless, we note that our explanation constraints still reduce the size of the hypothesis class from H to Hϕ,τ, leading to an improvement in sample complexity. The more interesting setting, however, is when we make an additional assumption that the true (unknown) surrogate loss ϕ exists and, during training, we only have access to instances of this surrogate loss evaluated on the sample ϕ( , x i). We can apply a uniform convergence argument to achieve ϕ(h, DX ) ϕ(h, SE) + 2Rk(G) + with probability at least 1 δ over SE, drawn i.i.d. from DX and G = {ϕ(h, )|h H}, k = |SE|. 
Although the complexity term Rk(G) is unknown (since ϕ is unknown), we can upper bound this by the complexity of a class of functions Φ (e.g., neural networks) that is large enough to wellapproximate any ϕ(h, ) G, meaning that Rk(G) Rk(Φ). Comparing to the former case when C(x) is known for all x X apriori, the generalization bound has a term that increases from Rk(G) to Rk(Φ), which may require more explanation-annotated data to guarantee generalization to new data points. We note that the simpler constraints lead to a simpler surrogate loss, which in turn implies a less complex upper bound Φ. This means that simpler constraints are easier to learn. Nonetheless, this is a more realistic setting when explanation constraints are hard to acquire and we do not have the constraints for all data points in X. For example, Ross et al. [33] considers an image classification task on MNIST, and imposes an explanation constraint in terms of penalizing the input gradient of the background of images. In essence, the idea is that the background should be less important than the foreground for the classification task. In general, this constraint does not have a closed-form expression, and we do not even have access to the constraint for unseen data points. However, if we assume that a surrogate loss ϕ(h, ) can be well-approximated by two layer neural networks, then our generalization bound allows us to reason about the ability of model to generalize and ignore background features on new data. C Goodness of an explanation constraint Definition 7 (Goodness of an explanation constraint). For a hypothesis class H, a distribution D and an explanation loss ϕ, the goodness of ϕ with respect to a threshold τ and n labeled examples is: Gn,τ(ϕ, H) = (Rn(H) Rn(Hϕ,τ)) + (err D(h ) err D(h t )) h = arg min h H err D(h), h τ = arg min h Hϕ,τ err D(h). Here, we assume access to infinite explanation data so that εk 0. The goodness depends on the number of labeled examples n and a threshold t. In our definition, a good explanation constraint leads to a reduction in the complexity of H while still containing a classifier with low error. This suggests that the benefits from explanation constraints exhibit diminishing returns as n becomes large. In fact, as n , we have Rn(H) 0, Rn(Hϕ,τ) 0 which implies Gn(ϕ, H) err D(h ) err D(h τ) 0. On the other hand, explanation constraints help when n is small. For t large enough, we expect err D(h ) err D(h τ) to be small, so that our notion of goodness is dominated by the first term: Rn(H) Rn(Hϕ,τ), which has the simple interpretation of reduction in model complexity. D Examples for Generalizable constraints In this section, we look at the Rademacher complexity of G for different explanation constraints to characterize how many samples with explanation constraints are required in order to generalize to satisfying the explanation constraints on unseen data. We remark that this is a different notion of sample complexity; these unlabeled data require annotations of explanation constraints, not standard labels. In practice, this can be easier and less expertise might be necessary if define the surrogate loss ϕ directly. First, we analyze the case where our explanation is given by the gradient of a linear model. Proposition D.1 (Learning a gradient constraint for linear models). Let D be a distribution over Rd. Let H = {h : x 7 wh, x | wh Rd, wh 2 B} be a class of linear models that pass through the origin. Let ϕ(h, x) = θ(wh, wh ) be a surrogate explanation loss. 
Let G = {ϕ(h, ) | h H}, then we have Rn(G) π 2 m. Proof. For a linear separator, ϕ(h, ) is a constant function over X. The Rademacher complexity is given by Rn(G) = E x D sup ϕ(h, ) G i=1 σiϕ(h, xi) sup h H θ(wh, wh ) We compare this with the Rademacher complexity of linear models which is given by Rm(H) B m. The upper bound does not depend on the upper bound on the weight B. In practice, we know that the gradient of a linear model is constant for any data point. This implies that knowing a gradient of a single point is enough to identify the gradient of the linear model. We consider another type of explanation constraint that is given by a noisy model. Here, we could observe either a noisy classifier and noisy regressor, and the constraint could be given by having similar outputs to this noisy model. This is reminiscent of learning with noisy labels [27] or weak supervision [30, 29, 28]. In this case, our explanation g is simply the hypothesis element h itself, and our constraint is on the values that h(x) can take. We first analyze this in the classification setting. Proposition D.2 (Learning a constraint given by a noisy classifier). Let D be a distribution over Rd. Consider a binary classification task with Y = { 1, 1}. Let H be a hypothesis class. Let ϕ(h, x) = 1[h(x) = h (x)] be a surrogate explanation loss. Let G = {ϕ(h, ) | h H}, then we have Rn(G) = E x D sup ϕ(h, ) G i=1 σiϕ(h, xi) i=1 σi(1 h(x)h (x) i=1 σi(h(x)h (x) i=1 σi(h(x) Here, to learn the restriction of G is on the same order of Rn(H). For a given noisy regressor, we observe slightly different upper bound. Proposition D.3 (Learning a constraint given by a noisy regressor). Let D be a distribution over Rd. Consider a regression task with Y = R. Let H be a hypothesis class that h H, h H. Let ϕ(h, x) = |h(x) h (x)| be a surrogate explanation loss. Let G = {ϕ(h, ) | h H}, then we have Rn(G) 2Rn(H). Rn(G) = E x D sup ϕ(h, ) G i=1 σiϕ(h, xi) i=1 σi|h(xi) h (xi)| i=1 σi max(0, h(xi) h (xi)) + 1 i=1 σi max(0, h (xi) h(xi)) i=1 σi max(0, h(xi) h (xi)) i=1 σi max(0, h (xi) h(xi)) i=1 σi(h(xi) h (xi)) i=1 σi(h (xi) h(xi)) where in the last line, we apply Talgrand s lemma A.4 and note that the max function max(0, h(x)) is 1-Lipschitz; in the third line, we note that we break up the supremum as both terms by definition of the max function are non-negative. Then, noting that we do not optimize over h (x), we further simplify this as Rn(G) E x D i=1 σih(xi) i=1 σi( h(xi)) As mentioned before, knowing apriori surrogate loss ϕ might be too strong. In practice, we may only have access to the instances ϕ( , xi) on a set of samples S = {x1, . . . , xk}. We also consider the case when ϕ(h, x) = |h(x) h (x)| when h is unknown and h belongs to a learnable class C. Proposition D.4 (Learning a constraint given by a noisy regressor from some learnable class C). Assume D is a distribution over Rd. Let H and D be hypothesis classes. Let ϕh (h, x) = |h(x) h (x)| be a surrogate explanation loss of a constraint corresponding to h . Let GC = {ϕh (h, )|h H, h C}, then we have Rn(GC) 2Rn(H) + 2Rn(C). Rn(GC) = E x D sup ϕ(h, ) GC i=1 σiϕ(h, xi) sup h H, h C i=1 σi|h(xi) h (xi)| sup h H, h C i=1 σi max(0, h(xi) h (xi)) sup h H, h C i=1 σi max(0, h (xi) h(xi)) sup h H, h C i=1 σi(h(xi) h (xi)) sup h H, h C i=1 σi(h (xi) h(xi)) where the lasts line again holds by an application of Talgrand s lemma. In this case, we indeed are optimizing over h , so we get that Rn(GC) 2 E x D i=1 σi(h(xi)) i=1 σi(h (xi)) = 2Rn(H) + 2Rn(C). 
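As a concrete illustration of the quantities appearing in these propositions, the sketch below estimates the empirical Rademacher complexity of Definition 5 by Monte Carlo over random sign vectors for a norm-bounded linear class, and reports the resulting bound of Proposition D.4 in the special case C = H. The closed-form supremum used here is specific to linear models, and all constants are arbitrary choices for this sketch.

```python
import numpy as np

# For linear h(x) = <w, x> with ||w||_2 <= B, the supremum in Definition 5 has the
# closed form  sup_h (1/n) sum_i sigma_i h(x_i) = (B/n) * || sum_i sigma_i x_i ||_2,
# so the empirical Rademacher complexity can be estimated by averaging over sign draws.
rng = np.random.default_rng(0)
n, d, B, trials = 200, 10, 1.0, 2000
X = rng.normal(size=(n, d))

sigmas = rng.choice([-1.0, 1.0], size=(trials, n))
sup_vals = (B / n) * np.linalg.norm(sigmas @ X, axis=1)  # one supremum per sign vector

R_H = sup_vals.mean()                                    # Monte Carlo estimate of R_S(H)
print("estimated R_S(H):", R_H)                          # roughly B * sqrt(d / n) for Gaussian data
print("bound on R_S(G_C) from Proposition D.4 with C = H:", 2 * R_H + 2 * R_H)
```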
E Proof of Theorem 3.2

We consider the agnostic setting of Theorem 3.2. Here, there are two notions of deviation: one is the deviation in a model's ability to satisfy the explanations, and the other is in a model's ability to generalize to the target function.

Proof. From Rademacher-based uniform convergence, with probability at least $1 - \delta/2$ over $S_E$, for any $h \in H$,
$$ |\phi(h, D) - \phi(h, S_E)| \le 2 R_k(G) + \sqrt{\frac{\log(4/\delta)}{2k}} =: \varepsilon_k. $$
Therefore, with probability at least $1 - \delta/2$, any $h \in H_{\phi,\, t - \varepsilon_k}$ also satisfies $\phi(h, S_E) \le t$, and any $h$ with $\phi(h, S_E) \le t$ satisfies $h \in H_{\phi,\, t + \varepsilon_k}$. In addition, by a uniform convergence bound, with probability at least $1 - \delta/2$, for any $h \in H_{\phi,\, t+\varepsilon_k}$,
$$ |\mathrm{err}_D(h) - \mathrm{err}_S(h)| \le R_n(H_{\phi,\, t+\varepsilon_k}) + \sqrt{\frac{\log(4/\delta)}{2n}}. $$
Now, let $\hat{h}$ be the minimizer of $\mathrm{err}_S(h)$ subject to $\phi(h, S_E) \le t$. By the previous results, with probability at least $1 - \delta$, we have $\hat{h} \in H_{\phi,\, t+\varepsilon_k}$ and
$$ \mathrm{err}_D(\hat{h}) \le \mathrm{err}_S(\hat{h}) + R_n(H_{\phi,\, t+\varepsilon_k}) + \sqrt{\frac{\log(4/\delta)}{2n}} \le \mathrm{err}_S(h^*_{t-\varepsilon_k}) + R_n(H_{\phi,\, t+\varepsilon_k}) + \sqrt{\frac{\log(4/\delta)}{2n}} \le \mathrm{err}_D(h^*_{t-\varepsilon_k}) + 2 R_n(H_{\phi,\, t+\varepsilon_k}) + 2\sqrt{\frac{\log(4/\delta)}{2n}}, $$
where the second inequality holds because $h^*_{t-\varepsilon_k}$ is feasible for the empirical problem, i.e., $\phi(h^*_{t-\varepsilon_k}, S_E) \le t$. $\square$

F EPAC-PAC learnability

We note that in our definition of EPAC learnability, we are only concerned with whether a model achieves a surrogate loss lower than $\tau$. However, the objective of minimizing the EPAC-ERM objective is to achieve both low average error and low surrogate loss. We characterize this property as EPAC-PAC learnability.

Definition 8 (EPAC-PAC learnability). For any $\delta \in (0, 1)$, $\tau > 0$, $\gamma > 0$, the sample complexity of $(\delta, \tau, \gamma)$-EPAC-PAC learning of $H$ with respect to a surrogate loss $\phi$, denoted $m(\delta, \tau, \gamma; H, \phi)$, is defined as the smallest $m \in \mathbb{N}$ for which there exists a learning rule $A$ such that, for every data distribution $D$ over $X \times Y$, with probability at least $1 - \delta$ over $S \sim D^m$,
$$ \phi(A(S), D) \le \inf_{h \in H} \phi(h, D) + \tau \quad\text{and}\quad \mathrm{err}_D(A(S)) \le \inf_{h \in H} \mathrm{err}_D(h) + \gamma. $$
If no such $m$ exists, define $m(\delta, \tau, \gamma; H, \phi) = \infty$. We say that $H$ is EPAC-PAC learnable in the agnostic setting with respect to a surrogate loss $\phi$ if $m(\delta, \tau, \gamma; H, \phi)$ is finite for all $\delta \in (0, 1)$, $\tau > 0$, $\gamma > 0$.

G A Generalization Bound in the Realizable Setting

In this section, we assume that we are in the doubly realizable setting [2], where there exists $h^* \in H$ such that $\mathrm{err}_D(h^*) = 0$ and $\phi(h^*, D) = 0$; that is, the optimal classifier $h^*$ lies in $H$ and also achieves zero expected explanation loss. In this case, we want to output a hypothesis that achieves both zero empirical risk and zero empirical explanation risk.

Theorem G.1 (Generalization bound for the doubly realizable setting). Let $H$ be a hypothesis class, $D$ a distribution, and $\phi$ an explanation loss. Assume that there exists $h^* \in H$ with $\mathrm{err}_D(h^*) = 0$ and $\phi(h^*, D) = 0$. Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be drawn i.i.d. from $D$ and $S_E = \{x'_1, \ldots, x'_k\}$ be drawn i.i.d. from $D_X$. With probability at least $1 - \delta$, for any $h \in H$ with $\mathrm{err}_S(h) = 0$ and $\phi(h, S_E) = 0$, we have
$$ \mathrm{err}_D(h) \le R_n(H_{\phi, \varepsilon_k}) + \sqrt{\frac{\log(4/\delta)}{2n}}, \qquad \varepsilon_k = 2 R_k(G) + \sqrt{\frac{\log(4/\delta)}{2k}}, $$
where $G = \{\phi(h, \cdot) \mid h \in H\}$.

Proof. We first restrict attention to classifiers with low empirical explanation loss and then apply standard supervised learning arguments. From Rademacher-based uniform convergence, with probability at least $1 - \delta/2$ over $S_E$, for any $h \in H$,
$$ \phi(h, D) \le \phi(h, S_E) + 2 R_k(G) + \sqrt{\frac{\log(4/\delta)}{2k}}. $$
Therefore, with probability at least $1 - \delta/2$, any $h \in H$ with $\phi(h, S_E) = 0$ satisfies $h \in H_{\phi, \varepsilon_k}$. Now we apply uniform convergence on $H_{\phi, \varepsilon_k}$: with probability at least $1 - \delta/2$, any $h \in H_{\phi, \varepsilon_k}$ with $\mathrm{err}_S(h) = 0$ satisfies
$$ \mathrm{err}_D(h) \le R_n(H_{\phi, \varepsilon_k}) + \sqrt{\frac{\log(4/\delta)}{2n}}. $$
Therefore, any $h \in H$ with $\phi(h, S_E) = 0$ and $\mathrm{err}_S(h) = 0$ satisfies the desired guarantee. $\square$

We remark that, since our result relies on the underlying techniques of Rademacher complexity, the bound is of order $O(1/\sqrt{n})$. In the (doubly) realizable setting this is somewhat loose, and more refined techniques are required to obtain tighter bounds. We leave this as an interesting direction for future work.
H Rademacher Complexity of Linear Models with a Gradient Constraint

We calculate the empirical Rademacher complexity of a linear model under a gradient constraint.

Figure 7: Illustration of the different values of the function $f(v)$.

Theorem H.1 (Empirical Rademacher complexity of linear models with a gradient constraint). Let $X$ be an instance space in $\mathbb{R}^d$, let $D_X$ be a distribution on $X$, and let $H = \{h : x \mapsto \langle w_h, x\rangle \mid w_h \in \mathbb{R}^d, \|w_h\|_2 \le B\}$ be the class of linear models with weights bounded by some constant $B > 0$ in $\ell_2$ norm. Assume that there exists a constant $C > 0$ such that $\mathbb{E}_{x \sim D_X}[\|x\|_2^2] \le C^2$. Assume that we have an explanation constraint in the form of a gradient constraint: we want the gradient of our linear model to be close to the gradient of some linear model $h^*$. Let $\phi(h, x) = \theta(w_h, w_{h^*})$ be the explanation surrogate loss, where $\theta(u, v)$ is the angle between $u$ and $v$. For any $S = \{x_1, \ldots, x_n\}$ drawn i.i.d. from $D_X$, we have
$$ R_S(H_{\phi,\tau}) = \frac{B}{n}\,\mathbb{E}_\sigma\big[\|v\|\, f(v)\big], $$
where $v = \sum_{i=1}^n \sigma_i x_i$ and
$$ f(v) = \begin{cases} 1 & \text{if } \theta(v, w_{h^*}) \le \tau, \\ \cos(\theta(v, w_{h^*}) - \tau) & \text{if } \tau \le \theta(v, w_{h^*}) \le \frac{\pi}{2} + \tau, \\ 0 & \text{if } \theta(v, w_{h^*}) \ge \frac{\pi}{2} + \tau. \end{cases} $$

The full proof is given at the end of this section. We compare this with the standard bound for linear models, $R_S(H) = \frac{B}{n}\mathbb{E}_\sigma[\|v\|] \le \frac{BC}{\sqrt{n}}$: the gradient constraint shrinks the term inside the expectation by the factor $f(v)$.

Figure 8: Benefits of an explanation constraint also depend on the data distribution. We represent data points $x_i$ with red squares (left). The possible regions for $v = \sum_{i=1}^n \sigma_i x_i$ are the shaded areas (right). When the data is highly correlated with $w_{h^*}$, $v$ lies in a region where $f(v)$ is large (top), and the gradient constraint provides less benefit. When the data is almost orthogonal to $w_{h^*}$, $v$ lies in a region with a small value of $f(v)$ (bottom), which leads to more benefit from the gradient constraint.

The benefits of the explanation constraint depend on the underlying data distribution; in the case of linear models with a gradient constraint, this dependence is through the angle between $v = \sum_{i=1}^n \sigma_i x_i$ and $w_{h^*}$. The explanation constraint reduces the term inside the expectation by a factor $f(v)$ that depends on $\theta(v, w_{h^*})$. When $\theta(v, w_{h^*}) \le \tau$, we have $f(v) = 1$, so there is no reduction. The value of $f(v)$ decreases as $\theta(v, w_{h^*})$ increases and reaches $f(v) = 0$ when $\theta(v, w_{h^*}) \ge \frac{\pi}{2} + \tau$. When the data is concentrated around the direction of $w_{h^*}$, the possible regions for $v$ are close to $w_{h^*}$ or $-w_{h^*}$ (Figure 8, top). The value of $f(v)$ in this region is either 1 or 0, and the reduction is $\frac{1}{2}$ on average; in essence, the gradient constraint of being close to $w_{h^*}$ does not tell us much beyond what the data distribution already provides. On the other hand, when the data points are nearly orthogonal to $w_{h^*}$, the possible regions for $v$ lead to a small $f(v)$ (Figure 8, bottom), which can yield a large reduction in complexity. Intuitively, when the data is nearly orthogonal to $w_{h^*}$, there are many valid linear models, including ones that are not close in angle to $w_{h^*}$; the constraint lets us effectively shrink the class down to linear models that are close to $w_{h^*}$.
Proof (of Theorem H.1). Recall that $H_{\phi,\tau} = \{h : x \mapsto \langle w_h, x\rangle \mid w_h \in \mathbb{R}^d, \|w_h\|_2 \le B, \theta(w_h, w_{h^*}) \le \tau\}$. For a sample $S$, the empirical Rademacher complexity of $H_{\phi,\tau}$ is
$$ R_S(H_{\phi,\tau}) = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|w_h\|_2 \le B,\ \theta(w_h, w_{h^*}) \le \tau} \sum_{i=1}^n \langle w_h, x_i\rangle \sigma_i\Big] = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|w_h\|_2 \le B,\ \theta(w_h, w_{h^*}) \le \tau} \langle w_h, v\rangle\Big], \qquad v = \sum_{i=1}^n \sigma_i x_i. $$
For a vector $w^* \in \mathbb{R}^d$ with $\|w^*\|_2 = 1$ (the direction of $w_{h^*}$) and a vector $v \in \mathbb{R}^d$, we claim the following:
1. If $\theta(v, w^*) \le \tau$, then $\sup_{\|w\| \le B,\ \theta(w, w^*) \le \tau} \langle w, v\rangle = B\|v\|$.
2. If $\theta(v, w^*) \ge \frac{\pi}{2} + \tau$, then $\sup_{\|w\| \le B,\ \theta(w, w^*) \le \tau} \langle w, v\rangle = 0$.
3. If $\tau \le \theta(v, w^*) \le \frac{\pi}{2} + \tau$, then $\sup_{\|w\| \le B,\ \theta(w, w^*) \le \tau} \langle w, v\rangle = B\|v\|\cos(\theta(v, w^*) - \tau)$.
For the first claim, if $\theta(v, w^*) \le \tau$ we can pick $w = B v / \|v\|$, which is feasible and attains the optimal value. For the second claim, we use the fact that $\theta(\cdot, \cdot)$ satisfies a triangle inequality: for any $w$ with $\theta(w, w^*) \le \tau$,
$$ \theta(v, w) \ge \theta(v, w^*) - \theta(w, w^*) \ge \frac{\pi}{2} + \tau - \tau = \frac{\pi}{2}. $$
This implies that $\langle w, v\rangle = \|w\|\,\|v\|\cos(\theta(v, w)) \le 0$ for every feasible $w$, and the supremum value $0$ is attained at $w = 0$. For the third claim, $\langle w, v\rangle$ is maximized when the angle between $v$ and $w$ is smallest. From the triangle inequality above, $\theta(v, w) \ge \theta(v, w^*) - \theta(w, w^*)$, so we take $\theta(w, w^*) = \tau$, the largest allowed value, to obtain the smallest possible angle; the inequality is attained when $v$, $w$, $w^*$ lie in the same plane. Since there is no further restriction on $w$, such a $w$ exists, and
$$ \sup_{\|w\| \le B,\ \theta(w, w^*) \le \tau} \langle w, v\rangle = B\|v\|\cos(\theta(v, w^*) - \tau), $$
as required. One can also obtain a closed-form formula for the maximizer $w$ by solving a quadratic equation. Let $w = B\,\bar{w}/\|\bar{w}\|$ with $\bar{w} = v + \lambda w^*$ for some constant $\lambda > 0$ such that $\theta(\bar{w}, w^*) = \tau$; this gives the equation
$$ \langle v + \lambda w^*, w^*\rangle = \cos(\tau)\,\|v + \lambda w^*\|. $$
Let $\mu = \langle v, w^*\rangle$. Squaring and rearranging,
$$ \mu^2 + 2\mu\lambda + \lambda^2 = \cos^2(\tau)\big(\|v\|^2 + 2\lambda\mu + \lambda^2\big) \;\Longleftrightarrow\; \sin^2(\tau)\lambda^2 + 2\sin^2(\tau)\mu\lambda + \mu^2 - \cos^2(\tau)\|v\|^2 = 0 \;\Longleftrightarrow\; \lambda^2 + 2\mu\lambda + \frac{\mu^2}{\sin^2(\tau)} - \cot^2(\tau)\|v\|^2 = 0. $$
Solving this quadratic equation gives $\lambda = -\mu \pm \cot(\tau)\sqrt{\|v\|^2 - \mu^2}$, and since $\lambda > 0$ we take $\lambda = -\mu + \cot(\tau)\sqrt{\|v\|^2 - \mu^2}$. We then have
$$ \bar{w} = v + \lambda w^* = v - \langle v, w^*\rangle w^* + \cot(\tau)\sqrt{\|v\|^2 - \langle v, w^*\rangle^2}\; w^*. $$
Combining the three claims, we obtain
$$ R_S(H_{\phi,\tau}) = \frac{B}{n}\,\mathbb{E}_\sigma\Big[\|v\|\,\mathbf{1}\{\theta(v, w^*) \le \tau\} + \|v\|\,\mathbf{1}\{\tau \le \theta(v, w^*) \le \tfrac{\pi}{2} + \tau\}\cos(\theta(v, w^*) - \tau)\Big] = \frac{B}{n}\,\mathbb{E}_\sigma\big[\|v\|\, f(v)\big]. \qquad\square $$
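The expectation over $\sigma$ in Theorem H.1 is typically not available in closed form, but it is easy to estimate numerically. The sketch below (our illustration, not from the paper) estimates $R_S(H_{\phi,\tau}) = \frac{B}{n}\mathbb{E}_\sigma[\|v\| f(v)]$ by Monte Carlo over the Rademacher signs and compares it to the unconstrained quantity $\frac{B}{n}\mathbb{E}_\sigma[\|v\|]$, mirroring the discussion around Figure 8; the sample `X`, `B`, `tau`, and `w_star` below are placeholder choices.

```python
import numpy as np

def f_reduction(v, w_star, tau):
    """Reduction factor f(v) from Theorem H.1 (w_star assumed to have unit norm)."""
    cos_angle = v @ w_star / np.linalg.norm(v)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    if angle <= tau:
        return 1.0
    if angle >= np.pi / 2 + tau:
        return 0.0
    return np.cos(angle - tau)

def empirical_rademacher(X, w_star, B, tau, num_draws=5000, seed=0):
    """Monte Carlo estimate of (B/n) E_sigma[ ||v|| f(v) ], with v = sum_i sigma_i x_i."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    constrained, unconstrained = 0.0, 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        v = sigma @ X
        constrained += np.linalg.norm(v) * f_reduction(v, w_star, tau)
        unconstrained += np.linalg.norm(v)
    return B / n * constrained / num_draws, B / n * unconstrained / num_draws

# Placeholder data: points nearly orthogonal to w_star (the favorable case of Figure 8).
rng = np.random.default_rng(1)
w_star = np.zeros(10); w_star[0] = 1.0
X = rng.normal(size=(50, 10)); X[:, 0] *= 0.05
print(empirical_rademacher(X, w_star, B=1.0, tau=0.1))
```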
Theorem H.2 (Rademacher complexity of linear models with a gradient constraint, uniform distribution on the sphere). Let $X$ be an instance space in $\mathbb{R}^d$, let $D_X$ be the uniform distribution on the unit sphere in $\mathbb{R}^d$, and let $H = \{h : x \mapsto \langle w_h, x\rangle \mid w_h \in \mathbb{R}^d, \|w_h\|_2 \le B\}$ be the class of linear models with weights bounded by $B > 0$ in $\ell_2$ norm. Assume the gradient explanation constraint of Theorem H.1, with surrogate loss $\phi(h, x) = \theta(w_h, w_{h^*})$, where $\theta(u, v)$ is the angle between $u$ and $v$. Then
$$ R_n(H_{\phi,\tau}) \le \frac{B}{\sqrt{n}}\Big(\sin(\tau)\, p + \frac{1 - p}{2}\Big), \qquad p := \Pr\Big(\frac{\pi}{2} - \tau \le \theta(v, w_{h^*}) \le \frac{\pi}{2} + \tau\Big). $$

Proof. From Theorem H.1, writing $w^*$ for the direction of $w_{h^*}$ and $v = \sum_{i=1}^n \sigma_i x_i$,
$$ R_n(H_{\phi,\tau}) = \mathbb{E}[R_S(H_{\phi,\tau})] = \frac{B}{n}\,\mathbb{E}_D\mathbb{E}_\sigma\Big[\|v\|\,\mathbf{1}\{\theta(v, w^*) \le \tau\} + \|v\|\,\mathbf{1}\{\tau \le \theta(v, w^*) \le \tfrac{\pi}{2} + \tau\}\cos(\theta(v, w^*) - \tau)\Big] $$
$$ \le \frac{B}{n}\,\mathbb{E}_D\mathbb{E}_\sigma\Big[\|v\|\,\mathbf{1}\{\theta(v, w^*) \le \tfrac{\pi}{2} - \tau\} + \|v\|\,\mathbf{1}\{\tfrac{\pi}{2} - \tau \le \theta(v, w^*) \le \tfrac{\pi}{2} + \tau\}\cos(\theta(v, w^*) - \tau)\Big], $$
where we used $\cos(\theta(v, w^*) - \tau) \le 1$ on the interval $[\tau, \frac{\pi}{2} - \tau]$. Because the $x_i$ are drawn uniformly from the unit sphere, the direction of $v$ is uniform on the sphere, the angle $\theta(v, w^*)$ is symmetric about $\pi/2$, and the distribution of $\|v\|$ given $\theta(v, w^*)$ is the same for all values of $\theta(v, w^*) \in [0, \pi]$. From trigonometry, for any $a \in [0, 2\tau]$,
$$ \cos\Big(\frac{\pi}{2} - 2\tau + a\Big) + \cos\Big(\frac{\pi}{2} - a\Big) = \sin(2\tau - a) + \sin(a) \le 2\sin(\tau). $$
Pairing the angle $\theta = \frac{\pi}{2} - \tau + a$ with its mirror image $\theta = \frac{\pi}{2} + \tau - a$ (which has the same density and the same conditional distribution of $\|v\|$, by symmetry and uniformity), the cosine factor therefore averages to at most $\sin(\tau)$, so
$$ \mathbb{E}_D\mathbb{E}_\sigma\Big[\|v\|\,\mathbf{1}\{\tfrac{\pi}{2} - \tau \le \theta(v, w^*) \le \tfrac{\pi}{2} + \tau\}\cos(\theta(v, w^*) - \tau)\Big] \le \sin(\tau)\,\mathbb{E}_D\mathbb{E}_\sigma\Big[\|v\|\,\mathbf{1}\{\tfrac{\pi}{2} - \tau \le \theta(v, w^*) \le \tfrac{\pi}{2} + \tau\}\Big]. $$
Therefore, using once more the independence of $\|v\|$ and $\theta(v, w^*)$,
$$ R_n(H_{\phi,\tau}) \le \frac{B}{n}\,\mathbb{E}_D\mathbb{E}_\sigma\big[\|v\|\big]\Big(\Pr\big(\theta(v, w^*) \le \tfrac{\pi}{2} - \tau\big) + \sin(\tau)\,\Pr\big(\tfrac{\pi}{2} - \tau \le \theta(v, w^*) \le \tfrac{\pi}{2} + \tau\big)\Big). $$
We bound the first expectation by
$$ \mathbb{E}_D\mathbb{E}_\sigma\big[\|v\|\big] = \mathbb{E}\Big[\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|\Big] \le \sqrt{\mathbb{E}\Big\|\sum_{i=1}^n \sigma_i x_i\Big\|^2} = \sqrt{\sum_{i=1}^n \mathbb{E}\|x_i\|^2} \le C\sqrt{n} = \sqrt{n}, $$
since the data are distributed on the unit sphere, so $C = 1$ is a valid upper bound on $\mathbb{E}_{x\sim D_X}[\|x\|_2^2]$. For the probability terms, the direction of $v$ is distributed as a vector $u$ drawn uniformly from the unit sphere, so
$$ p = \Pr\Big(\frac{\pi}{2} - \tau \le \theta(v, w^*) \le \frac{\pi}{2} + \tau\Big) = \Pr\big(|\langle u, w^*\rangle| \le \sin(\tau)\big). $$
The projection $\langle u, w^*\rangle$ is approximately $N(0, \frac{1}{d})$, so $|\langle u, w^*\rangle|$ follows a folded normal distribution, whose CDF gives $\Pr(|\langle u, w^*\rangle| \le \sin(\tau)) \approx 2\Phi(\sqrt{d}\sin(\tau)) - 1$, where $\Phi$ is the standard normal CDF. By symmetry, $\Pr(\theta(v, w^*) \le \frac{\pi}{2} - \tau) = \Pr(\theta(v, w^*) \ge \frac{\pi}{2} + \tau) = \frac{1 - p}{2}$. Plugging these in yields
$$ R_n(H_{\phi,\tau}) \le \frac{B}{\sqrt{n}}\Big(\sin(\tau)\, p + \frac{1 - p}{2}\Big). \qquad\square $$
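As a quick sanity check on the probability $p$ used in the proof of Theorem H.2, the sketch below (ours, illustrative) compares the empirical frequency of $|\langle u, w^*\rangle| \le \sin(\tau)$ for $u$ drawn uniformly from the sphere against the Gaussian approximation $2\Phi(\sqrt{d}\sin\tau) - 1$; the values of `d` and `tau` are placeholders.

```python
import numpy as np
from math import erf, sqrt, sin

def sphere_projection_prob(d, tau, num_samples=200_000, seed=0):
    """Empirical Pr(|<u, w*>| <= sin(tau)) for u uniform on the unit sphere in R^d."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(num_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return np.mean(np.abs(u[:, 0]) <= sin(tau))  # w* taken as the first basis vector

def gaussian_approximation(d, tau):
    """2 * Phi(sqrt(d) * sin(tau)) - 1, using Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    z = sqrt(d) * sin(tau)
    return erf(z / sqrt(2.0))

d, tau = 100, 0.1
print(sphere_projection_prob(d, tau), gaussian_approximation(d, tau))
```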
I Rademacher Complexity for Two Layer Neural Networks with a Gradient Constraint

Here, we present the full proof of the generalization bound for two layer neural networks with gradient explanations. In our proof, we use two results from Ma [24]: a technical lemma and a bound on the Rademacher complexity of two layer neural networks.

Lemma I.1 ([24]). Consider a set $S = \{x_1, \ldots, x_n\}$ and a hypothesis class $F \subseteq \{f : \mathbb{R}^d \to \mathbb{R}\}$. If $\sup_{f \in F}\sum_{i=1}^n f(x_i)\sigma_i \ge 0$ for any $\sigma_i \in \{\pm 1\}$, $i = 1, \ldots, n$, then
$$ \mathbb{E}_\sigma\Big[\sup_{f \in F}\Big|\sum_{i=1}^n f(x_i)\sigma_i\Big|\Big] \le 2\,\mathbb{E}_\sigma\Big[\sup_{f \in F}\sum_{i=1}^n f(x_i)\sigma_i\Big]. $$

Theorem I.2 (Rademacher complexity for two layer neural networks [24]). Let $X$ be an instance space and $D_X$ a distribution over $X$. Let $H = \{h : x \mapsto \sum_{i=1}^m w_i\sigma(u_i^\top x) \mid w_i \in \mathbb{R}, u_i \in \mathbb{R}^d, \sum_{i=1}^m |w_i|\,\|u_i\|_2 \le B\}$ be the class of two layer neural networks with $m$ hidden nodes and ReLU activation $\sigma(x) = \max(0, x)$. Assume that there exists some constant $C > 0$ such that $\mathbb{E}_{x\sim D_X}[\|x\|_2^2] \le C^2$. Then, for any $S = \{x_1, \ldots, x_n\}$ drawn i.i.d. from $D_X$,
$$ R_S(H) \le \frac{2B}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2} \qquad\text{and}\qquad R_n(H) \le \frac{2BC}{\sqrt{n}}. $$
We defer interested readers to [24] for the full proof of this result. Here, the only requirement on the data distribution is $\mathbb{E}_{x\sim D_X}[\|x\|_2^2] \le C^2$. We now present our result in the setting of two layer neural networks with one hidden node ($m = 1$), to provide clearer intuition for the overall proof.

Theorem I.3 (Rademacher complexity for two layer neural networks ($m = 1$) with gradient constraints). Let $X$ be an instance space and $D_X$ a distribution over $X$. Let $H = \{h : x \mapsto w\sigma(u^\top x) \mid w \in \mathbb{R}, u \in \mathbb{R}^d, |w| \le B, \|u\| = 1\}$, where without loss of generality we take $\|u\| = 1$. Assume that there exists some constant $C > 0$ such that $\mathbb{E}_{x\sim D_X}[\|x\|_2^2] \le C^2$. The explanation constraint is a constraint on the gradient of our models: we want the gradient of the learned model to be close to that of a particular target function $h^* \in H$. Let this be represented by the explanation loss
$$ \phi(h, x) = \|\nabla_x h(x) - \nabla_x h^*(x)\|^2 + \mathbf{1}\{\|\nabla_x h(x) - \nabla_x h^*(x)\| > \tau\} $$
for some $\tau > 0$, and let $h^*(x) = w^*\sigma((u^*)^\top x)$ be the target function. Then
$$ R_n(H_{\phi,\tau}) \le \frac{\tau C}{\sqrt{n}} \ \text{ if } |w^*| > \tau, \qquad R_n(H_{\phi,\tau}) \le \frac{3\tau C}{\sqrt{n}} \ \text{ if } |w^*| \le \tau. $$

Figure 9: Visualization of the piecewise constant function $\nabla_x h(x) - \nabla_x h^*(x)$ over 4 regions.

Proof. Our choice of $\phi(h, x)$ guarantees that, for any $h \in H_{\phi,\tau}$, we have $\|\nabla_x h(x) - \nabla_x h^*(x)\| \le \tau$ almost everywhere. For $h(x) = w\sigma(u^\top x)$, the gradient is $\nabla_x h(x) = wu\,\mathbf{1}\{u^\top x > 0\}$, a piecewise constant function over two regions ($u^\top x > 0$ and $u^\top x \le 0$); see Figure 9. We now consider $\nabla_x h(x) - \nabla_x h^*(x)$, with three possible cases.

Case 1: $0 < \theta(u, u^*) < \pi$. The boundaries of $\nabla_x h(x)$ and $\nabla_x h^*(x)$ differ, so $\nabla_x h(x) - \nabla_x h^*(x)$ is piecewise constant over 4 regions:
$$ \nabla_x h(x) - \nabla_x h^*(x) = \begin{cases} wu - w^*u^* & u^\top x > 0,\ (u^*)^\top x > 0 \\ wu & u^\top x > 0,\ (u^*)^\top x < 0 \\ -w^*u^* & u^\top x < 0,\ (u^*)^\top x > 0 \\ 0 & u^\top x < 0,\ (u^*)^\top x < 0. \end{cases} $$
If each region has probability mass greater than 0, the constraint $\|\nabla_x h(x) - \nabla_x h^*(x)\|_2 \le \tau$ implies $|w| = |w|\,\|u\| \le \tau$, $|w^*| = |w^*|\,\|u^*\| \le \tau$, and $\|wu - w^*u^*\| \le \tau$.

Case 2: $\theta(u, u^*) = 0$. The boundaries of $\nabla_x h(x)$ and $\nabla_x h^*(x)$ coincide, and $\nabla_x h(x) - \nabla_x h^*(x)$ is piecewise constant over two regions:
$$ \nabla_x h(x) - \nabla_x h^*(x) = \begin{cases} wu - w^*u^* & u^\top x > 0 \\ 0 & u^\top x < 0. \end{cases} $$
This gives $|w - w^*| = \|wu - w^*u^*\| \le \tau$.

Case 3: $\theta(u, u^*) = \pi$. The decision boundaries coincide, but the gradients are non-zero on opposite sides:
$$ \nabla_x h(x) - \nabla_x h^*(x) = \begin{cases} wu & u^\top x > 0 \\ -w^*u^* & u^\top x < 0. \end{cases} $$
This gives $|w| \le \tau$ and $|w^*| \le \tau$.

These cases tell us that the constraint $\|\nabla_x h(x) - \nabla_x h^*(x)\| \le \tau$ reduces $H$ to the class of models satisfying either
1. $u = u^*$ and $|w - w^*| \le \tau$, or
2. $u \ne u^*$ and $|w| \le \tau$; this case is only possible when $|w^*| \le \tau$.
If $|w^*| > \tau$, only the first case can occur. Now we can calculate the Rademacher complexity of the restricted class $H_{\phi,\tau}$, again in separate cases.

Case 1: $|w^*| > \tau$. For any $h \in H_{\phi,\tau}$, we have $u = u^*$ and $|w - w^*| \le \tau$. For a sample $S = \{x_1, \ldots, x_n\}$,
$$ R_S(H_{\phi,\tau}) = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{h \in H_{\phi,\tau}} \sum_{i=1}^n h(x_i)\sigma_i\Big] = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{|w - w^*| \le \tau} w\sum_{i=1}^n \sigma((u^*)^\top x_i)\sigma_i\Big] \quad (\text{as } u = u^*). $$
Since $w^* - \tau \le w \le w^* + \tau$, the supremum over $w$ is attained at $w = w^* - \tau$ if $\sum_i \sigma((u^*)^\top x_i)\sigma_i < 0$ and at $w = w^* + \tau$ otherwise. Therefore
$$ \sup_{|w - w^*| \le \tau} w\sum_{i=1}^n \sigma((u^*)^\top x_i)\sigma_i = w^*\sum_{i=1}^n \sigma((u^*)^\top x_i)\sigma_i + \tau\Big|\sum_{i=1}^n \sigma((u^*)^\top x_i)\sigma_i\Big|, $$
and we can calculate the Rademacher complexity as
$$ R_S(H_{\phi,\tau}) = \frac{1}{n}\,\mathbb{E}_\sigma\Big[w^*\sum_i \sigma((u^*)^\top x_i)\sigma_i + \tau\Big|\sum_i \sigma((u^*)^\top x_i)\sigma_i\Big|\Big] = \frac{\tau}{n}\,\mathbb{E}_\sigma\Big|\sum_i \sigma((u^*)^\top x_i)\sigma_i\Big| \le \frac{\tau}{n}\sqrt{\mathbb{E}_\sigma\Big(\sum_i \sigma((u^*)^\top x_i)\sigma_i\Big)^2} \quad (\text{Jensen's inequality}) $$
$$ = \frac{\tau}{n}\sqrt{\sum_i \sigma((u^*)^\top x_i)^2} \quad (\sigma_i, \sigma_j \text{ independent with mean } 0) \qquad \le \frac{\tau}{n}\sqrt{\sum_i \|x_i\|^2}, $$
where the first term vanishes because $\mathbb{E}_\sigma[\sigma_i] = 0$. Combining this with $\mathbb{E}\|x\|^2 \le C^2$, we have
$$ R_n(H_{\phi,\tau}) = \mathbb{E}[R_S(H_{\phi,\tau})] \le \frac{\tau}{n}\,\mathbb{E}\Big[\sqrt{\sum_i \|x_i\|^2}\Big] \le \frac{\tau}{n}\sqrt{\sum_i \mathbb{E}\|x_i\|^2} \le \frac{\tau C}{\sqrt{n}} \quad (\text{Jensen's inequality}). $$

Case 2: $|w^*| \le \tau$. We can write $H_{\phi,\tau} = H^{(1)}_{\phi,\tau} \cup H^{(2)}_{\phi,\tau}$, where
$$ H^{(1)}_{\phi,\tau} = \{h \in H \mid h : x \mapsto w\sigma(u^\top x),\ u = u^*,\ |w - w^*| \le \tau\}, \qquad H^{(2)}_{\phi,\tau} = \{h \in H \mid h : x \mapsto w\sigma(u^\top x),\ \|u\| = 1,\ u \ne u^*,\ |w| \le \tau\}. $$
Then
$$ R_S(H_{\phi,\tau}) = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{h \in H_{\phi,\tau}} \sum_i h(x_i)\sigma_i\Big] \le \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{h \in H^{(1)}_{\phi,\tau}} \sum_i h(x_i)\sigma_i\Big] + \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{h \in H^{(2)}_{\phi,\tau}} \sum_i h(x_i)\sigma_i\Big] = R_S(H^{(1)}_{\phi,\tau}) + R_S(H^{(2)}_{\phi,\tau}). $$
The second step holds because $\sup_{x \in A \cup B} f(x) \le \sup_{x \in A} f(x) + \sup_{x \in B} f(x)$ whenever both suprema are non-negative; both suprema here are non-negative since the value 0 is attainable with $w = 0$. From Case 1, $R_n(H^{(1)}_{\phi,\tau}) \le \frac{\tau C}{\sqrt{n}}$. We also note that $H^{(2)}_{\phi,\tau}$ is a class of two layer neural networks with weight norm bounded by $\tau$, so Theorem I.2 gives $R_n(H^{(2)}_{\phi,\tau}) \le \frac{2\tau C}{\sqrt{n}}$. Therefore, in Case 2, $R_n(H_{\phi,\tau}) \le \frac{3\tau C}{\sqrt{n}}$, as required. $\square$
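The gradient computations underlying Theorems I.3 and I.4 are simple to verify numerically. The sketch below (our illustration, not code from the paper) evaluates $\nabla_x h(x) = \sum_j w_j u_j \mathbf{1}\{u_j^\top x > 0\}$ in closed form for a two layer ReLU network and the corresponding explanation loss $\phi(h, x)$; the particular weights are placeholder choices.

```python
import numpy as np

def two_layer(x, w, U):
    """h(x) = sum_j w_j * relu(u_j . x), with U of shape (m, d) and w of shape (m,)."""
    return w @ np.maximum(U @ x, 0.0)

def grad_two_layer(x, w, U):
    """Closed-form input gradient: sum_j w_j * u_j * 1{u_j . x > 0} (piecewise constant)."""
    active = (U @ x > 0).astype(float)
    return (w * active) @ U

def explanation_loss(x, w, U, w_star, U_star, tau):
    """phi(h, x) = ||grad h(x) - grad h*(x)||^2 + 1{||grad h(x) - grad h*(x)|| > tau}."""
    diff = grad_two_layer(x, w, U) - grad_two_layer(x, w_star, U_star)
    norm = np.linalg.norm(diff)
    return norm ** 2 + float(norm > tau)

# Placeholder example with m = 2 hidden nodes in d = 3 dimensions.
rng = np.random.default_rng(0)
U_star = rng.normal(size=(2, 3)); U_star /= np.linalg.norm(U_star, axis=1, keepdims=True)
w_star = np.array([0.8, -0.5])
U, w = U_star.copy(), w_star + 0.05   # a model whose gradient stays close to h*'s
x = rng.normal(size=3)
print(explanation_loss(x, w, U, w_star, U_star, tau=0.2))
```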
Now, we consider the general setting (i.e., no restriction on $m$).

Theorem I.4 (Rademacher complexity for two layer neural networks with gradient constraints). Let $X$ be an instance space and $D_X$ a distribution over $X$ with large enough support. Let $H = \{h : x \mapsto \sum_{j=1}^m w_j\sigma(u_j^\top x) \mid w_j \in \mathbb{R}, u_j \in \mathbb{R}^d, \|u_j\|_2 = 1, \sum_{j=1}^m |w_j| \le B\}$. Assume that there exists some constant $C > 0$ such that $\mathbb{E}_{x\sim D_X}[\|x\|_2^2] \le C^2$. The explanation constraint is again a constraint on the gradient: we want the gradient of the learned model to be close to that of a particular target function $h^* \in H$, represented by the explanation loss
$$ \phi(h, x) = \|\nabla_x h(x) - \nabla_x h^*(x)\|^2 + \mathbf{1}\{\|\nabla_x h(x) - \nabla_x h^*(x)\| > \tau\} $$
for some $\tau > 0$. Then
$$ R_n(H_{\phi,\tau}) \le \frac{3\tau m C}{\sqrt{n}}, \qquad\text{and more precisely}\qquad R_n(H_{\phi,\tau}) \le \frac{(2m + q)\tau C}{\sqrt{n}}, $$
where $q$ is the number of nodes $j$ of $h^*$ with $|w^*_j| < \tau$.

We note that this result depends on the number of hidden nodes $m$; however, in the general case (Theorem I.2) the value of $B$ is also $O(m)$, as it is a sum over the values of each hidden node. We now present the proof of the more general version of our theorem.

Proof. For simplicity, we first assume that any $h \in H$ has $\|u_j\| = 1$ for all $j$. Consider $h \in H$, written as $h(x) = \sum_{j=1}^m w_j\sigma(u_j^\top x)$, and let $h^*(x) = \sum_{j=1}^m w^*_j\sigma((u^*_j)^\top x)$ be the function defining our gradient constraint. The gradient of a hypothesis $h$ is
$$ \nabla_x h(x) = \sum_{j=1}^m w_j u_j \mathbf{1}\{u_j^\top x > 0\}, $$
a piecewise constant function over at most $2^m$ regions. Then
$$ \nabla_x h(x) - \nabla_x h^*(x) = \sum_{j=1}^m w_j u_j \mathbf{1}\{u_j^\top x > 0\} - \sum_{j=1}^m w^*_j u^*_j \mathbf{1}\{(u^*_j)^\top x > 0\} $$
is piecewise constant over at most $2^{2m}$ regions. We again assume that each of these regions has non-zero probability mass. Our choice of $\phi(h, x)$ guarantees that the norm of the gradient difference in each region is at most $\tau$. As in the case $m = 1$, we show that the gradient constraint leads to a class of functions that either share boundaries with $h^*$ or have weights of small norm.

Assume that among $u_1, \ldots, u_m$ there are $k$ vectors that have the same direction as some of $u^*_1, \ldots, u^*_m$. Without loss of generality, let $u_j = u^*_j$ for $j = 1, \ldots, k$. In this case, $\nabla_x h(x) - \nabla_x h^*(x)$ is piecewise constant over $2^{2m - k}$ regions. As each region has non-zero probability mass, for each $j \in \{1, \ldots, k\}$ there exists $x$ such that $u_j^\top x = (u^*_j)^\top x > 0$, $u_i^\top x < 0$ for $i \ne j$, and $(u^*_i)^\top x < 0$ for $i \ne j$. In other words, we can observe a data point from a region that uniquely isolates the contribution of a particular $(w_j, u_j)$. In this region,
$$ \nabla_x h(x) - \nabla_x h^*(x) = w_j u_j - w^*_j u^*_j = (w_j - w^*_j)u^*_j, $$
and the gradient constraint $\|\nabla_x h(x) - \nabla_x h^*(x)\| \le \tau$ (for almost all $x$) implies $|w_j - w^*_j| \le \tau$ for $j = 1, \ldots, k$. On the other hand, for the remaining $j = k+1, \ldots, m$, there exists $x$ such that $u_j^\top x > 0$, $u_i^\top x < 0$ for $i \ne j$, and $(u^*_i)^\top x < 0$ for $i = 1, \ldots, m$. In such a region, $\nabla_x h(x) - \nabla_x h^*(x) = w_j u_j$, and the constraint implies $|w_j|\,\|u_j\| = |w_j| \le \tau$. Similarly, $|w^*_j|\,\|u^*_j\| = |w^*_j| \le \tau$ for every node of $h^*$ whose boundary is not matched by any node of $h$.

We can conclude that $H_{\phi,\tau}$ is a class of two layer neural networks with $m$ hidden nodes (with $\|u_j\| = 1$) such that each node $w_j\sigma(u_j^\top x)$ satisfies one of the following:
1. there exists $l \in [m]$ with $u_j = u^*_l$ and $|w_j - w^*_l| \le \tau$, or
2. $|w_j| \le \tau$.
We further note that for a node $w^*_l\sigma((u^*_l)^\top x)$ of $h^*$ with a large weight $|w^*_l| > \tau$, there must be a node $w_j\sigma(u_j^\top x)$ in $h$ with the same boundary $u_j = u^*_l$; otherwise we contradict the requirement that $|w^*_l| \le \tau$ for every node of $h^*$ without a matching node in $h$.
We can utilize this characterization of the restricted class $H_{\phi,\tau}$ to bound its Rademacher complexity. Let
$$ H' = \Big\{h : x \mapsto \sum_{j=1}^m w^*_j\sigma((u^*_j)^\top x)\,a_j \ \Big|\ a_j \in \{0, 1\},\ \text{and } a_j = 1 \text{ for every } j \text{ with } |w^*_j| > \tau\Big\}. $$
This is a class of two layer neural networks with at most $m$ nodes such that each node is from $h^*$, with the condition that if the weight of the $j$-th node of $h^*$ exceeds $\tau$, the $j$-th node must be present in every member of this class. Let
$$ H^{(\tau)} = \Big\{h : x \mapsto \sum_{j=1}^m w_j\sigma(u_j^\top x)\,a_j \ \Big|\ w_j \in \mathbb{R},\ u_j \in \mathbb{R}^d,\ |w_j| \le \tau,\ \|u_j\| = 1\Big\} $$
be the class of two layer neural networks with $m$ nodes whose weights are each at most $\tau$. We claim that for any $h \in H_{\phi,\tau}$ there exist $h_1 \in H'$ and $h_2 \in H^{(\tau)}$ such that $h = h_1 + h_2$. For any $h \in H_{\phi,\tau}$, let $p_h : [m] \to [m] \cup \{0\}$ be the function that matches a node of $h$ with the node of $h^*$ that has the same boundary:
$$ p_h(j) = l \ \text{ if } u_j = u^*_l, \qquad p_h(j) = 0 \ \text{ otherwise}. $$
Setting $w^*_0 = 0$ and $u^*_0 = (0, \ldots, 0)$, we can write
$$ h(x) = \sum_{j=1}^m w_j\sigma(u_j^\top x) = \sum_{j=1}^m \Big(w_j\sigma(u_j^\top x) - w^*_{p_h(j)}\sigma\big((u^*_{p_h(j)})^\top x\big)\Big) + \sum_{j=1}^m w^*_{p_h(j)}\sigma\big((u^*_{p_h(j)})^\top x\big) $$
$$ = \underbrace{\sum_{j : p_h(j) \ne 0} (w_j - w^*_{p_h(j)})\,\sigma\big((u^*_{p_h(j)})^\top x\big) + \sum_{j : p_h(j) = 0} w_j\sigma(u_j^\top x)}_{\in\, H^{(\tau)}} \;+\; \underbrace{\sum_{j : p_h(j) \ne 0} w^*_{p_h(j)}\sigma\big((u^*_{p_h(j)})^\top x\big)}_{\in\, H'}. $$
The first term is a member of $H^{(\tau)}$ because $|w_j - w^*_{p_h(j)}| \le \tau$ or $|w_j| \le \tau$ for every node. The second term is a member of $H'$, since for any $l$ with $|w^*_l| > \tau$ there exists $j$ with $p_h(j) = l$. Therefore we can write $h$ as a sum of a member of $H'$ and a member of $H^{(\tau)}$, which implies
$$ R_n(H_{\phi,\tau}) \le R_n(H') + R_n(H^{(\tau)}). $$
From Theorem I.2 (with $B = m\tau$), we have $R_n(H^{(\tau)}) \le \frac{2\tau m C}{\sqrt{n}}$. Now we calculate the Rademacher complexity of $H'$. For $S = \{x_1, \ldots, x_n\}$,
$$ R_S(H') = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{a_j \in \{0,1\}} \sum_{i=1}^n \Big(\sum_{j=1}^m w^*_j\sigma((u^*_j)^\top x_i)\,a_j\Big)\sigma_i\Big] = \frac{1}{n}\,\mathbb{E}_\sigma\Big[\sup_{a_j \in \{0,1\}} \sum_{|w^*_j| < \tau} a_j\Big(w^*_j\sum_{i=1}^n \sigma((u^*_j)^\top x_i)\sigma_i\Big)\Big], $$
since the nodes with $|w^*_j| > \tau$ are present in every member of $H'$ and their contribution has zero mean over $\sigma$. To achieve the supremum, we set $a_j = 1$ if $w^*_j\sum_i \sigma((u^*_j)^\top x_i)\sigma_i > 0$ and $a_j = 0$ otherwise. Therefore
$$ \mathbb{E}_\sigma\Big[\sup_{a_j \in \{0,1\}} \sum_{|w^*_j| < \tau} a_j\Big(w^*_j\sum_i \sigma((u^*_j)^\top x_i)\sigma_i\Big)\Big] = \mathbb{E}_\sigma\Big[\sum_{|w^*_j| < \tau} \sigma\Big(w^*_j\sum_i \sigma((u^*_j)^\top x_i)\sigma_i\Big)\Big] $$
$$ = \mathbb{E}_\sigma\Big[\sum_{|w^*_j| < \tau} \frac{1}{2}\Big(w^*_j\sum_i \sigma((u^*_j)^\top x_i)\sigma_i + \Big|w^*_j\sum_i \sigma((u^*_j)^\top x_i)\sigma_i\Big|\Big)\Big] \qquad \big(\sigma(x) = \tfrac{x + |x|}{2}\big) $$
$$ = \frac{1}{2}\sum_{|w^*_j| < \tau} |w^*_j|\,\mathbb{E}_\sigma\Big|\sum_i \sigma((u^*_j)^\top x_i)\sigma_i\Big| \qquad (\text{the first term has zero mean}) $$
$$ \le \frac{1}{2}\sum_{|w^*_j| < \tau} |w^*_j|\,\mathbb{E}_\sigma\Big[\sup_{\|u\|=1}\Big|\sum_i \sigma(u^\top x_i)\sigma_i\Big|\Big] \le \sum_{|w^*_j| < \tau} |w^*_j|\,\mathbb{E}_\sigma\Big[\sup_{\|u\|=1}\sum_i \sigma(u^\top x_i)\sigma_i\Big] \qquad (\text{Lemma I.1}) $$
$$ \le \sum_{|w^*_j| < \tau} |w^*_j|\,\mathbb{E}_\sigma\Big[\sup_{\|u\|=1}\sum_i \sigma_i u^\top x_i\Big] \qquad (\text{Talagrand's lemma}), $$
where the last expectation is $n$ times the empirical Rademacher complexity of a linear model with unit-norm weights. From Theorem 4.1, we can conclude that
$$ R_n(H') \le \sum_{|w^*_j| < \tau} |w^*_j| \cdot \frac{C}{\sqrt{n}} \le \frac{q\tau C}{\sqrt{n}} \le \frac{m\tau C}{\sqrt{n}}, $$
where $q$ is the number of nodes $j$ of $h^*$ with $|w^*_j| < \tau$. Therefore
$$ R_n(H_{\phi,\tau}) \le \frac{(2m + q)\tau C}{\sqrt{n}} \le \frac{3m\tau C}{\sqrt{n}}. \qquad\square $$

A tighter bound is thus given by $\frac{(2m + q)\tau C}{\sqrt{n}}$. As $\tau \to 0$, we also have $q \to 0$, which implies an upper bound of $\frac{2m\tau C}{\sqrt{n}}$ when $\tau$ is small enough. Compared to the original bound $\frac{2BC}{\sqrt{n}}$, we can do much better whenever $\tau \ll \frac{B}{m}$. We would like to point out that our bound does not depend on the distribution $D$ because we chose a strong explanation loss, $\phi(h, x) = \|\nabla_x h(x) - \nabla_x h^*(x)\|^2 + \mathbf{1}\{\|\nabla_x h(x) - \nabla_x h^*(x)\| > \tau\}$, which guarantees that $\|\nabla_x h(x) - \nabla_x h^*(x)\|_2 \le \tau$ almost everywhere. We also assume a high-dimensional setting ($d \ge m$) in which every partition cell created by $\nabla_x h(x)$ contains points with positive probability density.

J Algorithmic Results for Two Layer Neural Networks with a Gradient Constraint

Now that we have provided generalization bounds for the restricted class of two layer neural networks, we also present an algorithm that can identify the parameters of a two layer neural network (up to a permutation of the weights). In practice, we might instead solve this via our variational objective or other simpler regularized techniques.
However, we also provide a theoretical result on the amount of data required (under some assumptions about the data distribution) and the runtime of an algorithm that exactly recovers the parameters of these networks under gradient constraints. Recall that the gradient of a two layer neural network with ReLU activations can be written as
$$ \nabla_x f_{w,U}(x) = \sum_{i=1}^m w_i u_i \mathbf{1}\{u_i^\top x > 0\}, $$
where we take $\|u_i\| = 1$. Therefore, exact gradient constraints given in the form of pairs $(x, \nabla_x f(x))$ produce a system of equations.

Proposition J.1. If the values of the $u_i$'s are known, we can identify the parameters $w_i$ with exactly $m$ fixed samples.

Proof. We can select $m$ data points such that, for each $i$, exactly the $i$-th indicator in the gradient of the two layer neural network is active. This gives $m$ equations of the form $\nabla_x f_{w,U}(x_i) = w_i u_i$, so we can easily solve for the values of $w_i$, given that $u_i$ is known. $\square$

To make this more general, we now consider the case where the $u_i$'s are not known but are at least linearly independent.

Proposition J.2. Let the $u_i$'s be linearly independent. Assume that each region of the data (when partitioned by the signs of $u_i^\top x$) has non-trivial support $\ge p$. Then, with probability at least $1 - \delta$, we can identify the parameters $w_i, u_i$ with $O\big(2^m + \frac{m + \log(1/\delta)}{\log(1/(1-p))}\big)$ data points and in $O(2^{3m})$ time.

Proof. Partition $X$ into regions corresponding to unique values of the binary vector $(\mathbf{1}\{u_1^\top x > 0\}, \ldots, \mathbf{1}\{u_m^\top x > 0\})$, each of which has probability mass at least $p$ by assumption. First, we compute the probability of observing at least one data point with an explanation from each region of this partition. This is equivalent to sampling from a multinomial distribution with probabilities $(p_1, \ldots, p_{2^m})$, where $p_i \ge p$ for all $i$. Then, by a union bound,
$$ \Pr(\text{observe all regions in } n \text{ draws}) = 1 - \Pr(\exists\, i \text{ s.t. we do not observe region } i) \ge 1 - 2^m(1 - p)^n. $$
Setting this to be at least $1 - \delta$ gives $n \ge \frac{m + \log(1/\delta)}{\log(1/(1-p))}$. Given $O\big(2^m + \frac{m + \log(1/\delta)}{\log(1/(1-p))}\big)$ pairs of data and gradients, we observe at least one pair from each region of the partition. Then, identifying the values of the $u_i$'s and $w_i$'s is equivalent to identifying the data points whose binary activation vector has exactly one indicator equal to 1. These values can be identified in $O(2^{3m})$ time; the procedure is given in Algorithm 1 (Section J.1). $\square$

These results demonstrate that we can indeed learn the parameters (up to a permutation) of a two layer neural network given exact gradient information.

J.1 Algorithm for Identifying Regions

We first note that identifying the parameters $u_i$ and $w_i$ of a two layer neural network is equivalent to identifying the values $\{x_1, \ldots, x_m\}$ from the set $\{\sum_{x \in C} x \mid C \in \mathcal{P}(\{x_1, \ldots, x_m\})\}$, where $\mathcal{P}$ denotes the power set. We also assume that $x_1, \ldots, x_m$ are linearly independent, so we cannot create $x_i$ from any linear combination of the $x_j$ with $j \ne i$. Then, we can identify the set $\{x_1, \ldots, x_m\}$ as in Algorithm 1. The algorithm runs in $O(2^{3m})$ time, as it iterates through each point in $M$ and computes the overlapping set $O$ and the resulting updated sum set $S$, which takes $O(2^{2m})$ time per point. From the resulting set $B$, we can exactly compute the values $u_i$ and $w_i$ up to a permutation.

Algorithm 1 Identifying the parameters of a two layer neural network, given exact gradient constraints
1: Input: $M = \{\sum_{x \in C} x \mid C \in \mathcal{P}(\{x_1, \ldots, x_m\})\}$, with $\{x_1, \ldots, x_m\}$ linearly independent
2: Output: The set of basis elements $\{x_1, \ldots, x_m\}$
3: function
4: $B = \{\}$, $S = \{\}$ {set of candidate basis vectors and set of current sums of at least 2 elements}
5: for $x \in M$ do
6:   if $x \in S$ then
7:     pass
8:   else
9:     $B = B \cup \{x\}$
10:    if $|B| = 2$ then
11:      $S = \{y_1 + y_2\}$, where $B = \{y_1, y_2\}$
12:    else
13:      $S = S \cup \{y + x \mid y \in S\}$ {updating sums from adding $x$}
14:    end if
15:    $O = B \cap S$ {computing overlap between current basis and sums}
16:    $B = B \setminus O$ {removing elements contained in the pairwise sums}
17:    $S = \{y - y_o \mid y \in S, y_o \in O\}$ {updating sums $S$ after removing the set $O$}
18:  end if
19: end for
20: return $B$
21: end function
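The recovery step that Algorithm 1 enables is easy to illustrate. The sketch below (ours, not code from the paper) simulates a two layer ReLU network, finds points whose activation pattern has exactly one active unit, and reads off $(w_i, u_i)$ from $\nabla_x f(x) = w_i u_i$; here the single-activation points are located with an oracle that uses the ground-truth network, standing in for the combinatorial search that Algorithm 1 performs, and all weights are placeholders.

```python
import numpy as np

def grad_relu_net(x, w, U):
    """Exact input gradient of f(x) = sum_i w_i relu(u_i . x): sum_i w_i u_i 1{u_i . x > 0}."""
    return ((U @ x > 0) * w) @ U

# Ground-truth network (placeholder weights); rows of U are unit vectors.
rng = np.random.default_rng(0)
m, d = 3, 6
U = rng.normal(size=(m, d)); U /= np.linalg.norm(U, axis=1, keepdims=True)
w = rng.normal(size=m)

# Oracle step standing in for Algorithm 1: sample until we hit, for each unit,
# a point whose activation pattern has exactly that one unit active.
recovered = []
for i in range(m):
    while True:
        x = rng.normal(size=d)
        active = U @ x > 0
        if active[i] and active.sum() == 1:
            break
    g = grad_relu_net(x, w, U)          # equals w_i * u_i in this region
    w_i = np.linalg.norm(g)             # |w_i|, since ||u_i|| = 1
    u_i = g / w_i
    if u_i @ x <= 0:                    # fix the sign so that u_i . x > 0 in this region
        u_i, w_i = -u_i, -w_i
    recovered.append((w_i, u_i))

print(np.round(sorted(wi for wi, _ in recovered), 3), np.round(np.sort(w), 3))
```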
K Additional Synthetic Experiments

We now present additional synthetic experiments that demonstrate the performance of our approach under settings with imperfect explanations and compare the benefits of using different types of explanations.

K.1 Variational Objective is Better with Noisy Gradient Explanations

Here, we present the remainder of the results from the synthetic regression task of Section 6, under more settings of the noise ϵ added to the gradient explanation. Again, we observe that our method does better than the Lagrangian approach and the self-training method. Under high levels of noise, the Lagrangian method does poorly; on the contrary, our method is resistant to this noise and also outperforms self-training significantly in settings with limited labeled data.

Figure 10: Comparison of MSE on regressing a two layer neural network with explanations of noisy gradients, for ϵ ∈ {0.0001, 0.1, 0.5, 1, 2}. m = 1000, k = 20, λ = 10. For the iterative methods, T = 10. Results are averaged over 5 seeds.
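For reference, the Lagrangian baseline used throughout these comparisons can be sketched as a penalized regression objective: the labeled MSE plus λ times the explanation loss on explanation-annotated points. The sketch below (ours, in PyTorch) is a minimal illustration with noisy gradient explanations; the network size, data, and `noisy_grads` are placeholders, and the exact augmented Lagrangian updates used in our experiments may differ from this simple penalty method.

```python
import torch
import torch.nn as nn

def train_penalized(X_lab, y_lab, X_exp, noisy_grads, lam=10.0, epochs=50, lr=0.01):
    """Minimize MSE(h) + lam * E_x[ ||grad_x h(x) - g_noisy(x)||^2 ] (a penalty-method sketch)."""
    model = nn.Sequential(nn.Linear(X_lab.shape[1], 10), nn.ReLU(), nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        mse = ((model(X_lab).squeeze(-1) - y_lab) ** 2).mean()
        X_exp_ = X_exp.clone().requires_grad_(True)
        grads = torch.autograd.grad(model(X_exp_).sum(), X_exp_, create_graph=True)[0]
        expl = ((grads - noisy_grads) ** 2).sum(dim=1).mean()
        (mse + lam * expl).backward()
        opt.step()
    return model

# Placeholder data: d = 100 inputs, n = 20 labeled points, k = 20 explanation-annotated points.
torch.manual_seed(0)
X_lab, y_lab = torch.randn(20, 100), torch.randn(20)
X_exp, noisy_grads = torch.randn(20, 100), torch.randn(20, 100)
model = train_penalized(X_lab, y_lab, X_exp, noisy_grads)
```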
K.2 Comparing Different Types of Explanations

Here, we present synthetic results comparing different types of explanation constraints. We focus on noisy gradients, as before, and on noisy classifiers, which are used in the setting of weak supervision [30]. We generate our noisy classifiers as h*(x) + ϵ, where ϵ ∼ N(0, σ²). We omit the results of self-training, as it does not use any explanations, and keep the supervised method as a baseline. Here, τ = 0.25. We observe different trends in performance as we vary the amount of noise in the noisy-gradient and noisy-classifier explanations. With any amount of noise and sufficient regularization (λ), the noise influences the overall performance of the methods that incorporate constraints. With few labeled data, using noisy classifiers helps outperform standard supervised learning; with a larger amount of labeled data, it leads to no benefit (if not worse performance for the Lagrangian approach). However, with noisy gradients under small amounts of noise, the restricted class of hypotheses still captures solutions with low error. Therefore, in this case, the Lagrangian approach outperforms standard supervised learning with few labeled data and matches it with sufficient labeled data. Our method outperforms or matches both methods across all settings.

Figure 11: Comparison of MSE on regressing a two layer neural network with explanations as a noisy classifier (top) and noisy gradients (bottom), for ϵ ∈ {0.0001, 0.1, 0.5}. m = 1000, k = 20. For the iterative methods, T = 10. Results are averaged over 5 seeds. ϵ represents the variance of the noise added to the noisy classifier or noisy gradient.

We consider another noisy setting, where noise has been added to the weights of a copy of the target two layer neural network. Here, we compare how this information impacts learning from the direct outputs (noisy classifier) or the gradients (noisy gradients) of that noisy copy (Figure 12).

Figure 12: Comparison of MSE on regressing a two layer neural network with explanations as a noisy classifier (top) and noisy gradients (bottom), for ϵ ∈ {0.3, 0.5, 0.7}. m = 1000, k = 20. For the iterative methods, T = 10. Results are averaged over 5 seeds. ϵ represents the variance of the noise added to the noisy classifier or noisy gradient.

L Additional Baselines

We compare against an additional baseline of a Lagrangian-regularized model + self-training on unlabeled data. We again note that this is not a standard method in practice and does not naturally fit into a theoretical framework, but we present it to compare against a method that uses both explanations and unlabeled data. We observe that our method outperforms this baseline (Figure 13), again especially in settings with limited labeled data. Although this new baseline does initially satisfy the constraints, after even a single round of self-training it no longer satisfies them as closely. This supports the use of our method, which performs multiple rounds of projections onto the set of EPAC models.

Figure 13: Comparison of MSE on regressing a linear model (top left) and a two layer neural network (top right) with gradient explanations; m = 1000, k = 20, T = 2 for the iterative methods; results averaged over 5 seeds. Comparison of classification accuracy on the YouTube dataset (bottom left) and the Yelp dataset (bottom right); m = 500, k = 150; results averaged over 40 seeds.

M Experimental Details

For all of our synthetic and real-world experiments, we use m = 1000, k = 20, T = 3, τ = 0, and λ = 1, unless otherwise noted. For our synthetic experiments, we use d = 100 and σ² = 5. Our two layer neural networks have hidden dimensions of size 10 and are trained with a learning rate of 0.01 for 50 epochs.
We evaluate all networks on a (synthetic) test set of size 2000. For the real-world data, our two layer neural networks have a hidden dimension of size 10 and are trained with a learning rate of 0.1 (YouTube) and 0.1 (Yelp) for 10 epochs. We use λ = 0.01, and the gradient values computed by the smoothed approximation in [?] use c = 1. Test splits from the YouTube and Yelp datasets in the WRENCH benchmark [46] are used as given. We choose the initialization h₀ of our variational algorithm to be the standard supervised model, trained with gradient descent.

N Ablations

We also perform ablation studies in the same regression setting as Section 6, varying parameters that determine either the experimental setting or the hyperparameters of our algorithms.

N.1 Number of Explanation-annotated Data

First, we vary the value of k to illustrate the benefits of our method over the existing baselines. We observe that our variational approach performs much better than the simple augmented Lagrangian method, which in turn does better than supervised learning for sufficiently large values of k. Our approach is always better than the standard supervised approach. We also provide results for how well these methods satisfy the explanations over varying values of k.

Figure 14: Comparison of MSE on regressing a two layer neural network over different amounts of explanation-annotated data k. m = 1000. For the iterative methods, T = 10. Results are averaged over 5 seeds.

N.2 Simpler Teacher Models Can Maintain Good Performance

As noted before, we can use simpler teacher models to be regularized into the explanation-constrained subspace. This can lead to easier optimization problems overall, and we synthetically verify the impact on performance. In this experimental setup, we regress a two layer neural network with a hidden dimension of size 100, which is much larger than in our other synthetic experiments, and we vary over simpler teacher models by changing their hidden dimension size.

Figure 15: Comparison of MSE on regressing a two layer neural network over simpler teacher models (hidden dimensions 2, 4, 6, 8, 10). Here, k = 20, m = 1000, T = 10. Results are averaged over 5 seeds.

We observe no major differences as we shrink the hidden dimension size by a small amount. For significantly smaller hidden dimensions (e.g., 2 or 4), we observe a large drop in performance, as these simpler teachers can no longer fit the approximate projection onto our class of EPAC models accurately. However, slightly smaller networks (e.g., 6 or 8) can fit this projection as well, if not better in some cases. This is a useful finding: the teacher can be a smaller model and achieve comparable results, showing that a simpler teacher can help with scalability without much or any drop in performance.
N.3 Number of Unlabeled Data

Since a main benefit of our approach is the ability to incorporate large amounts of unlabeled data, we provide a study in which we vary the amount of unlabeled data m that is available. We observe that the performance of self-training and of our variational objective improves at similar rates as m grows.

Figure 16: Comparison of MSE on regressing a two layer neural network over different values of m. k = 20, T = 10. Results are averaged over 5 seeds.

N.4 Data Dimension

We also provide ablations over the underlying data dimension d. As we increase d, the methods achieve increasingly similar performance, due to the difficulty of modeling high-dimensional data. Moreover, gradient information is much harder to incorporate here, as the input gradient itself is d-dimensional, so we do not see as much of a benefit from our approach as d grows.

Figure 17: Comparison of MSE on regressing a two layer neural network over different underlying data dimensions d. m = 1000, k = 20. For the iterative methods, T = 10. Results are averaged over 5 seeds.

N.5 Hyperparameters

First, we compare the different approaches over different values of the regularization strength λ towards satisfying the explanation constraints: the augmented Lagrangian approach, the self-training approach, and our variational approach. We observe no significant trend as we change the value of λ across the different methods. Since we know that our explanation here is perfect (our restricted EPAC class contains the target classifier), increasing the value of λ should help until the constraint is met.

Figure 18: Comparison of MSE on regressing a two layer neural network over different values of λ. m = 1000, k = 20. For the iterative methods, T = 10. Results are averaged over 5 seeds.

Next, we compare different hyperparameter settings for our variational approach, analyzing trends as we vary the number of iterations T and the threshold τ before adding the hinge penalty.

Figure 19: Comparison of MSE on regressing a two layer neural network over different values of T (left) and τ (right) in our variational approach. m = 1000, k = 20, λ = 10, T = 10, unless noted otherwise. Results are averaged over 5 seeds.
We note that the value of τ does not significantly impact the performance of our method, while increasing the value of T seems to generally benefit performance on this task.

O Social Impacts

While our proposed method has the potential to improve performance and efficiency in a variety of applications, it could also introduce new biases or reinforce existing biases in the data used to train the model. For example, if the explanation constraints are poorly specified and reflect biased behavior, this could lead to inaccurate or discriminatory predictions, which could have negative impacts on individuals or groups that are already marginalized. It is therefore important that these explanation constraints be properly analyzed and specified so that the model exhibits the desired behavior.