# On Regularization and Inference with Label Constraints

Kaifu Wang¹, Hangfeng He², Tin D. Nguyen³, Piyush Kumar⁴, Dan Roth¹

¹University of Pennsylvania, Philadelphia, PA, USA. ²University of Rochester, Rochester, NY, USA (part of the work done while at the University of Pennsylvania). ³Massachusetts Institute of Technology, Cambridge, MA, USA. ⁴Systems and Technology Research, Woburn, MA, USA. Correspondence to: Piyush Kumar, Dan Roth.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

## Abstract

Prior knowledge and symbolic rules in machine learning are often expressed in the form of label constraints, especially in structured prediction problems. In this work, we compare two common strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference, by quantifying their impact on model performance. For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints. However, its preference for small violations introduces a bias toward a suboptimal model. For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage. Given these differences, we further explore the use of the two approaches together and propose conditions for constrained inference to compensate for the bias introduced by regularization, aiming to improve both the model complexity and the optimal risk.

## 1. Introduction

Domain knowledge in machine learning is often framed as constraints on the output label space. Such label constraints have been widely identified in natural language processing tasks (Roth & Yih, 2004; Martins et al., 2009b; Ning et al., 2018a; Lu et al., 2022) and studied in the context of structured prediction (Punyakanok et al., 2005; Chang et al., 2007; 2008; Ganchev et al., 2010; Chang et al., 2012; Pan et al., 2020). For example, in temporal reasoning (Ning et al., 2018a), where the model is asked to label the relations (*before* or *after*) among a set of events, the assigned labels need to satisfy a transitivity constraint: the facts that an event E1 is after E2 and that E2 is after E3 imply that E1 is after E3. The central question is how to encode such a constraint into a learning algorithm to ensure better performance and generalization of the learned model.

Practitioners have developed two techniques to encode a label constraint in a machine learning pipeline. The first, called regularization with constraints, penalizes a model for its violation of the constraint in addition to the classification loss (Ganchev et al., 2010; Li et al., 2019). The second, called inference with constraints, modifies the prediction rule directly, either by enforcing strictly constrained inference (Roth & Yih, 2004; Scholak et al., 2021) or by balancing the original model's output with the constraint in a soft way (Chang et al., 2008; 2012). Although these two learning algorithms have been shown to be empirically successful, we are not aware of theoretical analyses that elucidate each algorithm's advantages or disadvantages in comparison with the other. Natural questions include: how do these two differ in their impact on the learned model? Moreover, in practice, the constraints could be noisy (Hu et al., 2016; Wang et al., 2021).
In such cases, do they still improve the model performance? If so, by how much?

Focusing on multiclass classification with label constraints, we compare regularization with constraints and constrained inference. For each algorithm, we quantify its optimal risk (aka approximation error) and its generalization gap (aka estimation error). Specifically, in Section 3, we show that regularization with constraints achieves a smaller generalization error by reducing the model complexity, but introduces a bias towards a suboptimal model if the risk minimizer and the violation minimizer do not coincide. In Section 4, we study a broad family of constrained inference models called Constrained Conditional Models (CCM) (Chang et al., 2008; 2012) and point out that constrained inference reduces the risk of a model if and only if the model violates the constraint more than the true data distribution does. This further suggests seeking models with higher violation, which contrasts with the learning objective used in regularization, which discourages violation. Given these contrasts, we further study the combination and interaction of the two methods in Section 5 and describe how constrained inference can compensate for the bias introduced by regularization.

To the best of our knowledge, our analysis is the first to provide a theoretical view on comparing the two approaches. We believe in the importance of this comparison and hope to bring this problem to the attention of the machine learning community. In summary, our contributions include:

1. We provide an error bound (Theorem 3.6) that describes the tradeoff between the generalization gap and the optimal risk when performing regularization with constraints.
2. We propose a sufficient and necessary condition (Theorem 4.3) for constrained inference to improve a model by quantifying its reduction in risk. Based on this, we further argue that constrained inference, when used at training time, implicitly modifies the training objective in the opposite direction to the regularization approach (Proposition 4.6).
3. We study the combination of regularization and constrained inference, and propose sufficient (Theorem 5.1) as well as necessary (Theorem 5.2) conditions for the combined algorithm to achieve improvement in both the optimal risk and the model complexity.

Proofs of all the theoretical results are in the appendix.

## 2. Preliminaries

Our goal is to learn a mapping from the instance space $\mathcal{X}$ to the output space $\mathcal{Y}$. The learner has access to a set of labeled training data $S_L$ of size $m_L$, which contains i.i.d. samples of a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. The marginal distribution of $X$ is denoted as $P_X$. In this work, we assume the ground truth label associated with $x \in \mathcal{X}$ is generated by a deterministic mapping $y_{ora}: \mathcal{X} \to \mathcal{Y}$ (*ora* is short for *oracle*). We also denote the true label as $y_{ora}$ when the context is clear.

**Model.** The scoring class $\mathcal{F}$ contains scoring functions $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. We will also call an $f \in \mathcal{F}$ a classifier. Let $\Delta_{\mathcal{Y}}$ be the $|\mathcal{Y}|$-dimensional probability simplex. Each scoring function induces a probabilistic prediction $P_f(\cdot \mid x) \in \Delta_{\mathcal{Y}}$ by performing softmax inference as $P_f(y \mid x) \propto \exp(f(x, y))$.

**Loss function.** The prediction of $f$ at $x$ is evaluated by the classification error (or $\ell_1$ loss) $L(x, y_{ora}, f) := 1 - P_f(y_{ora} \mid x)$, which is half the $\ell_1$ distance between the one-hot distribution $e_{y_{ora}}$ and $P_f$ on $\Delta_{\mathcal{Y}}$.
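To make these definitions concrete, here is a minimal sketch in Python (NumPy) of softmax inference and the $\ell_1$ loss; the function names and toy scores are ours, not part of the paper.

```python
import numpy as np

def softmax(scores):
    # Softmax inference: turn the score vector f(x, .) into P_f(. | x).
    z = np.exp(scores - scores.max())   # shift for numerical stability
    return z / z.sum()

def l1_loss(scores, y_true):
    # Classification error (l1 loss): L(x, y_ora, f) = 1 - P_f(y_ora | x).
    return 1.0 - softmax(scores)[y_true]

scores = np.array([2.0, 1.0, -0.5])     # toy scores for |Y| = 3 labels
print(l1_loss(scores, y_true=0))        # ~0.31: most mass already on label 0
```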
It can also be viewed as a smoothed version of the standard zero-one loss in the sense that $\lim_{t \to \infty} L(x, y_{ora}, tf) = \mathbb{1}\{\arg\max_{y \in \mathcal{Y}} f(x, y) \neq y_{ora}\}$. More background on the definition of the $\ell_1$ loss is provided in Appendix A. A scoring function $f$ is evaluated by its risk $R(f) := \mathbb{E}[L(x, y_{ora}, f)]$. The empirical estimate of the risk using the labeled examples in $S_L$ is denoted as $\hat{R}(f; S_L)$. We also consider the cross-entropy surrogate loss defined as $L_{CE}(x, y_{ora}, f) := -\log P_f(y_{ora} \mid x)$ and refer to its expectation $R_{CE}(f) = \mathbb{E}[L_{CE}(x, y_{ora}, f)]$ as the cross-entropy risk.

**Label constraint.** A label constraint (or constraint for short) is a deterministic mapping $C: \mathcal{X} \to 2^{\mathcal{Y}} \setminus \{\emptyset\}$. Namely, $C$ maps an instance $x$ to a nonempty subset of $\mathcal{Y}$, which may or may not contain the true label $y_{ora}(x)$. In particular, we say a constraint $C$ is noise-free if $P(y_{ora}(x) \in C(x)) = 1$. Otherwise, $C$ is said to be a noisy constraint and its noise rate is denoted as $V_{ora} := P(y_{ora}(x) \notin C(x))$.

**Violation.** A constraint $C$ is equipped with a violation function, which is an indicator function $v_C(x, y) = \mathbb{1}\{y \notin C(x)\}$. We also overload the notation $v$ and define the violation of a classifier $f$ at an instance $x$ as
$$v_C(x, f) := 1 - P_f(C(x) \mid x) = \sum_{y \notin C(x)} P_f(y \mid x).$$
Its expectation is $V_C(f) := \mathbb{E}[v_C(x, f)]$. We elide the subscript $C$ and write these as $v(x, y)$, $v(x, f)$ and $V(f)$ when the context is clear. Similar to the classification error, we consider a cross-entropy surrogate of the violation function defined as $v_{CE}(x, f) := -\log P_f(C(x) \mid x)$ and its expectation $V_{CE}(f) = \mathbb{E}[v_{CE}(x, f)]$.

**Rademacher complexity.** We use the following version of Rademacher complexity, adopted from Cortes et al. (2016), to characterize the generalization ability of the scoring space of multiclass classifiers $\mathcal{F}$:

**Definition 2.1** (Rademacher complexity of $\mathcal{F}$). The empirical Rademacher complexity of a scoring class $\mathcal{F}$ with respect to a set $S = \{x_i\}_{i=1}^m$ that contains $m$ samples of the instance is defined as
$$\hat{\mathfrak{R}}_m(\mathcal{F}; S) := \frac{1}{m}\, \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sum_{y \in \mathcal{Y}} \epsilon_{i,y} f(x_i, y)\right] \tag{1}$$
where $\epsilon = (\epsilon_{i,y})_{i \in [m], y \in \mathcal{Y}}$ are independent Rademacher random variables, each of which is uniformly distributed over $\{-1, +1\}$. The Rademacher complexity of the scoring class $\mathcal{F}$ is the expectation of the empirical version:
$$\mathfrak{R}_m(\mathcal{F}) := \mathbb{E}_{S \sim P_X^m}\big[\hat{\mathfrak{R}}_m(\mathcal{F}; S)\big] \tag{2}$$

This definition of Rademacher complexity is a special case of the factor graph complexity proposed by Cortes et al. (2016), which is defined for more general structured prediction models. It is hence possible to extend our results on generalization bounds to structured models by replacing the Rademacher complexity with the factor graph complexity. In this work, we focus on multiclass classifiers for simplicity of presentation.

## 3. Regularization with Constraints

In a standard machine learning algorithm, the learner receives a set of labeled data $S_L \in \bigcup_{m=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^m$ and finds the empirical risk minimizer, defined as $\arg\min_{f \in \mathcal{F}} \hat{R}(f; S_L)$. In this section, we consider a method that modifies this learning objective by adding a regularization term defined with the constraint $C$. Precisely, we consider minimizing an augmented objective defined as
$$L_\rho(f) := R(f) + \rho V(f)$$
where $\rho \geq 0$ is a fixed tradeoff parameter. The idea of regularizing the model by adding a penalty for the violation of the constraints on an unlabeled dataset is widely adopted in the literature. In particular, the cross-entropy violation is known as the semantic loss (Xu et al., 2018) in the context of logical constraints.
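As an illustration, the following sketch (ours, in NumPy; not code from the paper) evaluates the pointwise regularized objective with the cross-entropy surrogates, i.e., the semantic-loss-style penalty just mentioned.

```python
import numpy as np

def log_softmax(scores):
    z = scores - scores.max()
    return z - np.log(np.exp(z).sum())

def ce_loss(scores, y_true):
    # L_CE(x, y_ora, f) = -log P_f(y_ora | x)
    return -log_softmax(scores)[y_true]

def ce_violation(scores, allowed):
    # v_CE(x, f) = -log P_f(C(x) | x): the semantic-loss-style penalty.
    log_p = log_softmax(scores)
    return -np.log(np.exp(log_p)[list(allowed)].sum())

def pointwise_objective(scores, y_true, allowed, rho):
    # One term of the cross-entropy surrogate of L_rho(f) = R(f) + rho V(f);
    # averaging over labeled/unlabeled data gives the full objective.
    return ce_loss(scores, y_true) + rho * ce_violation(scores, allowed)

print(pointwise_objective(np.array([2.0, 1.0, -0.5]), 0, {0, 1}, rho=0.5))
```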
Other designs of the regularization term include using the KL-divergence on the probability space in the posterior regularization algorithm (Ganchev et al., 2010) and using the t-norms from fuzzy logic (Li et al., 2019). We will show that this algorithm improves the generalization error by reducing the complexity of the scoring space (Theorem 3.6), but in general leads to a larger classification risk in the long run (Proposition 3.2), thus resulting in a tradeoff between estimation and approximation errors.

### 3.1. Semi-supervised Regularization with Constraints

We consider a semi-supervised approach where the learner has access to an unlabeled dataset $S_U$ that contains $m_U$ independent samples of the instance $X$, resulting in the following definition.

**Definition 3.1** (ERVM). Given a labeled dataset $S_L$ of size $m_L$ and an unlabeled dataset $S_U$ of size $m_U$, a scoring space $\mathcal{F}$ and a tradeoff parameter $\rho \geq 0$, we define and denote the empirical risk and violation minimizer (ERVM) as:
$$\hat{f}_\rho(S_L, S_U) := \arg\min_{f \in \mathcal{F}}\; \frac{1}{m_L} \sum_{(x,y) \in S_L} L(x, y, f) + \frac{\rho}{m_U} \sum_{x \in S_U} v_C(x, f) \tag{3}$$
We also denote the expected version as:
$$f_\rho := \arg\min_{f \in \mathcal{F}}\; R(f) + \rho V_C(f). \tag{4}$$

For example, with our notation, $\hat{f}_0$ is the ERM and $f_\infty$ is the minimizer of the expected violation. Notice that the minimizer in general is non-unique. Therefore, when we state any proposition related to $f_\rho$ or $\hat{f}_\rho$, we mean the proposition holds for any of the minimizers.

### 3.2. Deviation from the Optimal Risk

In this section, we study how the risk of the minimizer $f_\rho$ deviates from the optimal risk in $\mathcal{F}$. The reason we are interested in bounding $R(f_\rho)$ is that in general the minimizer $f_\rho$ is non-unique, and different minimizers may have different values of risk. Therefore, to describe the risk of ERVM in the long run (in Theorem 3.6), we provide an upper bound for all the possible risks of $f_\rho$.

**Proposition 3.2** (Deviation from the optimal risk). For any constraint $C$ and $\rho \geq 0$, the following holds:
$$R(f_0) \leq R(f_\rho) \leq R(f_0) + \rho\, (V(f_0) - V(f_\infty)). \tag{5}$$
The same relation also holds for the empirical estimates $\hat{R}$ and $\hat{V}$. Moreover, for any $\rho > 0$, there exists a scoring space and data distribution such that the RHS can be reached, even with a noise-free constraint $C$.

This result shows that the minimizer of the regularized objective in general has a suboptimal risk over $\mathcal{F}$. On the other hand, if the risk minimizer is simultaneously a violation minimizer, i.e., $V(f_0) = V(f_\infty)$, this relation implies consistency, i.e., $R(f_\rho) = R(f_0)$. The quantity $V(f_0)$ can be small when the noise rate $V_{ora}$ is small and the model is expressive enough (e.g., a deep neural net) to approximate the true model.

### 3.3. Generalization Bounds

Now we discuss how regularization can reduce the complexity of the hypothesis class. The first step is to show that the violation of the target hypothesis is not too large. In particular, the following bound is a direct consequence of minimizing the regularized objective:

**Lemma 3.3** (Regularization implies small violation). Let $f_\rho$ be the minimizer of the regularized objective defined in (4). If the minimum violation in $\mathcal{F}$ is upper bounded by a known constant $u \geq 0$, i.e., $V(f_\infty) \leq u$, then $V(f_\rho) \leq 1/\rho + u$.

The upper bound $u$ can be made arbitrarily small by adding a baseline model defined as $f_t(x, y) = t \cdot \mathbb{1}\{y \in C(x)\}$ and driving $t$ to infinity. This construction is possible because the mapping $C$ is known to the learner. The benefits of knowing $C$ will be further explored in Section 4, when we discuss inference with constraints.
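Before moving on, here is a minimal sketch (ours; NumPy) of how the empirical ERVM objective in Definition 3.1 would be evaluated, with the labeled and unlabeled sums made explicit; `f` and `C` are assumed to be user-supplied callables.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def ervm_objective(f, C, labeled, unlabeled, rho):
    # Empirical risk and violation of Definition 3.1:
    #   (1/m_L) * sum_{(x,y) in S_L} L(x, y, f)
    # + (rho/m_U) * sum_{x in S_U} v_C(x, f)
    risk = np.mean([1.0 - softmax(f(x))[y] for x, y in labeled])
    violation = np.mean([1.0 - softmax(f(x))[list(C(x))].sum()
                         for x in unlabeled])
    return risk + rho * violation

# Toy usage: a fixed linear scorer over 3 labels and a constant constraint.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
f = lambda x: W @ x
C = lambda x: {0, 1}
labeled = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
unlabeled = [np.array([0.5, 0.5])]
print(ervm_objective(f, C, labeled, unlabeled, rho=0.5))
```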
For any $B \geq 0$, let $\mathcal{F}_B := \{f \in \mathcal{F} \mid V(f) \leq B\}$ be the set of classifiers with small violation. From the above discussion, we know that the target hypothesis $f_\rho$ lies in the smaller space $\mathcal{F}_{u + 1/\rho}$, which is characterized by the violation function and hence can be identified with unlabeled data alone. To this end, we describe how the violation as well as the risk can be estimated from data.

**Lemma 3.4** (Generalization gap of $\ell_1$ loss and violation). Given a labeled dataset $S_L$ of size $m_L$, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds uniformly for $f \in \mathcal{F}$:
$$R(f) \leq \hat{R}(f; S_L) + \mathfrak{R}_{m_L}(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2 m_L}} \tag{6}$$
Given an unlabeled dataset $S_U$ of size $m_U$, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds uniformly for $f \in \mathcal{F}$:
$$V(f) \leq \hat{V}(f; S_U) + \mathfrak{R}_{m_U}(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{7}$$

The proof of this result relies on a contraction lemma established in Cortes et al. (2016), which was used to analyze argmax inference with margin losses. Our analysis extends their results to softmax inference, which may be of independent interest. Furthermore, if the size of the constrained set $C(x)$ is a constant, namely $|C(x)| = c_0 < c = |\mathcal{Y}|$ for all $x \in \mathcal{X}$, then the Rademacher complexity term in (7) can be improved to $\frac{1}{2}\sqrt{1/c_0 + 1/(c - c_0)}\; \mathfrak{R}_{m_U}(\mathcal{F})$ (see the discussion in the proof). This term is symmetric under the transformation $c_0 \mapsto c - c_0$, due to the fact that estimating the violation $V_C$ of a constraint $C$ is equivalent to estimating $V_{\mathcal{Y} \setminus C}$. In particular, when $c_0 < c/2$, if the constraint is more restrictive and informative (so that $c_0$ is small), it can be more difficult to estimate the violation.

**Remark 3.5** (Relation to cross-entropy loss). Assuming $\lim_{m \to \infty} \mathfrak{R}_m(\mathcal{F}) = 0$, this result implies that $L_\rho$ can be approximated by its empirical version $\hat{L}_\rho$ with a sufficient amount of data. On the other hand, since $\hat{L}_\rho$ is upper bounded by its cross-entropy surrogate $\hat{R}_{CE} + \rho \hat{V}_{CE}$, we further have that
$$L_\rho(f) \leq \hat{R}_{CE}(f; S_L) + \rho \hat{V}_{CE}(f; S_U) + o_{m_L, m_U}(1) \tag{8}$$
where $o_{m_L, m_U}(1)$ converges to $0$ as $m_L, m_U \to \infty$. Therefore, in practice, one can minimize this upper bound by solving the convex surrogate problem
$$\min_{f \in \mathcal{F}}\; \hat{R}_{CE}(f; S_L) + \rho \hat{V}_{CE}(f; S_U) \tag{9}$$
where $\hat{R}_{CE}(f; S_L)$ and $\hat{V}_{CE}(f; S_U)$ are the empirical averages of the cross-entropy loss and violation.

Finally, using these results, we bound the risk of the classifier learned by ERVM. For simplicity, we denote the generalization gap
$$B(\delta, m, \mathcal{F}) := \mathfrak{R}_m(\mathcal{F}) + 2\sqrt{\frac{\log(1/\delta)}{2m}}$$

**Theorem 3.6** (Error bound). With probability at least $1 - 6\delta$,
$$R(\hat{f}_\rho) \leq R(f_0) + \rho V(f_0) - \rho V(f_\infty) + \mathfrak{R}_{m_L}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + \rho\, \mathfrak{R}_{m_U}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + 2\sqrt{\frac{\log(1/\delta)}{2 m_L}} + 2\rho \sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{10}$$
where $\mathfrak{R}_m(\cdot)$ is the Rademacher complexity defined in (2).

*Proof sketch.* First, we show that $\hat{f}_\rho$ and $f_\rho$ both lie in the subspace $\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}$ with high probability, since the violation can be well-approximated, according to Lemma 3.4. Then, the gap between the objectives $L_\rho(f_\rho)$ and $L_\rho(\hat{f}_\rho)$ is controlled by the Rademacher complexity of $\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}$. Finally, we use the inequalities established in Proposition 3.2 to further upper bound the term $L_\rho(f_\rho)$ using the risk and violation of $f_0$.

Using the same proof technique, this result can be extended to other choices of loss function as long as: (a) the loss is bounded, so that the optimal regularized model has a small violation, as in Lemma 3.3; and (b) the loss is Lipschitz in the model scores, so that a generalization bound associated with the loss holds, as in Lemma 3.4.
**Reducing the generalization gap.** The bound (10) contains three parts: the first three terms give the worst risk that can be achieved by $f_\rho$, as described in Proposition 3.2; the two complexity terms measure the capacity of the set of classifiers that have a small violation; and the remaining terms are errors independent of the model. This bound is most preferable when a large set of unlabeled data is available, so that the approximation errors of the violation (i.e., the terms $B(\delta, m_U, \mathcal{F})$, $\rho\,\mathfrak{R}_{m_U}(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})})$ and $2\rho\sqrt{\log(1/\delta)/(2 m_U)}$) are all small. Then, the model complexity is mainly described by the term $\mathfrak{R}_{m_L}(\mathcal{F}_{1/\rho + u})$, which is the Rademacher complexity of a proper subset of $\mathcal{F}$. In this sense, the regularization method reduces the generalization gap by reducing the model complexity of the scoring space.

**Tradeoff in regularization.** In situations where $m_U$ is large, the tradeoff parameter $\rho$ balances two quantities: a larger $\rho$ leads to a smaller scoring space $\mathcal{F}_{1/\rho + u}$, but brings more bias depending on the suboptimality of $f_0$ in violation, measured by $V(f_0) - V(f_\infty)$. The benefit of regularization is greater if fewer classifiers can achieve a violation that is close to the optimal value $V(f_\infty)$. We provide the following example to illustrate how the Rademacher complexity can be reduced in linear models.

**Example 3.7** (Logistic regression). Consider a linear model for multiclass classification where $\mathcal{Y} = [c]$ and $f(x, j) = w_j^T x$ with $\sum_{j=1}^c \|w_j\|_2^2 \leq 1$. Suppose $x \in \mathbb{R}^p$ is distributed in the unit ball $\|x\|_2 \leq 1$ with expectation $\mathbb{E}[x] = \alpha \in \mathbb{R}^p$ and covariance matrix $\sigma^2 I_{p \times p}$. Without constraint, the Rademacher complexity is upper bounded as $\mathfrak{R}_m(\mathcal{F}) \leq \sqrt{c/m}$, as in Cortes et al. (2016, Theorem 2). Now, consider a constraint that removes exactly one label, so that $C(x) \equiv [c-1]$. With regularization, for sufficiently small $t < 1/(c+2)$, the subset $\mathcal{F}_t$ of classifiers with violation at most $t$ satisfies
$$\mathfrak{R}_m(\mathcal{F}_t) \leq \frac{1}{2}\left(\sqrt{\frac{c}{m}} + \sqrt{\frac{c - \sigma^2 - \|\alpha\|_2^2}{m}}\right) \tag{11}$$
which is strictly tighter than the standard bound. Intuitively, if $x$ is concentrated around the origin, the prediction of any classifier tends to be a uniform distribution. Therefore, a large bias and variance in $x$ (captured by $\sigma^2 + \|\alpha\|_2^2$) help to distinguish models with different levels of violation.

**Comparison to existing results.** Previous works mostly consider a zero-one loss for both classification and violation, under the assumption that the risk minimizer also achieves zero violation. Then, one can simply preclude all the classifiers $f \in \mathcal{F}$ that have nonzero empirical violations on the unlabeled dataset and find the ERM among the remaining classifiers. This approach has been theoretically studied by Balcan & Blum (2005; 2010) for binary classification and, in a similar manner, by Tulabandhula & Rudin (2014) for regression, by characterizing the complexity of the reduced set of hypotheses that achieve zero violation. Conceptually, we can regard this algorithm as a special case of problem (4) with $\rho = \infty$. Our study therefore extends previous works with a soft learning objective to multiclass classification problems.
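Definition 2.1 can be estimated by Monte Carlo when the inner supremum has a closed form. The sketch below (our illustration, not the paper's code) does this for the unconstrained linear class of Example 3.7, where Cauchy-Schwarz gives the supremum exactly; comparing the estimate with the $\sqrt{c/m}$ bound is a simple sanity check.

```python
import numpy as np

def rademacher_linear(X, c, n_trials=2000, seed=0):
    # Monte Carlo estimate of Definition 2.1 for the linear class
    # { f(x, j) = w_j^T x : sum_j ||w_j||_2^2 <= 1 } of Example 3.7.
    # With xi_j = sum_i eps_{ij} x_i, Cauchy-Schwarz gives
    #   sup_w sum_{i,j} eps_{ij} w_j^T x_i = ||(xi_1, ..., xi_c)||_2.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=(m, c))
        xi = eps.T @ X                     # row j holds xi_j
        vals.append(np.linalg.norm(xi))
    return np.mean(vals) / m

X = np.random.default_rng(1).normal(size=(200, 5))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x|| <= 1
c = 3
print(rademacher_linear(X, c), np.sqrt(c / X.shape[0]))  # estimate vs. bound
```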
## 4. Inference with Constraints

An inference algorithm is a mapping $\mathcal{F} \times \mathcal{X} \to \Delta_{\mathcal{Y}}$. By default, we define it as softmax inference: $(f, x) \mapsto P_f(\cdot \mid x)$. When performing inference with constraints (or constrained inference), we modify this softmax mapping for the given function $f$ using the additional information of $C$. In this section, we study the Constrained Conditional Model (CCM) (Chang et al., 2008; 2012), a broad family of models that perform inference with constraints.

We show that at testing time, whether CCM reduces the risk depends on whether the model's expected violation is larger than the noise rate $V_{ora}$ of the constraint (Theorem 4.3). In particular, when the constraint is noise-free, CCM always achieves a smaller or equal risk. Furthermore, we show that better risks are achieved if constrained inference is also performed at training time, and pursuing this optimal risk leads to a learning objective that contrasts with the one used in the regularization approach (Proposition 4.6). To distinguish the two, we will refer to a model in the original space $\mathcal{F}$ as a base model and to an augmented model as a constrained model.

### 4.1. Constrained Conditional Model

CCM augments an existing scoring function through a linear combination with the violation function. Precisely, given a vanilla scoring space $\mathcal{F}$, the scoring space of CCM is defined as follows.

**Definition 4.1** (Constrained conditional model (Chang et al., 2008; 2012)). Given a scoring space $\mathcal{F}$, a constraint $C$ and a fixed tradeoff parameter $\mu \in [0, \infty]$, the scoring space of the Constrained Conditional Model (CCM) is defined as:
$$\mathcal{F}^\mu := \{(x, y) \mapsto f(x, y) - \mu\, v_C(x, y) \mid f \in \mathcal{F}\} \tag{12}$$
We will also denote by
$$f^\mu(x, y) := f(x, y) - \mu\, v_C(x, y) \tag{13}$$
the augmented scoring function for a given $f \in \mathcal{F}$.

In particular, setting $\mu = \infty$ assigns a score of $-\infty$ to any $y \notin C(x)$, which implies $P_{f^\infty}(y \mid x) = 0$, namely forcing strictly-constrained inference. A finite tradeoff parameter $\mu$ allows CCM to improve the base model $f$ despite noisy constraints, as we will discuss in detail in the following sections. If the noise rate is large, performing strictly-constrained inference can be harmful, because it assigns zero probability mass to any label $y$ outside $C(x)$ and hence incurs a classification loss $L(x, y_{ora}, f^\infty) = 1$ at any $x$ where $y_{ora} \notin C(x)$.

The learner can choose whether or not to perform constrained inference at training time. This choice leads to the following two approaches:

1. **On-training approach:** perform constrained inference both at training and testing time, and directly find the ERM over $\mathcal{F}^\mu$ using labeled data (also known as Inference Based Training in Punyakanok et al. (2005)).
2. **Post-training approach:** first find the ERM over the vanilla $\mathcal{F}$ using labeled data, and then perform constrained inference at testing time (also known as Learning Plus Inference in Punyakanok et al. (2005)).

For both approaches, the generalization ability of CCM is characterized by the complexity of $\mathcal{F}^\mu$. So, we first point out that CCM does not increase the Rademacher complexity.

**Proposition 4.2** (Rademacher complexity of CCM). For any fixed $\mu \geq 0$ and $m \in \mathbb{N}$, we have the following identity:
$$\mathfrak{R}_m(\mathcal{F}^\mu) = \mathfrak{R}_m(\mathcal{F}) \tag{14}$$

### 4.2. Post-training Constrained Inference

For a given and fixed classifier $f$ (presumably trained with data), how does performing constrained inference impact the model performance? In this section, we study the change in risk when the learner chooses to augment $f$ as a CCM $f^\mu$ defined in (13). It is most convenient to characterize the risk of a CCM using the cross-entropy loss, although we will also conduct the same analysis for the hinge and $\ell_1$ losses, as we point out later. To start with, for any $f$ and $\mu \in [0, \infty]$, we let
$$\Delta^\mu_{CE}(f) := R_{CE}(f) - R_{CE}(f^\mu) \tag{15}$$
be the difference in risk between the base model and the CCM (the larger the better).
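A minimal sketch (ours; NumPy) of CCM inference per Definition 4.1: the base scores are shifted by $-\mu$ on labels outside $C(x)$, and `mu=np.inf` recovers strictly-constrained inference.

```python
import numpy as np

def ccm_scores(scores, allowed, mu):
    # f^mu(x, y) = f(x, y) - mu * v_C(x, y), as in (13).
    penalty = np.array([0.0 if y in allowed else mu
                        for y in range(len(scores))])
    return scores - penalty

def ccm_softmax(scores, allowed, mu):
    # Softmax inference with the augmented scores. With mu = np.inf,
    # labels outside C(x) get score -inf and hence probability 0.
    s = ccm_scores(scores, allowed, mu)
    z = np.exp(s - s.max())
    return z / z.sum()

scores = np.array([2.0, 1.0, -0.5])
print(ccm_softmax(scores, allowed={0, 1}, mu=1.0))     # soft CCM
print(ccm_softmax(scores, allowed={0, 1}, mu=np.inf))  # strict inference
```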
**Theorem 4.3** (Change in cross-entropy risk). We have:
(a) For any fixed model $f$, there exists a $\mu_0 > 0$ such that $R_{CE}(f^{\mu_0}) < R_{CE}(f)$ if and only if
$$V(f) > V_{ora} \tag{16}$$
(b) The change in risk can be lower bounded as
$$\Delta^\mu_{CE}(f) \geq V(f)(1 - e^{-\mu}) - \mu V_{ora} \tag{17}$$
(c) In particular, if the constraint is noise-free, we have
$$\Delta^\infty_{CE}(f) = V_{CE}(f) \tag{18}$$

The first result describes the sufficient and necessary condition for constrained inference to be helpful. It requires $f$ to have a larger violation (measured by $\ell_1$ violation) than the true data on average, so that it has the potential to be improved. This condition is easier to satisfy when the constraint is less noisy. The second result further quantifies the risk reduction as an explicit function of $\mu$. The last result shows that in the noise-free case, the maximum risk reduction is exactly the expected violation measured by cross-entropy. Its consequences will be further discussed in the next section.

**Remark 4.4** (CCM with alternative losses). We present the counterparts of Theorem 4.3 for the hinge loss and the $\ell_1$ loss in Appendix D. The information delivered by those results is consistent with Theorem 4.3 in the sense that (1) whether CCM can reduce the risk depends on a comparison between the violation of the original model and that of the oracle, and (2) the reduction can be described or lower bounded by some measure of the violation. The drawback of the hinge loss is its non-smoothness due to the discontinuity of argmax inference. The drawback of the $\ell_1$ loss is that the range of $\mu$ such that $R(f^\mu) \leq R(f)$ can be disconnected and difficult to describe. Therefore, we provide weaker results by deriving only sufficient or necessary conditions for CCM to reduce those risks.

As an application of Theorem 4.3, we derive a sufficient condition under which CCM achieves smaller risks.

**Corollary 4.5** (Choice of $\mu$). Assuming $V(f) \geq V_{ora}$, then $R_{CE}(f^\mu) \leq R_{CE}(f)$ if the following condition holds:
$$\mu \leq W(-\eta/e^\eta) + \eta \tag{19}$$
where $\eta := V(f)/V_{ora}$ is the relative violation rate and $W$ is the Lambert W function, whose value $W(t)$ is defined to be the solution in $w$ of the equation $w e^w = t$. The RHS of (19) increases with $\eta$ and vanishes as $\eta \to 1$. In particular, when the constraint is noise-free, one should encourage strictly-constrained inference and set $\mu = \infty$. We also provide a plot of the RHS in the proof in the appendix.

### 4.3. On-training Constrained Inference

In this subsection, we study the on-training approach, where we perform constrained inference both at training and testing time. We use the results established in the last subsection to describe the learning objective of the on-training approach, and argue that it achieves better risks than the post-training approach. Based on this, we further show that minimizing the cross-entropy over CCM encourages a large violation of the base model, which contrasts with the learning objective (9) used in regularization.

We provide a simplified analysis for the noise-free setting, where we choose $\mu = \infty$ and perform strictly-constrained inference. Then, the on-training approach aims to find the optimal (in terms of cross-entropy) base model as follows:
$$f_{on} := \arg\min_{f \in \mathcal{F}}\; R_{CE}(f^\infty) \tag{20}$$
(recall that $f^\infty$ means performing strictly-constrained inference with $f$). We characterize the behavior of $f_{on}$ with the following result, which is a direct corollary of Theorem 4.3.

**Proposition 4.6** (Learning optimal CCM; post-training vs. on-training). Assuming $C$ is noise-free, we can reformulate the learning objective (20) as
$$f_{on} = \arg\min_{f \in \mathcal{F}}\; R_{CE}(f) - V_{CE}(f) \tag{21}$$
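The threshold in (19) is easy to evaluate with SciPy's `lambertw`; the helper below is ours, written under the statement of Corollary 4.5 above.

```python
import numpy as np
from scipy.special import lambertw

def max_safe_mu(V_f, V_ora):
    # Largest mu satisfying (19): mu <= W(-eta / e^eta) + eta,
    # with eta = V(f) / V_ora the relative violation rate.
    if V_ora == 0.0:
        return np.inf          # noise-free: use strict inference, mu = inf
    eta = V_f / V_ora
    if eta <= 1.0:
        return 0.0             # no mu > 0 is guaranteed to help
    return float(np.real(lambertw(-eta * np.exp(-eta)))) + eta

for eta in [1.5, 2.0, 5.0]:
    print(eta, max_safe_mu(eta, 1.0))   # the RHS of (19) grows with eta
```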
**A fundamental difference.** Surprisingly, the reformulated learning objective (21) is opposite to the surrogate regularized objective defined in (9) in its attitude towards violations. This contrast suggests a fundamental difference between regularization and constrained inference: the regularization method views violation as a bad thing and precludes classifiers with substantial violations. But constrained inference corrects a model from its violation, so a large violation means a great potential to be improved.

**On-training vs. post-training.** Loosely speaking, this result also suggests that, in general, the best constrained model is not the constrained best model. To be more precise, suppose we perform post-training constrained inference for the cross-entropy risk minimizer in the vanilla model, i.e., $f_{post} := \arg\min_{f \in \mathcal{F}} R_{CE}(f)$. Then, we can reformulate the definition of $f_{post}$ as
$$f_{post} := \arg\min_{f \in \mathcal{F}}\; \underbrace{\big(R_{CE}(f) - V_{CE}(f)\big)}_{\text{objective in (21), post-inference risk}} + V_{CE}(f) \tag{22}$$
which can be regarded as a regularized version of (21). Therefore, similar to Proposition 3.2, we can argue that the risk minimizer $f_{post}$ over $\mathcal{F}$, as a base model of CCM, carries a bias towards a higher risk than the on-training method's, as follows:
$$R_{CE}(f_{on}^\infty) \leq R_{CE}(f_{post}^\infty) \leq R_{CE}(f_{on}) - \min_{f \in \mathcal{F}} V_{CE}(f) \tag{23}$$
The proof is included in the proof of Proposition 4.6.

**Computational considerations.** In practical structured prediction problems where the output is sequential or graphical, performing constrained inference at training time is typically expensive due to the complexity of the constraints. For example, as pointed out by Xu et al. (2018), when the constraint is defined by a logical expression over several output variables, computing the probability of the constraint being satisfied corresponds to the problem of weighted model counting (WMC) and is #P-complete (Roth, 1996). Therefore, to implement the on-training approach in practice, one can alternatively use approximate inference to ensure tractability. For example, strictly-constrained inference, formulated as Integer Linear Programming (Roth & Yih, 2004), can be further relaxed to Linear Programming (Martins et al., 2009a). Another example is amortized inference (Chang et al., 2015), which accelerates convergence to the optimal model while performing exact inference only once every $\tau > 1$ iterations.

**Comparison to existing results.** There has been limited theoretical work discussing the impact of performing constrained inference. The most related is Punyakanok et al. (2005), which derives VC-style generalization bounds for linear structured models to argue that (1) performing strictly-constrained inference in a post-training manner (Learning Plus Inference in that paper) improves the model performance, and (2) the on-training approach (Inference Based Training in that paper) further reduces the error in the long run. Our approach directly analyzes the classification risk and extends the comparison to noisy constraints and soft-constrained inference with CCM.
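The reformulations (21) and (22) both rest on the identity $R_{CE}(f^\infty) = R_{CE}(f) - V_{CE}(f)$ for noise-free constraints (Theorem 4.3 (c)). A quick numerical sanity check of its pointwise version (our sketch; NumPy):

```python
import numpy as np

def log_softmax(s):
    z = s - s.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(1)
scores = rng.normal(size=5)
allowed = np.array([0, 2, 3])   # C(x); noise-free, so y_ora lies inside
y_ora = 2

log_p = log_softmax(scores)
ce = -log_p[y_ora]                                  # L_CE(x, y_ora, f)
v_ce = -np.log(np.exp(log_p)[allowed].sum())        # v_CE(x, f)
strict = np.where(np.isin(np.arange(5), allowed), scores, -np.inf)
ce_strict = -log_softmax(strict)[y_ora]             # loss under f^infty
assert np.isclose(ce_strict, ce - v_ce)             # Theorem 4.3 (c), pointwise
print(ce, v_ce, ce_strict)
```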
*Figure 1. A summary of the established results, as motivations for the problems of this section. In Section 5.1, we describe how CCM can reduce the additional risk introduced by regularization. In Section 5.2, we argue that the decrease of violation under regularization can make post-training constrained inference unnecessary.*

## 5. Regularization with Constrained Inference

We have seen that regularization and constrained inference have different impacts on the generalization gap and the risk. On one hand, CCM has a Rademacher complexity equal to that of the original model (Proposition 4.2), which can be reduced by regularization. So, applying the regularized algorithm to CCM also reduces the generalization gap. On the other hand, their impacts on the risk are contradicting, as summarized in Figure 1. In this section, we aim to describe how these impacts interact with each other by applying our established results to explore the usage of the two methods together.

We show both positive and negative results for the combination. On one hand, we propose sufficient conditions under which the bias introduced by regularization can be compensated by performing constrained inference (Theorem 5.1). On the other hand, we study whether post-training constrained inference can reduce the risk of the optimal classifier $f_\rho$. We show that with a noisy constraint, choosing a large value of $\rho$ in the regularized objective (4) makes CCM incapable of reducing the risk (Theorem 5.2).

### 5.1. CCM Compensates for Regularization Bias

As the red part of Figure 1 summarizes, we have shown that regularization and constrained inference have contradicting influences on the risk. Moreover, the regularization bias is controlled by the violation of the risk minimizer (Proposition 3.2), which can be reduced by constrained inference. This suggests the possibility for CCM to reduce the additional risk introduced by regularization.

We formally describe this phenomenon by considering the following combination: an on-training approach that aims to find the minimizer of the following regularized surrogate objective over the CCM $\mathcal{F}^\mu$:
$$f^\mu_\rho := \arg\min_{g \in \mathcal{F}^\mu}\; R_{CE}(g) + \rho V_{CE}(g) \tag{24}$$
Recall that $R_{CE}(f_{post})$ is the minimum cross-entropy risk that can be achieved in $\mathcal{F}$. We show that, unlike the vanilla regularized objective (4), it is possible for this algorithm to achieve a smaller risk than $R_{CE}(f_{post})$, as follows.

**Theorem 5.1** (Regularization with on-training constrained inference). If CCM improves $f_{post}$ so that $\Delta^\mu_{CE}(f_{post}) > 0$, then letting
$$\rho < \frac{V_{CE}(f_{post}) - \mu V_{ora}}{V_{CE}(f_{post}^\mu)} - 1 \tag{25}$$
will imply $R_{CE}(f^\mu_\rho) < R_{CE}(f_{post})$.

This result shows that a small choice of $\rho$ allows the regularized optimizer $f^\mu_\rho$ to achieve a better cross-entropy risk. A less noisy constraint allows more choices of $\rho$ to make this happen. In particular, when the constraint is noise-free, since $V_{CE}(f_{post}^\mu) \to 0$ as $\mu \to \infty$, driving $\mu$ to $\infty$ makes $R_{CE}(f^\mu_\rho) < R_{CE}(f_{post})$ for all $\rho > 0$. As a cost, regularization will be less effective in reducing the Rademacher complexity with a large value of $\mu$. In the extreme case, all the classifiers in $\mathcal{F}^\infty$ make zero violation, and hence cannot be distinguished by the regularization objective.
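Under our reconstruction of condition (25), the admissible range of $\rho$ is a one-line computation; the four inputs would be estimated from data in practice (the helper below is illustrative only, with hypothetical values).

```python
import numpy as np

def rho_upper_bound(v_ce_post, v_ce_post_mu, mu, v_ora):
    # Evaluates the RHS of (25); any rho below this value guarantees
    # R_CE(f_rho^mu) < R_CE(f_post) by Theorem 5.1.
    if v_ce_post_mu == 0.0:
        return np.inf    # noise-free with mu -> inf: every rho > 0 works
    return (v_ce_post - mu * v_ora) / v_ce_post_mu - 1.0

print(rho_upper_bound(v_ce_post=0.8, v_ce_post_mu=0.1, mu=2.0, v_ora=0.05))
```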
### 5.2. Post-regularized-training Constrained Inference

Finally, as the blue part of Figure 1 summarizes, we have shown that post-training inference is beneficial only if the average violation of $f$ is larger than $V_{ora}$ (Theorem 4.3). However, the minimizer $f_\rho$ of the regularized objective tends to have a small violation (Lemma 3.3), scaled with $1/\rho$. Therefore, it is possible that choosing a large value of $\rho$ will make post-training inference incapable of reducing the risk when the constraint is noisy. Formally, assuming a model is already trained with the vanilla regularized $\ell_1$ objective as in (4), we have the following.

**Theorem 5.2** (When post-training inference is not helpful for a regularized model). Recall that $V(f_\infty)$ is the minimal expected violation that can be achieved in $\mathcal{F}$. If
$$V_{ora} \geq V(f_\infty) \quad \text{and} \quad \rho \geq \frac{1}{V_{ora} - V(f_\infty)} \tag{26}$$
then the minimizer $f_\rho$ of the regularized objective (4) will not be improved by post-training constrained inference for any $\mu \in (0, \infty]$, in the sense that $R_{CE}(f_\rho) \leq R_{CE}((f_\rho)^\mu)$.

The RHS of (26) shrinks with a larger noise rate $V_{ora}$ and a smaller $V(f_\infty)$. Intuitively, a more noisy constraint is less helpful (Theorem 4.3), while a small value of $V(f_\infty)$ allows $f_\rho$ to violate less (Lemma 3.3) and hence gain less benefit from constrained inference (Theorem 4.3). As a consequence, with a noisy constraint, choosing a large $\rho$ in the regularized objective will make post-training constrained inference unnecessary or even harmful.

## 6. Related Works

**Regularization with constraints.** In the context of structured prediction, the Posterior Regularization (PR) framework (Ganchev et al., 2010) proposed to regularize the log-likelihood by adding a distance of the probabilistic prediction to the constrained subspace of distributions. The CoDL algorithm (Chang et al., 2007; 2012) is a semi-supervised algorithm that repeatedly assigns constrained pseudo-labels to the unlabeled dataset and uses the pseudo-labels to retrain the model. CoDL and PR are further unified in Samdani et al. (2012) as special cases of a parameterized EM algorithm. More recent works have proposed injecting logical constraints into deep models by augmenting the training objective with explicitly defined violation functions, such as the semantic loss (Xu et al., 2018), the DL2 loss (Fischer et al., 2019) and the inconsistency loss (Li et al., 2019), which motivate our theoretical formulation in (4).

**Inference with constraints.** The idea of injecting prior knowledge directly into a predictive model dates back to Roth & Yih (2004), which formulates the problem of inference with hard constraints as Integer Linear Programming (ILP). The idea of constrained inference has been followed and developed by NLP researchers and empirically shown to be effective in various problems such as summarization (Clarke & Lapata, 2008), temporal reasoning (Ning et al., 2018b), semantic parsing (Scholak et al., 2021) and text generation (Lu et al., 2022). Chang et al. (2008; 2012) further define the CCM to incorporate soft constraints into linear models. Another related work is Enrique Sucar et al. (2014), which uses Bayesian networks to model label correlations and defines an order on the labels; the order information is then taken as extended features at inference time. Theoretically, Punyakanok et al. (2005) provides a comparison between on-training and post-training constrained inference using VC-style error bounds.

**Semi-supervised learning theory.** Several theoretical semi-supervised learning frameworks, such as Balcan & Blum (2005; 2010) and Tulabandhula & Rudin (2014), illustrate how hard constraints on the hypothesis space can reduce the generalization error. A detailed comparison can be seen in the discussion at the end of Section 3.

**Learning with partial labels.** The problem of learning with constraints is closely related to the problem of learning from partial labels (also known as superset labels) (Cour et al., 2011; Liu & Dietterich, 2014; Cid-Sueiro, 2012; Cabannes et al., 2020), where each instance $x$ in the dataset is assigned a partial label $s$ which also takes value in $2^{\mathcal{Y}}$. The difference is that the constraint mapping itself is known to the learner and hence can be encoded in the inference algorithm directly, for example, via the CCM.
Another difference is that partial labels are typically more informative and can guarantee learnability alone (Liu & Dietterich, 2014; Wang et al., 2020). In contrast, the constraints that appear in practice typically provide only side information and need to be used together with gold labels.

## 7. Conclusion and Future Works

In this paper, we presented a theoretical study of two methods to encode label constraints into a learning system: regularization and constrained inference. We compared these two approaches by quantifying their impact on the optimal risk as well as on the generalization error. Our study revealed that the success of these two approaches relies on different data assumptions: the regularization method requires the optimal classifier in the model to have a small violation, while constrained inference requires the true data to have a small violation. We further elucidated the detrimental consequences that arise when these assumptions fail to hold. Finally, we demonstrated how their impacts on the model can interact when the two are used together.

We have focused on multiclass classification, aiming to provide a starting point for understanding the different mechanisms of the two methods. For future work, we will extend the discussion to structured prediction problems where complex constraints are naturally defined. In particular, while the presence of constraints can improve the model performance, it also suggests a strong dependency inside the structure, which may hurt the generalization performance, as pointed out by London et al. (2016).

## Acknowledgements

This work was partially supported by Contract FA8750-19-2-0201 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. This work was also partially sponsored by the Army Research Office and was accomplished under Grant Number W911NF-20-1-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also partially funded by ONR Contract N00014-19-1-2620.

## References

Balcan, M.-F. and Blum, A. A PAC-style model for learning from labeled and unlabeled data. In Proc. of the ACM Conference on Computational Learning Theory (COLT), 2005.

Balcan, M.-F. and Blum, A. A discriminative model for semi-supervised learning. Journal of the ACM (JACM), 57:1–46, 2010.

Cabannes, V., Rudi, A., and Bach, F. Structured prediction with partial labelling through the infimum loss. In ICML, pp. 1230–1239, 2020.

Chang, K.-W., Upadhyay, S., Kundu, G., and Roth, D. Structural learning with amortized inference. In Proc. of the Conference on Artificial Intelligence (AAAI), 2015. URL http://cogcomp.org/papers/CUKR15.pdf.

Chang, M.-W., Ratinov, L., and Roth, D. Guiding semi-supervision with constraint-driven learning. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 280–287, Prague, Czech Republic, 6 2007. URL http://cogcomp.org/papers/ChangRaRo07.pdf.

Chang, M.-W., Ratinov, L., Rizzolo, N., and Roth, D. Learning and inference with constraints. In Proc. of the Conference on Artificial Intelligence (AAAI), 7 2008. URL http://cogcomp.org/papers/CRRR08.pdf.
Chang, M.-W., Ratinov, L., and Roth, D. Structured learning with constrained conditional models. Machine Learning, 88(3):399–431, 6 2012. URL http://cogcomp.org/papers/ChangRaRo12.pdf.

Cid-Sueiro, J. Proper losses for learning from partial labels. In NeurIPS, pp. 1574–1582, 2012.

Clarke, J. and Lapata, M. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research (JAIR), 2008.

Cortes, C., Kuznetsov, V., Mohri, M., and Yang, S. Structured prediction theory based on factor graph complexity. In NIPS, 2016.

Cour, T., Sapp, B., and Taskar, B. Learning from partial labels. Journal of Machine Learning Research, 12:1501–1536, 2011.

Enrique Sucar, L., Bielza, C., Morales, E. F., Hernandez-Leal, P., Zaragoza, J. H., and Larrañaga, P. Multi-label classification with Bayesian network-based chain classifiers. Pattern Recognition Letters, 41:14–22, 2014. URL https://www.sciencedirect.com/science/article/pii/S0167865513004303.

Fischer, M., Balunovic, M., Drachsler-Cohen, D., Gehr, T., Zhang, C., and Vechev, M. T. DL2: Training and querying neural networks with logic. In International Conference on Machine Learning, 2019.

Ganchev, K., Graça, J., Gillenwater, J., and Taskar, B. Posterior regularization for structured latent variable models. Journal of Machine Learning Research (JMLR), 2010.

Hu, Z., Ma, X., Liu, Z., Hovy, E., and Xing, E. Harnessing deep neural networks with logic rules. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2410–2420, Berlin, Germany, August 2016. URL https://aclanthology.org/P16-1228.

Li, T., Gupta, V., Mehta, M., and Srikumar, V. A logic-driven framework for consistency of neural models. arXiv, abs/1909.00126, 2019.

Liu, L.-P. and Dietterich, T. G. Learnability of the superset label learning problem. In ICML, pp. 1629–1637, 2014.

London, B., Huang, B., and Getoor, L. Stability and generalization in structured prediction. Journal of Machine Learning Research, 17(221):1–52, 2016. URL http://jmlr.org/papers/v17/15-501.html.

Lu, X., Welleck, S., West, P., Jiang, L., Kasai, J., Khashabi, D., Le Bras, R., Qin, L., Yu, Y., Zellers, R., Smith, N. A., and Choi, Y. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 780–799, Seattle, United States, July 2022. URL https://aclanthology.org/2022.naacl-main.57.

Martins, A., Smith, N., and Xing, E. Concise integer linear programming formulations for dependency parsing. In Proc. of ACL-IJCNLP, pp. 342–350, Suntec, Singapore, August 2009a. URL https://aclanthology.org/P09-1039.

Martins, A., Smith, N. A., and Xing, E. Concise integer linear programming formulations for dependency parsing. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2009b.
Meshi, O., London, B., Weller, A., and Sontag, D. Train and test tightness of LP relaxations in structured prediction. Journal of Machine Learning Research, 20(13):1–34, 2019. URL http://jmlr.org/papers/v20/17-535.html.

Ning, Q., Feng, Z., Wu, H., and Roth, D. Joint reasoning for temporal and causal relations. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2278–2288, Melbourne, Australia, 7 2018a. URL http://cogcomp.org/papers/NingFeWuRo18.pdf.

Ning, Q., Wu, H., and Roth, D. A multi-axis annotation scheme for event temporal relations. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1318–1328, Melbourne, Australia, 7 2018b. URL http://cogcomp.org/papers/NingWuRo18.pdf.

Pan, X., Mehta, M., and Srikumar, V. Learning constraints for structured prediction using rectifier networks. arXiv, abs/2006.01209, 2020.

Punyakanok, V., Roth, D., Yih, W., and Zimak, D. Learning and inference over constrained output. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1124–1129, 2005. URL http://cogcomp.org/papers/PRYZ05.pdf.

Roth, D. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2):273–302, 4 1996. URL http://cogcomp.org/papers/hardJ.pdf.

Roth, D. and Yih, W. A linear programming formulation for global inference in natural language tasks. In Proc. of the Conference on Computational Natural Language Learning (CoNLL), pp. 1–8, 2004. URL http://cogcomp.org/papers/RothYi04.pdf.

Samdani, R., Chang, M.-W., and Roth, D. Unified expectation maximization. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 6 2012. URL http://cogcomp.org/papers/SamdaniChRo12.pdf.

Scholak, T., Schucher, N., and Bahdanau, D. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9895–9901, Online and Punta Cana, Dominican Republic, November 2021. URL https://aclanthology.org/2021.emnlp-main.779.

Tulabandhula, T. and Rudin, C. Generalization bounds for learning with linear, polygonal, quadratic and conic side knowledge. Machine Learning, 100:183–216, 2014.

Wang, H., Zhang, H., Chen, M., and Roth, D. Learning constraints and descriptive segmentation for subevent detection. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. URL https://cogcomp.seas.upenn.edu/papers/WZCR21.pdf.

Wang, K., Ning, Q., and Roth, D. Learnability with indirect supervision signals. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2020. URL https://cogcomp.seas.upenn.edu/papers/WangNiRo20.pdf.

Xu, J., Zhang, Z., Friedman, T., Liang, Y., and Van den Broeck, G. A semantic loss function for deep learning with symbolic knowledge. In ICML, pp. 5502–5511, 2018.
## Appendix A. Details on Loss Function

The $\ell_1$ loss is a smoothed alternative to the zero-one loss and has been used in theoretical analyses of the generalization error; see, for example, London et al. (2016, Section 6.2). It can be related to other common loss functions as follows.

**As distances on the probability simplex.** Let $e_y \in \mathbb{R}^{|\mathcal{Y}|}$ be the one-hot vector whose $y$th coordinate is 1 and all others are 0. We then have
$$L(x, y_{ora}, f) := 1 - P_f(y_{ora} \mid x) = \frac{1}{2}\|e_{y_{ora}} - P_f\|_1$$
Moreover, since our label space $\mathcal{Y}$ is of finite cardinality, we further have $\frac{1}{2}\|e_{y_{ora}} - P_f\|_1 = \mathrm{TV}(e_{y_{ora}}, P_f)$, the total variation distance.

**Relation to zero-one loss.** By introducing a temperature parameter $t \in \mathbb{R}_{\geq 0}$ in the softmax function, it is well known that $\lim_{t \to \infty} \mathrm{softmax}(tu) = \arg\max(u)$ for a vector $u$. This implies
$$\lim_{t \to \infty} L(x, y_{ora}, tf) = 1 - \mathbb{1}\{\arg\max_{y \in \mathcal{Y}} f(x, y) = y_{ora}\} = \mathbb{1}\{\arg\max_{y \in \mathcal{Y}} f(x, y) \neq y_{ora}\}$$
which is the zero-one loss. Since performing softmax inference with temperature $t$ can be equivalently regarded as performing softmax inference for the scoring space $t\mathcal{F}$, for simplicity of presentation, we omit the temperature parameter in the softmax inference.

**Relation to cross-entropy.** The total variation distance to a one-hot probability can be lower bounded via cross-entropy due to Pinsker's inequality. More directly, in our case, we have $1 - p \leq -\log(p)$ for any $p \in [0, 1]$ by a basic inequality. This implies $L(x, y, f) \leq L_{CE}(x, y, f)$.

In conclusion, the $\ell_1$ loss is an $\ell_1$ and total variation distance on the probability space, is a smoothed version of the zero-one loss, and is upper bounded by the cross-entropy. It is differentiable and bounded, so we can derive generalization bounds with Rademacher complexity. Another reason we are interested in softmax inference will become clearer in the discussion of constrained inference, where in Theorems 4.3, D.1 and D.4 the change of the expected cross-entropy and $\ell_1$ losses can be lower bounded by a smooth function. With argmax inference, in contrast, the risk is in general not continuous and needs to be assumed to be Lipschitz to obtain similar results.

## B. Proofs from Section 3

### B.1. Proof of Proposition 3.2

The first inequality is straightforward. For the second inequality, by definition (4) we have
$$R(f_\rho) + \rho V(f_\rho) \leq R(f_0) + \rho V(f_0) \quad \text{and} \quad V(f_\rho) \geq V(f_\infty).$$
Combining the two inequalities above yields $R(f_\rho) + \rho V(f_\infty) \leq R(f_0) + \rho V(f_0)$. The desired inequality follows by rearranging these terms. This argument also holds if we replace the expectations with empirical estimates.

To see how the RHS bound can be reached, consider the following scoring space that contains two classifiers, $f_0$ and $f_\infty$, and an instance space $\mathcal{X}$ that contains only one point $x$. Let $C(x) = \{y_{ora}, y'\}$. Let $f_0$ be such that $P_{f_0}(y_{ora}) = a \in (0, 1)$ and $P_{f_0}(y') = b$. Let $f_\infty$ be such that $P_{f_\infty}(y_{ora}) = a - \epsilon_1$ and $P_{f_\infty}(y') = b + \epsilon_2$, with $\epsilon_1 < \rho \epsilon_2$. Then
$$R(f_\infty) + \rho V(f_\infty) \leq 1 - (a - \epsilon_1) + \rho(b - \epsilon_2) < 1 - a + \rho b = R(f_0) + \rho V(f_0) \tag{27}$$
which means $f_\infty$ is preferred to $f_0$ by the regularized objective.
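A numeric instance of the construction above, with hypothetical values chosen by us to satisfy $\epsilon_1 < \rho\epsilon_2$:

```python
# Two-classifier example from the proof above, with hypothetical numbers.
a, b = 0.5, 0.2                    # P_{f0}(y_ora) = a, P_{f0}(y') = b
eps1, eps2, rho = 0.01, 0.1, 0.5   # eps1 < rho * eps2
obj_f0 = (1 - a) + rho * b
obj_finf = (1 - (a - eps1)) + rho * (b - eps2)
assert obj_finf < obj_f0           # the regularized objective prefers f_inf,
                                   # even though its risk is larger by eps1
print(obj_f0, obj_finf)
```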
### B.2. Proof of Lemma 3.3

By definition,
$$\rho V(f_\rho) \leq R(f_\rho) + \rho V(f_\rho) \leq R(f_\infty) + \rho V(f_\infty) \leq 1 + \rho V(f_\infty) \leq 1 + \rho u \tag{28}$$
Therefore, $V(f_\rho) \leq u + 1/\rho$.

### B.3. Proof of Lemma 3.4

To prove this result, we need the following lemmas. The first is a contraction inequality established in Cortes et al. (2016).

**Lemma B.1** (Lemma 5 from Cortes et al. (2016)). Let $\mathcal{H}$ be a set of functions mapping $\mathcal{X}$ to $\mathbb{R}^N$. Suppose $\Phi_i$ is $\mu_i$-Lipschitz with respect to the 2-norm, i.e.,
$$|\Phi_i(v') - \Phi_i(v)| \leq \mu_i \|v' - v\|_2 \quad \forall v, v' \in \mathbb{R}^N \tag{29}$$
Then, for any set of $m$ points $x_1, \ldots, x_m \in \mathcal{X}$, the following inequality holds:
$$\mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \sum_{i=1}^m \sigma_i \Phi_i(h(x_i))\right] \leq \sqrt{2}\, \mathbb{E}_\epsilon\left[\sup_{h \in \mathcal{H}} \sum_{i=1}^m \sum_{j=1}^N \epsilon_{ij}\, \mu_i\, h_j(x_i)\right] \tag{30}$$
where the $\sigma_i$ and $\epsilon_{ij}$ are independent Rademacher variables uniformly distributed over $\{-1, +1\}$.

The second lemma computes the Lipschitz constants of the $\ell_1$ losses by bounding the 2-norm of their gradients.

**Lemma B.2** (Lipschitzness). Given a scoring function $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, let $f(x) = [f(x, y)]_{y \in \mathcal{Y}} \in \mathbb{R}^{|\mathcal{Y}|}$ be the vector of scores for each label. For any two scoring functions $f, f'$ and data $(x, y)$, we have
$$|P_f(y \mid x) - P_{f'}(y \mid x)| \leq \frac{\sqrt{2}}{4} \|f(x) - f'(x)\|_2 \tag{31}$$
Furthermore, for any constraint $C$, we have
$$|P_f(C \mid x) - P_{f'}(C \mid x)| \leq \frac{1}{4}\sqrt{\frac{1}{|C(x)|} + \frac{1}{|\mathcal{Y} \setminus C(x)|}}\; \|f(x) - f'(x)\|_2 \tag{32}$$
where $P_f(C \mid x) = P_f(C(x) \mid x) = \sum_{y \in C(x)} P_f(y \mid x)$.

*Proof.* We start with the second claim. Suppose $C(x) = \mathcal{Y}$; then $1 - P_f(C \mid x) = 0$ for any scoring function $f$, so the inequality trivially holds. Next, we assume $C(x) \subsetneq \mathcal{Y}$. Given a constraint $C: \mathcal{X} \to 2^{\mathcal{Y}}$, the derivative of $P_f(C \mid x)$ with respect to the score of a label $y'$ is
$$\frac{\partial P_f(C \mid x)}{\partial f(x, y')} = \sum_{y \in C(x)} \Big( P_f(y \mid x)\,\mathbb{1}\{y = y'\} - P_f(y \mid x)\, P_f(y' \mid x) \Big) \tag{33}$$
The 2-norm of the gradient of the mapping $f(x) \mapsto P_f(C \mid x)$ is then
$$\left( \sum_{y' \in \mathcal{Y}} \Big( \sum_{y \in C(x)} P_f(y \mid x)\,\mathbb{1}\{y = y'\} - P_f(y \mid x)\, P_f(y' \mid x) \Big)^2 \right)^{1/2} \tag{34}$$
which is maximized when $P_f(y \mid x) = \frac{1}{2|C(x)|}$ for all $y \in C(x)$ and $P_f(y \mid x) = \frac{1}{2|\mathcal{Y} \setminus C(x)|}$ for all $y \notin C(x)$ (so that $P_f(C \mid x) = 1/2$). The maximum is then
$$\sqrt{|C(x)| \Big( \frac{1}{4|C(x)|} \Big)^2 + |\mathcal{Y} \setminus C(x)| \Big( \frac{1}{4|\mathcal{Y} \setminus C(x)|} \Big)^2} = \sqrt{\frac{1}{16|C(x)|} + \frac{1}{16|\mathcal{Y} \setminus C(x)|}} \tag{35}$$
The boundedness of the gradient implies that the function $f(x) \mapsto P_f(C \mid x)$ is Lipschitz with constant $\frac{1}{4}\sqrt{\frac{1}{|C(x)|} + \frac{1}{|\mathcal{Y} \setminus C(x)|}}$. The first claim then follows by considering the special constraint $C(x) := \{y_{ora}(x)\}$, for which $|C(x)| = 1$.

Next, we present the proof of the lemma. By standard Rademacher complexity bounds, given a labeled dataset $S_L$ of size $m$, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds uniformly for $f \in \mathcal{F}$:
$$R(f) \leq \hat{R}(f; S_L) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}} \tag{36}$$
where
$$\mathcal{H} := \{(x, y) \mapsto 1 - P_f(y \mid x) : f \in \mathcal{F}\} \tag{37}$$
By the contraction lemma and Lipschitzness,
$$\mathbb{E}_S \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i \big(1 - P_f(y_i \mid x_i)\big)\right] \leq \sqrt{2} \cdot \frac{\sqrt{2}}{4}\; \mathbb{E}_S \mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{y \in \mathcal{Y}} \epsilon_{iy} f(x_i, y)\right] = \frac{m}{2}\, \mathfrak{R}_m(\mathcal{F}) \tag{38}$$
This implies $\mathfrak{R}_m(\mathcal{H}) \leq \frac{1}{2}\mathfrak{R}_m(\mathcal{F})$ and hence
$$R(f) \leq \hat{R}(f; S_L) + \mathfrak{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}} \tag{39}$$
The proof of the generalization bound for the violation follows from the same argument. In particular, if the size of the constrained set $C(x)$ is a constant, namely $|C(x)| = c_0 < c = |\mathcal{Y}|$ for all $x \in \mathcal{X}$, then from equation (35) we know that the mapping $f(x) \mapsto 1 - P_f(C \mid x)$ is Lipschitz with constant $\frac{1}{4}\sqrt{\frac{1}{c_0} + \frac{1}{c - c_0}}$. So, in this case, the generalization bound for the violation function can be improved to
$$V(f) \leq \hat{V}(f; S_U) + \frac{1}{2}\sqrt{\frac{1}{c_0} + \frac{1}{c - c_0}}\; \mathfrak{R}_{m_U}(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{40}$$

### B.4. Proof of Theorem 3.6

**Step 1. The expected violation of $\hat{f}_\rho$ is bounded.** First, with probability $1 - \delta$,
$$\rho \hat{V}(\hat{f}_\rho) \leq \hat{R}(\hat{f}_\rho) + \rho \hat{V}(\hat{f}_\rho) \leq \hat{R}(f_\infty) + \rho \hat{V}(f_\infty) \leq 1 + \rho V(f_\infty) + \rho\sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{41}$$
where the last step follows by applying Hoeffding's inequality to $\hat{V}(f_\infty)$. This result implies
$$\hat{V}(\hat{f}_\rho) \leq \frac{1}{\rho} + u + \sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{42}$$
Second, Lemma 3.4 claims that with probability $1 - \delta$, the following inequality holds:
$$V(\hat{f}_\rho) \leq \hat{V}(\hat{f}_\rho) + \mathfrak{R}_{m_U}(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{43}$$
Putting these two inequalities together via a union bound, we know that with probability $1 - 2\delta$,
$$V(\hat{f}_\rho) \leq \frac{1}{\rho} + u + \mathfrak{R}_{m_U}(\mathcal{F}) + 2\sqrt{\frac{\log(1/\delta)}{2 m_U}} = \frac{1}{\rho} + u + B(\delta, m_U, \mathcal{F})$$
Namely, with probability no less than $1 - 2\delta$, $\hat{f}_\rho$ lies in $\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}$, which is a fixed hypothesis class.

**Step 2. Bounding the generalization gap of $L_\rho$.** Since $\hat{f}_\rho \in \mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}$, we can bound the generalization gap of $L_\rho$ using the uniform convergence property of $\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}$.
By the standard decomposition,
$$L_\rho(\hat{f}_\rho) - L_\rho(f_\rho) = \underbrace{L_\rho(\hat{f}_\rho) - \hat{L}_\rho(\hat{f}_\rho)}_{(*)} + \underbrace{\hat{L}_\rho(\hat{f}_\rho) - \hat{L}_\rho(f_\rho)}_{\leq 0} + \underbrace{\hat{L}_\rho(f_\rho) - L_\rho(f_\rho)}_{(**)} \tag{44}$$
For term $(*)$, combining the two inequalities in Lemma 3.4 with Step 1 via a union bound, we know that with probability $1 - 4\delta$,
$$(*) \leq \mathfrak{R}_{m_L}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + \rho\, \mathfrak{R}_{m_U}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + \sqrt{\frac{\log(1/\delta)}{2 m_L}} + \rho\sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{45}$$
For term $(**)$, using Hoeffding's inequality for the risk and the violation separately, we have with probability $1 - 2\delta$,
$$(**) \leq \sqrt{\frac{\log(1/\delta)}{2 m_L}} + \rho\sqrt{\frac{\log(1/\delta)}{2 m_U}} \tag{46}$$
By a union bound, with probability $1 - 6\delta$,
$$L_\rho(\hat{f}_\rho) - L_\rho(f_\rho) \leq \underbrace{\mathfrak{R}_{m_L}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + \rho\, \mathfrak{R}_{m_U}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + 2\sqrt{\frac{\log(1/\delta)}{2 m_L}} + 2\rho\sqrt{\frac{\log(1/\delta)}{2 m_U}}}_{\text{for convenience, denote these terms as } B_0} \tag{47}$$

**Step 3. Bounding the risk of $\hat{f}_\rho$.** By Step 2, we have with probability $1 - 6\delta$,
$$R(\hat{f}_\rho) \leq R(f_\rho) + \rho V(f_\rho) - \rho V(\hat{f}_\rho) + B_0 \leq R(f_0) + \rho V(f_0) - \rho V(\hat{f}_\rho) + B_0 \leq R(f_0) + \rho V(f_0) - \rho V(f_\infty) + B_0 \tag{48}$$
We conclude that with probability $1 - 6\delta$,
$$R(\hat{f}_\rho) \leq R(f_0) + \rho V(f_0) - \rho V(f_\infty) + \mathfrak{R}_{m_L}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + \rho\, \mathfrak{R}_{m_U}\big(\mathcal{F}_{1/\rho + u + B(\delta, m_U, \mathcal{F})}\big) + 2\sqrt{\frac{\log(1/\delta)}{2 m_L}} + 2\rho\sqrt{\frac{\log(1/\delta)}{2 m_U}}$$
as claimed.

### B.5. Proof of Example 3.7

The normalizing factor $\sum_{j=1}^c e^{w_j^T x}$ is maximized at $w_1 = x = [1, 0, 0, \ldots, 0]$ and $w_2 = \cdots = w_c = 0$, so that
$$\sum_{j=1}^c e^{w_j^T x} \leq e + (c - 1) \leq c + 2 \tag{50}$$
This implies $P_w(y_c \mid x) \geq e^{w_c^T x} / (c + 2)$. Therefore, $\mathbb{E}[P_w(y_c)] \leq t$ implies $t(c+2) \geq \mathbb{E}[e^{w_c^T x}] \geq e^{\mathbb{E}[w_c^T x]} = e^{\alpha^T w_c}$, or equivalently $\alpha^T w_c \leq \log(t(c+2))$. Therefore, given a set of data $S = \{x_i\}_{i=1}^m$ and Rademacher random variables $\epsilon$, the inner supremum in the definition of the Rademacher complexity can be upper bounded by solving the following program:
$$\max_w\; \sum_{i=1}^m \sum_{j=1}^c \epsilon_{i,j} w_j^T x_i \quad \text{s.t.} \quad \sum_{j=1}^c w_j^T w_j \leq 1, \quad \alpha^T w_c \leq \log(t(c+2)) \tag{51}$$
Consider its Lagrangian
$$L(w, \lambda, \nu) = \sum_{i=1}^m \sum_{j=1}^c \epsilon_{i,j} w_j^T x_i + \lambda\Big(1 - \sum_{j=1}^c w_j^T w_j\Big) + \nu\big(\log(t(c+2)) - \alpha^T w_c\big) \tag{52}$$
Denote $\xi_j := \sum_{i=1}^m \epsilon_{i,j} x_i$. The Lagrangian is then maximized at $w_j = \xi_j / (2\lambda)$ for $j < c$ and $w_c = (\xi_c - \nu\alpha)/(2\lambda)$. The dual function then writes:
$$g(\lambda, \nu) = \nu \log(t(c+2)) + \lambda + \frac{1}{4\lambda}\Big( \sum_{j=1}^{c-1} \|\xi_j\|_2^2 + \|\xi_c - \nu\alpha\|_2^2 \Big) \geq \nu \log(t(c+2)) + \sqrt{\sum_{j=1}^{c-1} \|\xi_j\|_2^2 + \|\xi_c - \nu\alpha\|_2^2} \tag{53}$$
By weak duality, the value of (51) is at most
$$\inf_{\nu \geq 0}\; \nu \log(t(c+2)) + \sqrt{\sum_{j=1}^{c-1} \|\xi_j\|_2^2 + \|\xi_c - \nu\alpha\|_2^2} \tag{54}$$
Assuming $t < 1/(c+2)$, so that $\log(t(c+2)) < 0$, we can upper bound (54) by
$$\min_{\nu \geq 0}\; \sqrt{\sum_{j=1}^{c-1} \|\xi_j\|_2^2 + \|\xi_c - \nu\alpha\|_2^2} \tag{55}$$
The function $\sum_{j=1}^{c-1}\|\xi_j\|_2^2 + \|\xi_c - \nu\alpha\|_2^2$ is minimized at $\nu = 0$ if $\xi_c^T \alpha \leq 0$ and at $\nu = \xi_c^T \alpha / \|\alpha\|_2^2$ otherwise. Denote the event $\xi_c^T \alpha \leq 0$ as $E$. By symmetry, $P(E) = 1/2$, so that (55) is at most
$$\begin{cases} \sqrt{\sum_{j=1}^c \|\xi_j\|_2^2} & \text{on } E \\[4pt] \sqrt{\sum_{j=1}^c \|\xi_j\|_2^2 - (\xi_c^T \alpha)^2 / \|\alpha\|_2^2} & \text{otherwise} \end{cases} \tag{56}$$
Again by symmetry, the quantity $(\xi_c^T \alpha)^2$ is independent of $E$. Therefore, by Jensen's inequality,
$$\mathbb{E}_{S, \epsilon}\left[\sqrt{\sum_{j=1}^c \|\xi_j\|_2^2 - \frac{(\xi_c^T \alpha)^2}{\|\alpha\|_2^2}}\right] \leq \sqrt{cm - \frac{\mathrm{Var}(\xi_c^T \alpha)}{\|\alpha\|_2^2}} = \sqrt{cm - \frac{m\sigma^2\|\alpha\|_2^2 + m\|\alpha\|_2^4}{\|\alpha\|_2^2}} = \sqrt{(c - \sigma^2 - \|\alpha\|_2^2)\, m} \tag{57}$$
Similarly, we can use Jensen's inequality to bound $\mathbb{E}_{S, \epsilon}\big[\sqrt{\sum_{j=1}^c \|\xi_j\|_2^2}\;\big|\; E\big] \leq \sqrt{cm}$. Putting these together, we have
$$\mathfrak{R}_m(\mathcal{F}_t) = \mathbb{E}_S[\hat{\mathfrak{R}}_m(\mathcal{F}_t)] \leq \frac{1}{2}\left(\sqrt{\frac{c}{m}} + \sqrt{\frac{c - \sigma^2 - \|\alpha\|_2^2}{m}}\right) \tag{58}$$

## C. Proofs from Section 4

### C.1. Proof of Proposition 4.2

First, we show that the Rademacher complexity of the singleton mapping is zero:
$$\mathfrak{R}_m(\{(x, y) \mapsto -\mu v(x, y)\}) = \frac{1}{m}\mathbb{E}\left[\sum_{i=1}^m \sum_{y \in \mathcal{Y}} \epsilon_{i,y} \cdot \big(-\mu v(x_i, y)\big)\right] = \frac{1}{m}\mathbb{E}_S\left[\sum_{i=1}^m \sum_{y \in \mathcal{Y}} \mathbb{E}[\epsilon_{i,y}] \cdot \big(-\mu v(x_i, y)\big)\right] = 0 \tag{59}$$
Second, we use the linearity of the Rademacher complexity to obtain the desired result:
$$\mathfrak{R}_m(\mathcal{F}^\mu) = \frac{1}{m}\mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^m \sum_{y \in \mathcal{Y}} \epsilon_{i,y}\big(f(x_i, y) - \mu v(x_i, y)\big)\right] = \mathfrak{R}_m(\mathcal{F}) + \mathfrak{R}_m(\{(x, y) \mapsto -\mu v(x, y)\}) = \mathfrak{R}_m(\mathcal{F}) \tag{60}$$

### C.2. Proof of Theorem 4.3

**(a)** Given any scoring function $f$, let $Z^C_f(x) := \sum_{y \in C(x)} \exp(f(x, y))$ and $Z^{\bar{C}}_f(x) := \sum_{y \notin C(x)} \exp(f(x, y))$.
C.2. Proof of Theorem 4.3

(a) Given any scoring function $f$, let $Z^C_f(x) := \sum_{y \in C(x)} \exp(f(x, y))$ and $Z^{\bar{C}}_f(x) := \sum_{y \notin C(x)} \exp(f(x, y))$. We have
\[
\frac{d}{d\mu} \Delta^\mu_{CE}(f) = \frac{d}{d\mu}\, \mathbb{E}\Bigg[ \log \frac{\exp\big(f(x, y_{ora}) - \mu v(x, y_{ora})\big)}{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^\mu} \Bigg] = \mathbb{E}\Bigg[ \frac{Z^{\bar{C}}_f(x)/e^\mu}{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^\mu} - v(x, y_{ora}) \Bigg] = V(f^\mu) - V_{ora} \tag{61}
\]
Moreover,
\[
\frac{d}{d\mu} V(f^\mu) = \mathbb{E}\Bigg[ \frac{-Z^{\bar{C}}_{f^\mu}(x)\, Z_{f^\mu}(x) + \big(Z^{\bar{C}}_{f^\mu}(x)\big)^2}{\big(Z_{f^\mu}(x)\big)^2} \Bigg] = \mathbb{E}\Big[ P^2_{f^\mu}(\bar{C}|x) - P_{f^\mu}(\bar{C}|x) \Big]
\]
which is negative and bounded, implying that $V(f^\mu) - V_{ora}$ is decreasing and Lipschitz in $\mu$. Therefore, there is a $\mu > 0$ such that $R_{CE}(f^\mu) < R_{CE}(f)$ if and only if the derivative is positive at $\mu = 0$, i.e., $V(f) > V_{ora}$.

(b) By (61),
\[
\Delta^\mu_{CE}(f) = \int_0^\mu \big( V(f^t) - V_{ora} \big)\, dt = \int_0^\mu \mathbb{E}\Bigg[ \frac{Z^{\bar{C}}_f(x)/e^t}{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^t} \Bigg] dt - \mu V_{ora} \ge \int_0^\mu \mathbb{E}\Bigg[ \frac{Z^{\bar{C}}_f(x)/e^t}{Z^C_f(x) + Z^{\bar{C}}_f(x)} \Bigg] dt - \mu V_{ora} = \mathbb{E}\Bigg[ \big(1 - e^{-\mu}\big)\frac{Z^{\bar{C}}_f(x)}{Z^C_f(x) + Z^{\bar{C}}_f(x)} \Bigg] - \mu V_{ora} = (1 - e^{-\mu})V(f) - \mu V_{ora}
\]

(c) If $V_{ora} = 0$, we have
\[
\Delta^\infty_{CE}(f) = \int_0^\infty \mathbb{E}\Bigg[ \frac{Z^{\bar{C}}_f(x)/e^t}{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^t} \Bigg] dt = \mathbb{E}\Bigg[ \log \frac{Z^C_f(x) + Z^{\bar{C}}_f(x)}{Z^C_f(x)} \Bigg] = V_{CE}(f)
\]

C.3. Proof of Corollary 4.5

Using Theorem 4.3 (b), the result follows by solving the following inequality for $\mu$:
\[
(1 - e^{-\mu})V(f) - \mu V_{ora} \ge 0 \tag{65}
\]
It is known that the solution to the inequality $u \le a + be^{cu}$ in $u$ is $u \le a - \frac{1}{c} W(-bc\, e^{ac})$, where $W$ is the Lambert $W$ function. Substituting $a = \eta$, $b = -\eta$, and $c = -1$, where $\eta = V(f)/V_{ora}$, yields the desired result:
\[
\mu \le W(-\eta e^{-\eta}) + \eta \tag{66}
\]
where the RHS is positive only when $\eta > 1$. A plot of this solution as a function of $\eta$ is presented in Figure 2.

Figure 2. Choice of $\mu$ as a function of $\eta = V(f)/V_{ora}$.
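For reference, the closed form (66) is easy to evaluate with an off-the-shelf Lambert $W$ implementation. The sketch below is ours, not the authors' code; it computes the largest admissible $\mu$ for a few values of $\eta$ and checks that (65) holds with equality at that point.

```python
# Largest admissible CCM penalty from Corollary 4.5 / Eq. (66):
#   mu* = W(-eta * exp(-eta)) + eta,  with eta = V(f) / V_ora.
# We use the principal branch W_0 from scipy; positive only for eta > 1.
import numpy as np
from scipy.special import lambertw

def max_penalty(eta):
    """Largest mu with (1 - exp(-mu)) * eta >= mu."""
    return np.real(lambertw(-eta * np.exp(-eta), k=0)) + eta

for eta in [1.0, 1.5, 2.0, 5.0, 10.0]:
    mu = max_penalty(eta)
    # Eq. (65) divided by V_ora: (1 - e^{-mu}) * eta - mu should vanish at mu*.
    slack = (1 - np.exp(-mu)) * eta - mu
    print(f"eta = {eta:5.2f}  ->  mu* = {mu:.4f}  (slack at mu*: {slack:+.2e})")
```

At $\eta = 1$ the argument $-\eta e^{-\eta} = -1/e$ sits at the branch point of $W_0$ and the admissible penalty collapses to $\mu^* = 0$, consistent with the corollary.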
C.4. Proof of Proposition 4.6

This claim follows from the fact that $R_{CE}(f^\infty) = R_{CE}(f) - V_{CE}(f)$, which is a consequence of Theorem 4.3 (c). For equation (23), the first inequality follows from the optimality of $f_{on}$. For the second inequality, by definition we have
\[
R_{CE}(f^\infty_{post}) + V_{CE}(f_{post}) = R_{CE}(f_{post}) \le R_{CE}(f_{on})
\]
and hence
\[
R_{CE}(f^\infty_{post}) \le R_{CE}(f_{on}) - V_{CE}(f_{post}) \le R_{CE}(f_{on}) - \min_{f \in \mathcal{F}} V_{CE}(f) \tag{67}
\]

D. Analysis for Hinge Loss and ℓ1 Loss

D.1. Hinge Loss

The margin of a scoring function $f$ at a sample $(x, y_{ora})$ is defined as
\[
m(x, y_{ora}, f) := \max_{y \in \mathcal{Y}} \{f(x, y)\} - f(x, y_{ora}) \tag{68}
\]
We denote its expectation by $M(f) = \mathbb{E}[m(x, y_{ora}, f)]$. Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the structured hinge loss (London et al., 2016; Meshi et al., 2019) is defined as the margin of the loss-augmented scoring function $f + \ell : (x, y) \mapsto f(x, y) + \ell(y, y_{ora})$. Namely,
\[
L_{hinge}(x, y_{ora}, f) := m(x, y_{ora}, f + \ell) \tag{69}
\]
Therefore, we can study the impact of constrained inference on the hinge loss via its impact on the margin. Let $\Delta^\mu_{margin}(f) = M(f) - M(f^\mu)$. We present the following result.

Theorem D.1 (Change of Margin). The following results hold:

(a) For any fixed model $f$, there exists an $\mu_0 > 0$ such that $M(f^{\mu_0}) < M(f)$ only if
\[
V_{01}(f) > V_{ora} \tag{70}
\]
where $V_{01}(f)$ is the zero-one style violation defined as $\mathbb{E}[\mathbb{1}\{\arg\max_{y \in \mathcal{Y}} f(x, y) \notin C(x)\}]$.

(b) In particular, if the constraint is noise-free, we have
\[
\Delta^\infty_{margin}(f) = \mathbb{E}\Big[ \max_{y \in \mathcal{Y}} f(x, y) - \max_{y \in C(x)} f(x, y) \Big] = \mathbb{E}\Big[ \Big( \max_{y \notin C(x)} f(x, y) - \max_{y \in C(x)} f(x, y) \Big)_+ \Big]
\]

Proof. (a) The derivative of the change of the margin is
\[
\frac{d}{d\mu} \Delta^\mu_{margin}(f) = -\frac{d}{d\mu} M(f^\mu) = -\frac{d}{d\mu}\, \mathbb{E}\Big[ \max_{y \in \mathcal{Y}} \{f(x, y) - \mu v(x, y)\} - f(x, y_{ora}) + \mu v(x, y_{ora}) \Big] = \mathbb{E}[v(x, y_{f^\mu}) - v(x, y_{ora})] \tag{72}
\]
where $y_{f^\mu} := \arg\max_{y \in \mathcal{Y}} \{f(x, y) - \mu v(x, y)\}$ is the argmax inference output of the CCM. Moreover, this derivative is non-increasing in $\mu$. Therefore, a necessary condition for CCM to reduce the margin is
\[
\mathbb{E}[v(x, y_f)] = V_{01}(f) > V_{ora} \tag{73}
\]
(b) This follows directly by taking the difference between $M(f)$ and $M(f^\infty)$.

Remark D.2. Due to the discontinuous nature of argmax inference, the function $v(x, y_{f^\mu})$ is in general not continuous in $\mu$. On the other hand, if we assume $\mu \mapsto \mathbb{E}[v(x, y_{f^\mu})]$ is Lipschitz continuous, the condition proposed in (a) is also sufficient, as in the analysis for the cross-entropy loss.

The impact of constrained inference on the hinge loss can be investigated by substituting $f$ with $f + \ell$. For example, a sufficient condition for improving the average hinge loss is $V_{01}(f + \ell) > V_{ora}$. The quantity $\big(\max_{y \notin C(x)} f(x, y) - \max_{y \in C(x)} f(x, y)\big)_+$ is closely related to the integrality loss defined in Meshi et al. (2019). It is a hinge-style surrogate loss function for the zero-one style violation of $f$ under argmax inference:
\[
P\Big( \max_{y \notin C(x)} f(x, y) - \max_{y \in C(x)} f(x, y) \ge 0 \Big) = V_{01}(f) \tag{74}
\]

D.2. ℓ1 Loss

To facilitate our discussion, we first present the following lemma, which will be useful in this section.

Lemma D.3 (Gradients of CCM). For any constraint $C$ we have the following:

1. The derivative of the predicted probability is
\[
\frac{d}{d\mu} P_{f^\mu}(y|x) = P_{f^\mu}(y|x)\big( P_{f^\mu}(\bar{C}|x) - v(x, y) \big) \tag{75}
\]
2. The second-order derivative of the probability is
\[
\frac{d^2}{d\mu^2} P_{f^\mu}(y|x) = P_{f^\mu}(y|x)\Big( \big( P_{f^\mu}(\bar{C}|x) - v(x, y) \big)^2 + P^2_{f^\mu}(\bar{C}|x) - P_{f^\mu}(\bar{C}|x) \Big) \tag{76}
\]

Proof. Recall that given any scoring function $f$, we denote
\[
Z^C_f(x) := \sum_{y \in C(x)} \exp(f(x, y)) \qquad \text{and} \qquad Z^{\bar{C}}_f(x) := \sum_{y \notin C(x)} \exp(f(x, y))
\]
We also let $Z_f(x) = Z^C_f(x) + Z^{\bar{C}}_f(x)$.

(a) The pointwise derivative of the predicted probability with respect to $\mu$ is
\[
\frac{d}{d\mu} P_{f^\mu}(y|x) = \frac{d}{d\mu}\, \frac{e^{f(x,y) - \mu v(x,y)}}{Z_{f^\mu}(x)} = \frac{Z_{f^\mu}(x)\big( -v(x,y)\, e^{f(x,y) - \mu v(x,y)} \big) + Z^{\bar{C}}_{f^\mu}(x)\, e^{f(x,y) - \mu v(x,y)}}{\big(Z_{f^\mu}(x)\big)^2} = P_{f^\mu}(y|x)\big( P_{f^\mu}(\bar{C}|x) - v(x, y) \big)
\]
where the second equality follows from the fact that $\frac{d}{d\mu} Z_{f^\mu}(x) = -Z^{\bar{C}}_{f^\mu}(x)$.

(b) Based on (a),
\[
\frac{d^2}{d\mu^2} P_{f^\mu}(y|x) = \Big[ P_{f^\mu}(y|x)\big( P_{f^\mu}(\bar{C}|x) - v(x,y) \big) \Big]\big( P_{f^\mu}(\bar{C}|x) - v(x,y) \big) + P_{f^\mu}(y|x)\big( P^2_{f^\mu}(\bar{C}|x) - P_{f^\mu}(\bar{C}|x) \big) = P_{f^\mu}(y|x)\Big( \big( P_{f^\mu}(\bar{C}|x) - v(x,y) \big)^2 + P^2_{f^\mu}(\bar{C}|x) - P_{f^\mu}(\bar{C}|x) \Big) \tag{78}
\]

Now we discuss the change in the $\ell_1$ risk, defined as $\Delta^\mu(f) := R(f) - R(f^\mu)$.

Theorem D.4 (Change of ℓ1 Risk). The following results hold:

(a) For any fixed model $f$, there exists an $\mu_0 > 0$ such that $R(f^{\mu_0}) < R(f)$ if
\[
\mathbb{E}[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] > \mathbb{E}[P_f(y_{ora}|x)\, v(x, y_{ora})] \tag{79}
\]
(b) The change of risk can be lower bounded by
\[
\Delta^\mu(f) \ge \frac{1 - e^{-2\mu}}{2}\, \mathbb{E}_x[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] - \mu V_{ora} \tag{80}
\]
(c) In particular, if the constraint is noise-free, we have
\[
\Delta^\infty(f) \ge \mathbb{E}_x[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] \tag{81}
\]

Proof. (a) From Lemma D.3 (a), we know the derivative of $-R(f^\mu)$ with respect to $\mu$ at $\mu = 0$ is
\[
\mathbb{E}[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] - \mathbb{E}[P_f(y_{ora}|x)\, v(x, y_{ora})] \tag{82}
\]
Further, Lemma D.3 (b) implies that this derivative is Lipschitz in $\mu$, since for any $\mu$,
\[
\Big| P_{f^\mu}(y|x)\Big( \big(P_{f^\mu}(\bar{C}|x) - v(x, y)\big)^2 + P^2_{f^\mu}(\bar{C}|x) - P_{f^\mu}(\bar{C}|x) \Big) \Big| \le 1 \tag{83}
\]
Therefore, a sufficient condition for the existence of an $\mu_0 > 0$ such that $R(f^{\mu_0}) < R(f)$ is that $\mathbb{E}[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] > \mathbb{E}[P_f(y_{ora}|x)\, v(x, y_{ora})]$.

(b) First, we note for any $y$ and $\mu$ that
\[
P_{f^\mu}(y|x)\, P_{f^\mu}(\bar{C}|x) = \frac{e^{f(x,y) - \mu v(x,y)}\, Z^{\bar{C}}_f(x)/e^\mu}{\big(Z_{f^\mu}(x)\big)^2} \ge \frac{e^{f(x,y) - \mu}\, Z^{\bar{C}}_f(x)/e^\mu}{\big(Z_f(x)\big)^2} = P_f(y|x)\, P_f(\bar{C}|x)\, e^{-2\mu}
\]
Also,
\[
\mathbb{E}[P_{f^t}(y_{ora}|x)\, v(x, y_{ora})] \le \mathbb{E}[v(x, y_{ora})] = V_{ora} \tag{85}
\]
Integrating the derivative gives
\[
\Delta^\mu(f) = \int_0^\mu \mathbb{E}\big[ P_{f^t}(y_{ora}|x)\big( P_{f^t}(\bar{C}|x) - v(x, y_{ora}) \big) \big]\, dt \ge \int_0^\mu \Big( \mathbb{E}\big[ P_f(y_{ora}|x)\, P_f(\bar{C}|x) \big] e^{-2t} - V_{ora} \Big)\, dt = \frac{1 - e^{-2\mu}}{2}\, \mathbb{E}_x[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] - \mu V_{ora}
\]
(c) With noise-free constraints, $v(x, y_{ora}) = 0$, so
\[
P_{f^\mu}(y_{ora}|x)\, P_{f^\mu}(\bar{C}|x) = \frac{e^{f(x,y_{ora})}\, Z^{\bar{C}}_f(x)/e^\mu}{\big(Z_{f^\mu}(x)\big)^2} \ge \frac{e^{f(x,y_{ora})}\, Z^{\bar{C}}_f(x)/e^\mu}{\big(Z_f(x)\big)^2} = P_f(y_{ora}|x)\, P_f(\bar{C}|x)\, e^{-\mu}
\]
Integrating both sides gives
\[
\Delta^\infty(f) = \int_0^\infty \mathbb{E}\big[ P_{f^t}(y_{ora}|x)\, P_{f^t}(\bar{C}|x) \big]\, dt \ge \int_0^\infty \mathbb{E}\big[ P_f(y_{ora}|x)\, P_f(\bar{C}|x) \big] e^{-t}\, dt = \mathbb{E}_x[P_f(y_{ora}|x)\, P_f(\bar{C}|x)] \tag{88}
\]

The term $\mathbb{E}_x[P_f(y_{ora}|x)\, P_f(\bar{C}|x)]$ plays a key role in these results: it measures the average violation of the model $f$, weighted by the model's confidence in the true label. The first result shows that if this weighted average violation is larger than that of the true data distribution, then CCM is helpful. The last result shows that a model with a larger weighted violation obtains more benefit from strictly constrained inference.
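Lemma D.3 (a) is also easy to confirm numerically. The following sketch is our own (the label count, violated set, and penalty are arbitrary illustrative choices); it compares the analytic derivative with a central finite difference.

```python
# Numerical check of Lemma D.3(a): for CCM scores f^mu(x,y) = f(x,y) - mu*v(x,y),
# d/dmu P_{f^mu}(y|x) should equal P_{f^mu}(y|x) * (P_{f^mu}(C_bar|x) - v(x,y)).
import numpy as np

rng = np.random.default_rng(1)
n_labels = 6
v = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # v(x,y) = 1 iff y violates C(x)
f = rng.normal(size=n_labels)                   # a random base scoring vector

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def ccm_probs(mu):
    return softmax(f - mu * v)                  # P_{f^mu}(. | x)

mu, eps = 0.7, 1e-6
p = ccm_probs(mu)
analytic = p * (p[v == 1].sum() - v)            # Lemma D.3(a)
numeric = (ccm_probs(mu + eps) - ccm_probs(mu - eps)) / (2 * eps)
print("max |analytic - numeric| =", np.abs(analytic - numeric).max())  # ~1e-9
```

The agreement to roughly machine precision reflects the smoothness of softmax inference in $\mu$, in contrast with the discontinuous argmax case of Remark D.2.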
E. Proofs from Section 5

E.1. Proof of Theorem 5.1

Recall that $f^\mu_\star = \arg\min_{g \in \mathcal{F}^\mu} \{R_{CE}(g) + \rho V_{CE}(g)\}$ is the optimal CCM for the regularized surrogate objective and $f_{post}$ is the cross-entropy risk minimizer in $\mathcal{F}$. According to our notation, $f^\mu_{post}$ is the constrained model with base model $f_{post}$. By this definition, we have
\[
R_{CE}(f^\mu_\star) + \rho V_{CE}(f^\mu_\star) \le R_{CE}(f^\mu_{post}) + \rho V_{CE}(f^\mu_{post}) \tag{89}
\]
Therefore,
\[
R_{CE}(f^\mu_\star) \le R_{CE}(f^\mu_{post}) + \rho\big( V_{CE}(f^\mu_{post}) - V_{CE}(f^\mu_\star) \big) \le R_{CE}(f^\mu_{post}) + \rho V_{CE}(f^\mu_{post}) = R_{CE}(f_{post}) - \Delta^\mu_{CE}(f_{post}) + \rho V_{CE}(f^\mu_{post})
\]
Therefore, a sufficient condition for $R_{CE}(f^\mu_\star) \le R_{CE}(f_{post})$ is that $\rho V_{CE}(f^\mu_{post}) < \Delta^\mu_{CE}(f_{post})$.

Furthermore, recall that for any scoring function $f$ we define $Z^C_f(x) := \sum_{y \in C(x)} \exp(f(x, y))$ and $Z^{\bar{C}}_f(x) := \sum_{y \notin C(x)} \exp(f(x, y))$. We then have
\[
V_{CE}(f) - V_{CE}(f^\mu) = \mathbb{E}\Bigg[ \log \frac{Z^C_f(x) + Z^{\bar{C}}_f(x)}{Z^C_f(x)} - \log \frac{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^\mu}{Z^C_f(x)} \Bigg] = \mathbb{E}\Bigg[ \log \frac{Z^C_f(x) + Z^{\bar{C}}_f(x)}{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^\mu} \Bigg] = \int_0^\mu \mathbb{E}\Bigg[ \frac{Z^{\bar{C}}_f(x)/e^t}{Z^C_f(x) + Z^{\bar{C}}_f(x)/e^t} \Bigg] dt = \Delta^\mu_{CE}(f) + \mu V_{ora}
\]
(compare to equation (61)). Therefore, $\Delta^\mu_{CE}(f_{post}) = V_{CE}(f_{post}) - V_{CE}(f^\mu_{post}) - \mu V_{ora}$, so the sufficient condition can be reformulated as
\[
\rho < \frac{V_{CE}(f_{post}) - V_{CE}(f^\mu_{post}) - \mu V_{ora}}{V_{CE}(f^\mu_{post})} \tag{92}
\]

E.2. Proof of Theorem 5.2

We have seen in Theorem 4.3 that for any scoring function $f$, there is a $\mu > 0$ such that $R_{CE}(f^\mu) < R_{CE}(f)$ if and only if $V(f) > V_{ora}$. On the other hand, we know from Lemma 3.3 that
\[
V(f_\rho) \le V(f^*) + \frac{1}{\rho}
\]
Therefore, if
\[
\rho \ge \frac{1}{V_{ora} - V(f^*)} \tag{94}
\]
we must have $V(f_\rho) \le V_{ora}$, which implies that there is no $\mu > 0$ such that $R_{CE}((f_\rho)^\mu) < R_{CE}(f_\rho)$.
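To make the threshold (92) concrete, the sketch below is a toy, single-example illustration (the scores, the constraint set, and the noise level $V_{ora}$ are all fabricated); it collapses the population expectations to one instance $x$ and evaluates the right-hand side of (92) for one choice of $\mu$.

```python
# Toy evaluation of the sufficient condition in Eq. (92) for a single example x.
import numpy as np

f = np.array([2.0, 0.5, -1.0, 1.5])          # hypothetical scores for |Y| = 4 labels
in_C = np.array([True, True, False, False])  # C(x) = {y1, y2}
v_ora = 0.1                                  # constraint noise: E[v(x, y_ora)] = 0.1

def v_ce(scores):
    """Cross-entropy violation V_CE = -log P(C|x) under softmax scores."""
    z = np.exp(scores - scores.max())
    return -np.log(z[in_C].sum() / z.sum())

mu = 1.0
f_mu = f - mu * (~in_C)                      # CCM: penalize scores outside C by mu

threshold = (v_ce(f) - v_ce(f_mu) - mu * v_ora) / v_ce(f_mu)
print(f"any regularization strength rho < {threshold:.3f} satisfies Eq. (92) here")
```

When the numerator is negative (e.g., for large $\mu V_{ora}$), no positive $\rho$ satisfies the condition, mirroring the tension between regularization and constrained inference that Theorem 5.2 formalizes.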