# Generalization Analysis for Controllable Learning

Yi-Fan Zhang *1,2, Xiao Zhang *3,4, Min-Ling Zhang 2,5

Abstract. Controllability has become a critical issue in trustworthy machine learning, as a controllable learner allows for dynamic model adaptation to task requirements during testing. However, existing research lacks a comprehensive understanding of how to effectively measure and analyze the generalization performance of controllable learning methods. In an attempt to move towards this goal from a generalization perspective, we first establish a unified framework for controllable learning. Then, we develop a novel vector-contraction inequality and derive a tight generalization bound for general controllable learning classes, which is independent of the number of task targets except for logarithmic factors and represents the current best-in-class theoretical result. Furthermore, we derive generalization bounds for two typical controllable learning methods: embedding-based and hypernetwork-based methods. We also upper bound the Rademacher complexities of commonly used control and prediction functions, which serve as modular theoretical components for deriving generalization bounds for specific controllable learning methods in practical applications such as recommender systems. Our theoretical results, which require no strong assumptions, provide general theoretical guarantees for controllable learning methods and offer new insights into understanding controllability in machine learning.
*Equal contribution. 1 School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China. 2 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. 3 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. 4 Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE. 5 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China. Correspondence to: Min-Ling Zhang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Controllability has emerged as a crucial aspect of AI (Yampolskiy, 2020; Artificial Intelligence Safety Summit, 2023; A16z, 2025), enabling AI models to be more reliable and adaptable to user needs in dynamic environments. In the context of machine learning, controllability ensures that learners can dynamically adapt to evolving task requirements at test time, a concept known as controllable learning. In many real-world applications, such as information retrieval and recommender systems, the ability to control model behavior at test time is essential for accommodating diverse user preferences, balancing competing objectives, and adapting to dynamic information-seeking needs (Chen et al., 2023; Mysore et al., 2023; Gao et al., 2024; Chang et al., 2023; He et al., 2022). Existing methods for controllable learning, including embedding-based methods that adjust model inputs and hypernetwork-based methods that generate task-specific model parameters, have demonstrated empirical success (Chen et al., 2023; Chang et al., 2023; Shen et al., 2025; Xie et al., 2025). However, a theoretical understanding of controllability remains an important open question in machine learning theory.
Theoretical analysis of controllable learning methods from the perspective of generalization is an important research avenue. A comprehensive generalization analysis of controllable learning faces two main challenges: 1) How to establish the relationship between the generalization bounds and controllability? 2) How to reduce the dependency of the generalization bounds on the number of task targets? First, controllable learning needs to dynamically respond to the task requirement and often involves multiple task targets. How to establish a unified theoretical framework that can formally characterize these factors is the primary issue of theoretical analysis. In addition, the ideal theoretical framework should be general enough to cover existing controllable learning methods and facilitate the development of general analysis methods and theoretical tools for generalization analysis of controllable learning. With an effective unified theoretical framework, we can explicitly introduce controllability into generalization analysis. Second, intuitively, the multiple task targets involved in controllable learning suggest that learning will become more difficult as the number of task targets increases, but the empirical success of controllable learning methods suggests that the impact of this increased difficulty should be limited (Chen et al., 2023; He et al., 2022; Chang et al., 2023), meaning that the effective generalization bound is weakly dependent on the number of task targets. It is clear that generalization analysis can promote a better understanding of controllable learning methods. In this paper, we establish an effective unified theoretical framework and derive tight bounds for controllable learning.
Specifically, we develop a novel vector-contraction inequality, which induces the state-of-the-art bound with no dependency on the number of task targets, except for logarithmic factors, for general controllable learning classes. In addition, we derive general bounds for two typical controllable learning methods, and we also develop modular theoretical results for commonly used control and prediction functions and show that the bounds for specific controllable learning methods are flexible combinations of these modular theoretical results. Our goal is to deepen our understanding of controllable learning methods through systematic generalization analysis. Major contributions of the paper include:

- We establish a unified theoretical framework and develop a novel vector-contraction inequality for controllable learning, which exploits the Lipschitz continuity of losses w.r.t. the $\ell_\infty$ norm and can induce tight bounds.
- We derive tight bounds for general controllable learning classes with no dependency on the number of task targets, which provides general theoretical guarantees for controllable learning methods.
- We derive bounds for two typical controllable learning methods and reveal that different manipulation methods based on the input and control function will lead to significant differences in the bounds. The theoretical techniques and modular results on FNNs and Transformers serve as promising theoretical tools for the generalization analysis of controllable learning.

2. Related Work

The theoretical analysis in this paper focuses on controllable machine learning methods. Controllable learning emphasizes the model's ability to adapt dynamically at test time based on task requirements to meet specific task targets. It can be broadly categorized into the following two categories.
Embedding-based controllable learning focuses on mapping task requirements into embeddings (Gao et al., 2024; Mysore et al., 2023; Penaloza et al., 2024; Chang et al., 2023; Kong et al., 2024), which are then integrated with other inputs into the original model, serving as a preprocessing method during testing. Mysore et al. (2023) introduced LACE, an embedding-based controllable recommendation model that constructs editable user profiles from human-readable concepts, allowing users to modify them for adaptive recommendations without retraining. Chang et al. (2023) proposed PEPNet to address the imperfectly double seesaw problem by incorporating controllable parameters in dynamic, multi-domain recommendation scenarios. This approach facilitates controllable and personalized predictions of user interactions across diverse tasks and domains. Hypernetwork-based controllable learning generates (partially) new model parameters through a control function based on the given task requirements, replacing the original model parameters to enable adaptive adjustment (Chen et al., 2023; Shen et al., 2023; He et al., 2022; Li et al., 2023; Yan et al., 2022). Since this approach modifies the mapping between task requirements and model parameters during testing, it can also be considered an in-processing method. Chen et al. (2023) proposed a novel framework named CMR, which leverages a feedforward neural network as a hypernetwork to dynamically generate parameters for a re-ranking model based on varying preference weights. This approach enables online adaptability without retraining and has demonstrated positive gains in online A/B tests within e-commerce scenarios. He et al. (2022) presented Hyperprompt, a technique offering controllable task-conditioning of transformers by dynamically adjusting prompts using a hypernetwork for diverse tasks. Controllable learning has broad applications in information access. 
Existing theoretical work has focused solely on the approximation properties of controllable learning (Galanti & Wolf, 2020), while the study of its learning properties, particularly generalization analysis, remains an open problem. This paper aims to bridge the gap in the generalization analysis of controllable learning and provide theoretical tools for a broader exploration of its learning properties.

3. Preliminaries

Let $[n] := \{1, \ldots, n\}$ for any natural number $n$. In the context of controllable learning, we are given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $n$ examples drawn identically and independently (i.i.d.) from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ denotes the $d$-dimensional input space and $\mathcal{Y}$ denotes the label space, $x \in \mathcal{X}$, $y \in \mathcal{Y}$.

3.1. Controllable Learning

Unlike traditional learning methods, where the learning objective is fixed during the training phase and cannot change in the testing phase, controllable learning aims to dynamically adapt to newly arrived task requirements during testing, thereby achieving learner controllability. More specifically, in controllable learning, the learner receives not only the input $x \in \mathcal{X}$ but also the task requirement $z \in \mathcal{Z}$ during the testing phase. It adjusts adaptively based on any given task requirement $z \in \mathcal{Z}$, ensuring that the outputs for $c$ task targets are responsive to the dynamic changes in $z$. Formally, for $\mathcal{Y} \subseteq \mathbb{R}^c$, the task of controllable learning is to solve the problem of multi-output regression with a given task requirement $z$, i.e., the $j$-th task target of the learner $f^z = (f^z_1, \ldots, f^z_c) : \mathcal{X} \mapsto \mathbb{R}^c$ corresponds to the regression function $f^z_j$, $j \in [c]$. For $\mathcal{Y} \subseteq \{-1, +1\}^c$, each $y = (y_1, \ldots, y_c)$ is a binary vector and $y_j = +1$ ($-1$) denotes that the $j$-th label is relevant (irrelevant), $j \in [c]$; the task of controllable learning is then to learn a classifier corresponding to the task requirement $z$, which assigns each instance a set of relevant labels.
A common strategy is to learn a vector-valued function $f^z = (f^z_1, \ldots, f^z_c) : \mathcal{X} \mapsto \mathbb{R}^c$ and derive the classifier by a thresholding function which divides the label space into relevant and irrelevant label sets. We consider the prediction function (also called the controllable learner) for task target $j \in [c]$ of the general form: $f^z_j(x) = \langle w_j, \zeta_j(\phi_j(x, \psi_j(z))) \rangle$, where the nonlinear mapping $\psi_j$ serves as a control function, the manipulation function $\phi_j$ represents the nonlinear transformation that integrates the input and control function into the learner, and the nonlinear mapping $\zeta_j$ corresponds to the classifier learned on the controllable representation generated by $\phi_j$. We define a vector-valued function class of controllable learning as:

$$\mathcal{F} = \{x \mapsto f^z(x) : f^z(x) = (f^z_1(x), \ldots, f^z_c(x)),\ f^z_j(x) = w_j^\top \zeta_j(\phi_j(x, \psi_j(z))),\ w = (w_1, \ldots, w_c) \in \mathbb{R}^{d \times c},\ \alpha(w) \le \Lambda,\ \beta(\zeta(\cdot)) \le A,\ \gamma(\phi(\cdot, \cdot)) \le B,\ \kappa(\psi(\cdot)) \le C,\ x \in \mathcal{X},\ z \in \mathcal{Z},\ j \in [c],\ \Lambda, A, B, C > 0\}, \quad (1)$$

where $\alpha$ represents a functional that constrains the weights, $\beta$ a functional that constrains the nonlinear mappings $\zeta_j$, $\gamma$ a functional that constrains the nonlinear mappings $\phi_j$, and $\kappa$ a functional that constrains the control functions $\psi_j$. For embedding-based controllable learning methods, the task requirement $z$ can be editable user profiles, the control function $\psi$ often uses Transformers, and the nonlinear mapping $\zeta$ induced by the classifier can use FNNs. For hypernetwork-based controllable learning methods, the task requirement $z$ can be a task indicator, the model corresponding to the control function is a hypernetwork, and the nonlinear mapping $\zeta$ induced by the classifier can use Transformer-based models. For any function $f^z : \mathcal{X} \mapsto \mathcal{Y}$, the prediction quality on the example $(x, y)$ is measured by a loss function $\ell : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}_+$. The goal of controllable learning is to find a hypothesis $f^z \in \mathcal{F}$ with good generalization performance from the dataset $D$ by optimizing the loss $\ell$.
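As a concrete illustration, the composition $f^z_j(x) = \langle w_j, \zeta_j(\phi_j(x, \psi_j(z))) \rangle$ can be sketched with toy choices of the three components; all maps and dimensions below are hypothetical stand-ins (a linear-plus-tanh control map, concatenation followed by a linear map, and a ReLU feature map), not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ctrl, c = 4, 3, 5  # input dim, control-embedding dim, number of task targets

# Hypothetical instantiations of the components in f_j^z(x) = <w_j, zeta_j(phi_j(x, psi_j(z)))>:
W_psi = rng.normal(size=(c, d_ctrl, 2))      # one toy control map psi_j per task target
W_phi = rng.normal(size=(c, 6, d + d_ctrl))  # one toy manipulation map phi_j per task target
w = rng.normal(size=(c, 6))                  # classifier weights w_1, ..., w_c

def f(x, z):
    """Vector-valued controllable learner f^z(x) = (f_1^z(x), ..., f_c^z(x))."""
    out = np.empty(c)
    for j in range(c):
        a_j = np.tanh(W_psi[j] @ z)                   # psi_j(z): control function
        h_j = W_phi[j] @ np.concatenate([x, a_j])     # phi_j(x, psi_j(z)): manipulation
        out[j] = w[j] @ np.maximum(h_j, 0.0)          # <w_j, zeta_j(.)>, with zeta_j = ReLU
    return out

x, z = rng.normal(size=d), rng.normal(size=2)
print(f(x, z).shape)  # (c,): one output per task target
```

Changing the task requirement `z` changes all $c$ outputs at test time without retraining, which is exactly the controllability the framework formalizes.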
The generalization performance is measured by the expected risk: $R(f^z) = \mathbb{E}_{(x,y) \sim P}[\ell(f^z(x), y)]$. We denote the empirical risk w.r.t. the training dataset $D$ as $\widehat{R}_D(f^z) = \frac{1}{n} \sum_{i=1}^n \ell(f^z(x_i), y_i)$. In addition, we define the loss function space as $\mathcal{L} = \{\ell(f^z(x), y) : f^z \in \mathcal{F}\}$. However, the above-mentioned loss is typically the 0-1 loss, which is hard to handle in practice. Hence, one usually considers its surrogate losses.

3.2. Related Evaluation Metrics

Although controllable learning has been implicitly used in modern information retrieval, there is still a lack of specific evaluation metrics to measure the generalization performance of different controllable learning methods. Here we focus on two commonly used evaluation metrics in controllable learning, i.e., the weighted Hamming loss and the bipartite ranking loss, and define their surrogate losses. The surrogate loss for the weighted Hamming loss is denoted by: $\ell_W(f^z(x), y) = \frac{1}{c} \sum_{j=1}^c v_j \ell_j(f^z_j(x), y_j)$, where $\ell_j$ is the loss of the $j$-th task target, and $v_j$ is the weight for the $j$-th task target, bounded by $|v_j| \le V$ for any $j \in [c]$. For bipartite ranking problems in controllable learning, since they involve pairwise losses, we need to additionally define the corresponding risks. Let $p_j$ be the probability that a sample is relevant to the $j$-th label. $D_j^+$ denotes the conditional distribution of the samples over $\mathcal{X}$ given that the samples are relevant to the $j$-th label, and $D_j^-$ denotes the conditional distribution of the samples over $\mathcal{X}$ given that the samples are irrelevant to the $j$-th label. We define the expected risk w.r.t. multi-target bipartite ranking for controllable learning as follows: $R(f^z) = \sum_{j=1}^c p_j R(f^z | j) = \sum_{j=1}^c p_j\, \mathbb{E}_{x_i \sim D_j^+, x'_i \sim D_j^-}\left[\ell^{0/1}_j\left(f^z_j(x_i) - f^z_j(x'_i)\right)\right]$. In addition, the empirical risk w.r.t.
multi-target bipartite ranking is defined as

$$\widehat{R}_D(f^z) = \sum_{j=1}^c \widehat{R}_D(f^z | j) = \sum_{j=1}^c \frac{1}{|X_j^+| |X_j^-|} \sum_{x_i \in X_j^+} \sum_{x'_i \in X_j^-} \ell^{0/1}_j\left(f^z_j(x_i) - f^z_j(x'_i)\right), \quad (2)$$

where $X_j^+ = \{x_i \mid y_{ij} = +1, i \in [n]\}$ ($X_j^- = \{x'_i \mid y_{ij} = -1, i \in [n]\}$) corresponds to the set of the samples that are relevant (irrelevant) to the $j$-th label. The surrogate loss for the multi-target bipartite ranking loss is denoted by: $\ell_R(f^z(x_i, x'_i), y) = \frac{1}{c} \sum_{j=1}^c \ell_j\left(f^z_j(x_i) - f^z_j(x'_i)\right)$, where $x_i$ ($x'_i$) corresponds to the instances that are relevant (irrelevant) to the $j$-th label.

3.3. Related Complexity Measures

Here we introduce the related complexity measures involved in our theoretical results. The Rademacher complexity is used to perform generalization analysis for controllable learning.

Definition 3.1 (Rademacher complexity). Let $\mathcal{F}$ be a class of real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Let $D = \{x_1, \ldots, x_n\}$ be a set with $n$ i.i.d. samples. The empirical Rademacher complexity over $\mathcal{F}$ is defined by $\widehat{\Re}_D(\mathcal{F}) = \mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i)\right]$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables. In addition, we define the worst-case Rademacher complexity as $\Re_n(\mathcal{F}) = \sup_{D \in \mathcal{X}^n} \widehat{\Re}_D(\mathcal{F})$.

The vector-valued function class $\mathcal{F}$ of controllable learning makes traditional analysis methods developed for the Rademacher complexity of scalar-valued function classes invalid. Hence, we need new analysis tools to convert the Rademacher complexity of a loss function space associated with $\mathcal{F}$ into the Rademacher complexity of a tractable scalar-valued function class. The Rademacher complexity can be bounded by other scale-sensitive complexity measures, e.g., the covering number and the fat-shattering dimension (Srebro et al., 2010; Zhang & Zhang, 2023). The relevant definitions are provided in the appendix.

4. General Bounds for Controllable Learning

In this section, we first introduce the assumptions used.
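The empirical Rademacher complexity of Definition 3.1 can be made concrete numerically. The sketch below estimates $\widehat{\Re}_D(\mathcal{F})$ by Monte Carlo for the norm-bounded linear class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \le \Lambda\}$, for which the supremum has the closed form $\Lambda \|\frac{1}{n}\sum_i \epsilon_i x_i\|_2$; the data and the class are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, Lam = 200, 5, 1.0
X = rng.normal(size=(n, d))   # hypothetical sample D = {x_1, ..., x_n}

def empirical_rademacher(X, n_mc=2000):
    """Monte-Carlo estimate of E_eps[sup_{||w||_2 <= Lam} (1/n) sum_i eps_i <w, x_i>],
    using the closed-form supremum Lam * ||(1/n) sum_i eps_i x_i||_2."""
    n = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))   # i.i.d. Rademacher variables
    return (Lam * np.linalg.norm(eps @ X / n, axis=1)).mean()

est = empirical_rademacher(X)
print(0.0 < est < Lam)  # True: the estimate decays at the familiar O(1/sqrt(n)) rate
```

Shrinking the sample (e.g., `X[:50]`) visibly increases the estimate, matching the $O(1/\sqrt{n})$ behavior that the bounds in Section 4 exhibit.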
Then, we develop a novel vector-contraction inequality for the Rademacher complexity of the vector-valued function class $\mathcal{F}$. Finally, with the novel vector-contraction inequality, we derive bounds for general function classes of controllable learning with no dependency on the number of task targets, up to logarithmic terms, which achieve the state of the art. The proof sketches and detailed proofs of the theoretical results in this paper are provided in the appendix.

Assumption 4.1. Assume that the input features, the loss function and each task target of the learner are bounded: $\|x_i\|_2 \le R$, $\ell(\cdot, \cdot) \le M$, $|f^z_j(\cdot)| \le E$ for $i \in [n]$, $j \in [c]$, where $R, M, E > 0$ are constants.

Assumption 4.2. Assume that the loss function $\ell$ is $\mu$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm, that is: $|\ell(f(x), \cdot) - \ell(f'(x), \cdot)| \le \mu \|f(x) - f'(x)\|_\infty$, where $\mu > 0$ and $\|t\|_\infty = \max_{j \in [c]} |t_j|$ for $t = (t_1, \ldots, t_c)$.

Assumptions 4.1 and 4.2 are mild. For Assumption 4.1, the boundedness of the input features is easy to satisfy, since in practice normalization is often applied to the input features. In addition, for the function class of controllable learning (1), we often use the assumptions $\|w_j\|_2 \le \Lambda$, $\|\zeta_j(\cdot)\|_2 \le A$ for any $j \in [c]$ to replace the boundedness of each task target of the learner, i.e., $E := \Lambda A$, and $A$ can be further refined by the specific model used in controllable learning. For Assumption 4.2, the Lipschitz continuity w.r.t. the $\ell_\infty$ norm has been considered in several works (Foster & Rakhlin, 2019; Lei et al., 2019; Wu et al., 2021; Zhang & Zhang, 2023; 2024a). The following Proposition 4.3 further illustrates that the commonly used loss functions in controllable learning actually satisfy Assumption 4.2.

Proposition 4.3. Assume that the loss of each output task target $\ell_j$ defined in Subsection 3.2 is $\mu$-Lipschitz continuous. Then the surrogate weighted Hamming loss is $\mu V$-Lipschitz w.r.t. the $\ell_\infty$ norm, and the surrogate multi-target bipartite ranking loss is $\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm.
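Proposition 4.3 can be checked empirically. The sketch below takes the 1-Lipschitz base loss $\ell_j(t, y) = |t - y|$ (so $\mu = 1$), assumes the $(1/c)$-averaged form of the surrogate weighted Hamming loss from Subsection 3.2, and verifies over random draws that the difference quotient w.r.t. the $\ell_\infty$ norm never exceeds $\mu V$; all concrete values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
c, V, mu = 8, 2.0, 1.0

# Hypothetical base loss ell_j(t, y) = |t - y| is 1-Lipschitz (mu = 1).
v = rng.uniform(-V, V, size=c)   # per-target weights with |v_j| <= V

def ell_W(u, y):
    """Surrogate weighted Hamming loss (assuming the (1/c)-averaged form)."""
    return np.mean(v * np.abs(u - y))

# Empirically probe the Lipschitz constant w.r.t. the l_inf norm.
y = rng.choice([-1.0, 1.0], size=c)
worst_ratio = 0.0
for _ in range(1000):
    u, up = rng.normal(size=c), rng.normal(size=c)
    gap = abs(ell_W(u, y) - ell_W(up, y))
    worst_ratio = max(worst_ratio, gap / np.linalg.norm(u - up, ord=np.inf))
print(worst_ratio <= mu * V)  # True: consistent with the mu*V constant in Prop. 4.3
```

The inequality chain behind this is exactly the one-line proof of Proposition 4.3: $|\ell_W(u, y) - \ell_W(u', y)| \le \frac{1}{c}\sum_j |v_j|\,\mu |u_j - u'_j| \le \mu V \|u - u'\|_\infty$.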
In fact, each task target in controllable learning can be completely different, hence the coupling relationship between the weights of the functions corresponding to each task target needs to be decoupled. To this end, we define a projection operator $p_j : \mathbb{R}^c \mapsto \mathbb{R}$ that projects the $c$-dimensional output vector onto the $j$-th coordinate, $p_j \in P$, $j \in [c]$. Then, we have the projection function class $P(\mathcal{F}) = \{(j, x) \mapsto p_j(f^z(x)) : p_j(f^z(x)) = f^z_j(x), f^z \in \mathcal{F}, (j, x) \in [c] \times \mathcal{X}\}$, which decouples the relationship among different task targets. With Assumption 4.2 and the above definitions, we develop the following novel vector-contraction inequality for controllable learning to show that the Rademacher complexity of the loss function space $\mathcal{L}$ can be bounded by the worst-case Rademacher complexity of the projection function class $P(\mathcal{F})$:

Lemma 4.4. Let $\mathcal{F}$ be a function class of controllable learning defined by (1). Let Assumptions 4.1 and 4.2 hold. Given a dataset $D$ of size $n$, we have

$$\widehat{\Re}_D(\mathcal{L}) \le \frac{12M}{n} + 96\mu\sqrt{c}\,\widetilde{\Re}_{nc}(P(\mathcal{F}))\left(1 + \log_2\left(4en^2c^2\mu^2\right)\ln(Mn)\right),$$

where $\widehat{\Re}_D(\mathcal{L}) = \mathbb{E}_\epsilon\left[\sup_{\ell \in \mathcal{L}, f^z \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \epsilon_i \ell(f^z(x_i))\right]$ is the empirical Rademacher complexity of the loss function space associated with $\mathcal{F}$, and $\widetilde{\Re}_{nc}(P(\mathcal{F}))$ is the worst-case Rademacher complexity of the projection function class.

Remark 4.5. The difficulty of theoretical analysis for controllable learning lies in two aspects. First, the development of controllable learning is driven by real-world applications, and the methods developed are quite different. It is difficult to establish a unified theoretical framework to cover all these typical methods. The lack of a unified framework is the most intuitive and primary obstacle to using existing analytical tools to establish an effective theoretical bound for controllable learning. The definition of the controllable learning function class and the modular theoretical results provided ensure the establishment of a unified theoretical framework.
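The decoupling device above is mechanical: each vector-valued example is expanded into $c$ scalar-valued "examples" indexed by the target. A minimal sketch of the projection operator and the induced auxiliary dataset of size $nc$ (inputs are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 3
X = rng.normal(size=(n, 4))   # hypothetical inputs x_1, ..., x_n

# Projection operator p_j: R^c -> R keeps the j-th coordinate of the output vector.
def p(j, out):
    return out[j]

# Auxiliary dataset induced by the projection functions: pairing every example x_i
# with every target index j yields nc scalar-valued "examples" (j, x_i); this is the
# sample over which the worst-case complexity in Lemma 4.4 is taken.
aux = [(j, X[i]) for i in range(n) for j in range(c)]
print(len(aux))  # n * c = 18
```

Working over these $nc$ scalar examples is what lets the $\sqrt{c}$ factor in Lemma 4.4 cancel against $\widetilde{\Re}_{nc}(P(\mathcal{F}))$, as Remark 4.8 discusses.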
Second, about reducing the dependency on $c$: the analysis of the number of task targets in the bounds can be traced back to a basic bound with a linear dependency on $c$, which comes from the typical vector-contraction lemma in (Maurer, 2016). The dependency on $c$ can be improved to square-root or logarithmic by preserving the coupling among different components, i.e., $\|w\| \le \Lambda$. However, each task target in controllable learning can be completely different, and the coupling relationship needs to be decoupled. Hence, we introduce the projection operator. We found that the square-root dependency on $c$ is inevitable for an $\ell_2$-Lipschitz loss, which essentially comes from the $\sqrt{c}$ factor in the radius of the empirical $\ell_2$ cover of the projection function class, but the $\ell_\infty$-Lipschitz continuity of the loss can eliminate it. Hence, tight bounds with no dependency on $c$, up to logarithmic terms, can be derived.

Remark 4.6. The construction of the auxiliary dataset induced by projection functions can reduce the dependency on the output dimension. For the construction of other types of auxiliary datasets under different problem settings and similar ideas, please refer to (Lei et al., 2019; Wu et al., 2021; Mustafa et al., 2021; Hieu et al., 2024; Lei et al., 2023; Graf et al., 2022). Regarding a more desirable constant, the use of a discretized variant of Dudley's integral inequality in (Lei et al., 2023) suggests that this can be achieved.

With the vector-contraction inequality above, we can derive the following tight bound for the surrogate weighted Hamming loss:

Theorem 4.7. Let $\mathcal{F}$ be a function class of controllable learning defined by (1). Let Assumptions 4.1 and 4.2 hold. Given a dataset $D$ of size $n$, for the surrogate weighted Hamming loss, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f^z \in \mathcal{F}$:

$$R(f^z) \le \widehat{R}_D(f^z) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}} + \frac{24M}{n} + \frac{192\mu V \Lambda A\left(1 + \log_2\left(4en^2c^2\mu^2\right)\ln(Mn)\right)}{\sqrt{n}}.$$

Remark 4.8.
Although Lemma 4.4 contains a $\sqrt{c}$ factor, we have $\widetilde{\Re}_{nc}(P(\mathcal{F})) \le \frac{\Lambda A}{\sqrt{nc}}$, which makes the Rademacher complexity of the loss function space $\mathcal{L}$ actually independent of $c$ and results in a tight $\widetilde{O}(1/\sqrt{n})$ bound with no dependency on the number of task targets. The projection function class combined with the $\ell_\infty$ norm Lipschitz continuity of loss functions is the key to inducing a generalization bound with no dependence on $c$ except for logarithmic factors. Theorem 4.7 shows that when the number of task targets increases, the generalization bound becomes only slightly larger, i.e., the increase is only logarithmic. This means that although the increase in the number of task targets will affect the difficulty of learning, since Theorem 4.7 implies that the increase in difficulty is logarithmic, the real impact may only come from a few important task targets. Therefore, a controllable learning method can eventually obtain a learner with good generalization performance as long as it can learn these task targets well. Our theoretical results can provide general theoretical guarantees for controllable learning methods that handle many task targets well in practice (Chen et al., 2023; He et al., 2022; Chang et al., 2023).

Since the surrogate multi-target bipartite ranking loss involves pairwise functions, a sequence of pairs of i.i.d. individual observations in (2) is no longer independent, which makes standard techniques in the i.i.d. case for the traditional Rademacher complexity inapplicable. Inspired by (Clémençon et al., 2008), we convert the non-sum-of-i.i.d. pairwise function to a sum-of-i.i.d. form by using permutations in the U-process. Hence, for each task target, we define the construction method of the set of i.i.d. disjoint positive and negative sample pairs for the $j$-th class label as follows: 1. We denote $|X_j^+|$ and $|X_j^-|$ by $t_j$ and $u_j$, with $t_j + u_j = n$ for any $j \in [c]$. We denote the number of disjoint positive and negative sample pairs as $s_j = \min\{t_j, u_j\}$. 2.
We uniformly select a positive sample from the set of the samples that are relevant to the $j$-th label and select a negative sample from the set of the samples that are irrelevant to the $j$-th label, then form the selected positive and negative samples into a pair $(x_i^{j+}, x_i^{j-})$. 3. We construct the set of positive and negative sample pairs by matching the samples from the set of the samples that are relevant to the $j$-th label with the samples from the set of the samples that are irrelevant to the $j$-th label, until one of the two sets exhausts its available samples for selection. We denote the set of i.i.d. disjoint positive and negative sample pairs for the $j$-th class label as $D_j$, with $|D_j| = s_j$. With these definitions, we then derive the tight bound for the surrogate multi-target bipartite ranking loss as follows:

Theorem 4.9. Let $\mathcal{F}$ be a function class of controllable learning defined by (1). Let Assumptions 4.1 and 4.2 hold. Given a dataset $D$ of size $n$, for the surrogate multi-target bipartite ranking loss, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f^z \in \mathcal{F}$:

$$R(f^z) \le \widehat{R}_D(f^z) + 24\sqrt{2}Mq\sqrt{\frac{\ln(2/\delta)}{s_0}} + \frac{22Mq}{s_0} + \frac{\sqrt{2}\,q\mu\Lambda A\left(1 + \log_2\left(4es_0^2c^2\mu^2\right)\ln(Ms_0)\right)}{\sqrt{s_0}},$$

where $s_0 = n \min\{\min_j p_j, \min_j(1 - p_j)\}$ and $q = \sum_{j=1}^c p_j$.

Remark 4.10. The above bound shows a logarithmic dependency on the number of task targets with a faster convergence rate $\widetilde{O}(1/\sqrt{s_0})$. When the examples of the positive and negative classes are balanced, which is the ideal situation, we have $t_j = u_j$ for any $j \in [c]$ and $s_0 = \frac{n}{2}$, so the order of the bound is $\widetilde{O}(1/\sqrt{n})$. When class imbalance occurs, $s_0$ will clearly be smaller than $\frac{n}{2}$. If the class imbalance is more serious, $s_0$ will be smaller, which will lead to a looser bound for the surrogate multi-target bipartite ranking loss.
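The three-step pair construction above can be sketched directly; the data below are hypothetical, and random permutation stands in for the uniform matching without replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 12, 3
X = rng.normal(size=(n, 4))
Y = rng.choice([-1, 1], size=(n, c))   # hypothetical multi-label relevance matrix

def disjoint_pairs(j):
    """Steps 1-3 for label j: uniformly match relevant with irrelevant samples,
    without replacement, until one side is exhausted, giving s_j = min{t_j, u_j} pairs."""
    pos = rng.permutation(np.flatnonzero(Y[:, j] == +1))   # t_j relevant samples
    neg = rng.permutation(np.flatnonzero(Y[:, j] == -1))   # u_j irrelevant samples
    s_j = min(len(pos), len(neg))
    return [(X[pos[i]], X[neg[i]]) for i in range(s_j)]

D_0 = disjoint_pairs(0)
t_0 = int((Y[:, 0] == +1).sum())
print(len(D_0) == min(t_0, n - t_0))  # True: |D_j| = s_j
```

Because each sample appears in at most one pair, the $s_j$ pairs are independent, which is what restores the standard i.i.d. Rademacher machinery used in Theorem 4.9.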
It means that when class imbalance becomes more and more serious, if the learned classifier cannot handle the problem of class imbalance well, then its performance on the multi-target bipartite ranking loss will be worse.

Remark 4.11. The term $\sum_j p_j$ is involved in the above bound, which indicates that each sample is associated with multiple labels, and each task target can correspond to a label in controllable learning. Real-world scenarios and practical tasks imply that the presence of the term $\sum_j p_j$ introduces two challenges. First, in label-sparse tasks, ensuring the applicability of the above theoretical result requires establishing a connection between the term $\sum_j p_j$ and appropriate sparsity conditions, thereby revealing how sparsity influences generalization bounds. Second, in recommender systems, while users may simultaneously prefer multiple product categories, excessive probability summation could lead to excessively long recommendation lists, thereby diminishing personalization efficacy. Therefore, designing suitable regularization terms may be necessary to address this issue. In future work, we will further explore solutions to the limitations induced by the term $\sum_j p_j$ from these two perspectives.

5. General Bounds for Typical Controllable Learning Methods

In this section, we analyze the generalization bounds for two typical controllable learning methods, i.e., embedding-based and hypernetwork-based controllable learning methods. In order to improve the readability and enhance the applicability of the developed theoretical tools, we decompose the process of generalization analysis of these controllable learning methods into multiple modules. In this way, the bounds corresponding to the specific controllable learning methods are flexible combinations of these modular theoretical components.
Specifically, in Subsection 5.1, we establish refined formal definitions for embedding-based and hypernetwork-based methods (i.e., $\phi_j$ in class (1)) and give their general theoretical analysis methods. In Subsection 5.2, we give the formal definitions of commonly used control and prediction functions (i.e., $\psi_j$ and $\zeta_j$ in class (1)), and derive their Rademacher complexities. In Subsection 5.3, we derive the corresponding generalization bounds for specific controllable learning methods. This process shows how to flexibly apply the modular theoretical results developed in this paper. We show that different manipulation methods based on the input and control function will lead to significant differences in the constant $A$ of the generalization bounds in Theorems 4.7 and 4.9.

5.1. Theoretical Analysis Methods for Embedding-based and Hypernetwork-based Methods

Embedding-based Methods. Embedding-based methods aim to map the task requirement $z$ into a latent space of the inputs $x$. This can be seen as a pre-processing technique at test time, where controllable learners adapt to the task requirement solely by modifying the input features, making them easy to implement in tasks like recommender systems or classification. Formally, for each task target, we denote the output of the control function as $a_j := \psi_j(z) \in \mathbb{R}^{d'}$, and $\phi_j$ is a concatenation function denoted as $\mathrm{Concate}(\cdot, \cdot)$, which concatenates $x$ and $a_j$, i.e., $\mathrm{Concate}(x, a_j) = [x_1, \ldots, x_d, a_{j,1}, \ldots, a_{j,d'}] \in \mathbb{R}^{d + d'}$. Then, a family of $c$ classifiers $f_j$ with $\rho$-Lipschitz nonlinear mappings (i.e., $\zeta_j$ is $\rho$-Lipschitz) is learned on the generated concatenated representations. With the above definitions, we can derive the upper bound of the worst-case Rademacher complexity for embedding-based controllable learning methods:

Theorem 5.1. Let $\mathcal{F}$ be a function class of embedding-based controllable learning methods defined by (1). Let Assumptions 4.1 and 4.2 hold. Given a dataset $D$ of size $n$.
Then, the worst-case Rademacher complexity of the corresponding projection function class can be bounded as: $\widetilde{\Re}_{nc}(P(\mathcal{F})) \le \frac{2\rho\Lambda(R + C)}{\sqrt{nc}}$. The constant $A$ of the generalization bounds in Theorems 4.7 and 4.9 corresponds to $2\rho(R + C)$ here.

Hypernetwork-based Methods. Hypernetwork-based methods use a network as the control function $\psi$ to generate the model parameters of the prediction function, effectively allowing the control function to control and adjust the parameters of the prediction function. These methods can be seen as an in-processing technique at test time, offering flexibility in adapting model parameters to various task requirements. This makes hypernetworks an appealing approach for applications that require dynamic and task-specific parameter adjustments. Formally, for each task target, the weights of the nonlinear function $\phi_j$ are generated by a hypernetwork $\psi_j$ (i.e., the model corresponding to the control function is a hypernetwork); we denote the generated vector as $w_{\psi_j}$, and $\phi_j$ can be denoted as $\phi_j(x; w_{\psi_j})$. Furthermore, $w_{\psi_j}$ is determined by minimizing the $\mu_{\mathrm{ctrl}}$-Lipschitz loss $\ell_{\mathrm{ctrl}}$ defined on the control function $\psi_j$, and we denote the loss function space corresponding to $\ell_{\mathrm{ctrl}}$ as $\mathcal{L}_{\mathrm{ctrl}}$. We also define the class of control functions $\psi_j$ as $\mathcal{G}_j$, $j \in [c]$, and use $\mathcal{G}^{\otimes c}$ to denote the $c$-fold Cartesian product of the $c$ function classes $\mathcal{G}_j$. With the above definitions, we can derive the upper bound of the worst-case Rademacher complexity for hypernetwork-based controllable learning methods:

Theorem 5.2. Let $\mathcal{F}$ be a function class of hypernetwork-based controllable learning methods defined by (1). Let Assumptions 4.1 and 4.2 hold. Given a dataset $D$ of size $n$. Then, the worst-case Rademacher complexity of the corresponding projection function class can be bounded as: $\widetilde{\Re}_{nc}(P(\mathcal{H})) \le \frac{2\rho\Lambda B}{\sqrt{nc}} + \frac{2\mu_{\mathrm{ctrl}} C}{\sqrt{nc}}$, where $\mathcal{H} = \mathcal{F} + \mathcal{L}_{\mathrm{ctrl}} \circ \mathcal{G}^{\otimes c}$ is the whole function class corresponding to the hypernetwork-based methods.

Remark 5.3.
The hypernetwork-based controllable learning method is two-stage: the hypernetwork generates a vector of weights (i.e., model parameters) for each task target in the first stage, and these generated parameter vectors are then used by the learner in the second stage. Therefore, in order to fully account for the capacity of the hypernetwork in the first stage, it is necessary and appropriate to define the whole function class as the sum $\mathcal{H} = \mathcal{F} + \mathcal{L}_{\mathrm{ctrl}} \circ \mathcal{G}^{\otimes c}$ of the function classes corresponding to the models of the two stages. The bound for the hypernetwork-based method can be obtained by combining Theorem 5.2 with Lemma 4.4, with $\widetilde{\mathfrak{R}}_{nc}(\mathcal{P}(\mathcal{F}))$ in Lemma 4.4 replaced by $\widetilde{\mathfrak{R}}_{nc}(\mathcal{P}(\mathcal{H}))$. The constant $A$ of the bounds in Theorems 4.7 and 4.9 corresponds to $2\rho B$ here. However, unlike the embedding-based methods, the introduction of the hypernetwork class $\mathcal{G}_j$ leads to an additional increase in complexity, i.e., the last term in Theorem 5.2.

5.2. The Rademacher Complexities for Commonly-used Control and Prediction Functions

In this subsection, we first formally introduce the control and prediction functions commonly used in controllable learning, mainly two models, the Feedforward Neural Network (FNN) and the Transformer, and then derive upper bounds on their Rademacher complexities. These theoretical results can be used as components to derive the generalization bounds for specific controllable learning methods.

Feedforward Neural Network (FNN) We define a feedforward neural network as follows:
$$f(x) = \langle w, \sigma(W_L \sigma(W_{L-1} \cdots \sigma(W_1 x))) \rangle,$$
where the $W_l$, $l \in [L]$, are the parameter matrices and $\sigma$ is the ReLU activation. With the above definitions, we can derive the upper bound of the Rademacher complexity for the FNN:

Theorem 5.4. Let $\mathcal{F}$ be a function class of FNNs. Assume that the parameter matrices in the FNN are bounded, i.e., $\|W_l\| \le B_F$ for any $l \in [L]$, where $B_F > 0$ is a constant, and let $D$ be a dataset of size $n$.
Then, the Rademacher complexity of a 5-layer FNN can be bounded as:
$$\hat{\mathfrak{R}}_D(\mathcal{F}) \le \frac{2^5 \Lambda B_F^5 R}{\sqrt{n}}.$$

Remark 5.5. According to the proof, it is clear that $\hat{\mathfrak{R}}_D(\mathcal{F}) \le 2^L \Lambda B_F^L R / \sqrt{n}$ for an $L$-layer FNN. The capacity-based generalization analysis of the deep models involved in Subsection 5.2 mainly uses the peeling argument, a commonly used method in capacity-based theoretical analysis. The main idea of peeling is to reduce the complexity bound for $l$-layer networks to a complexity bound for $(l-1)$-layer networks; each reduction introduces a product factor consisting of a Lipschitz constant of the activation function and an upper bound on the weight matrix norm. After $l$ reductions, the accumulated product factors, with their exponential dependence on the depth, may make the bound vacuous (Neyshabur et al., 2015; Bartlett et al., 2017; 2019). How to develop non-vacuous capacity-based generalization analysis for deep models is still an open question (Neyshabur et al., 2018; Golowich et al., 2018; Zhang & Zhang, 2023), and we will explore it further in the future. However, our goal here is to provide modular theoretical results for the generalization analysis of controllable learning, and the depth of the models used in controllable learning is often limited or even shallow, so the theoretical results in Subsection 5.2 remain valid.

Transformer We follow the definition and notation of self-attention and the Transformer in (Edelman et al., 2022). Let $W_C \in \mathbb{R}^{k \times d}$, $W_V \in \mathbb{R}^{d \times k}$, and $W_Q, W_K \in \mathbb{R}^{d \times k}$ be trainable weight matrices. Let $X := [x_1, x_2, \ldots, x_T]^\top \in \mathbb{R}^{T \times d}$ be the input, i.e., a sequence of $T$ $d$-dimensional tokens. Let $\sigma$ be an element-wise $L_\sigma$-Lipschitz activation function with $\sigma(0) = 0$. Then, a Transformer layer is defined as
$$\sigma\big(\mathrm{RowSoftmax}(X W_Q W_K^\top X^\top)\, X W_V W_C\big),$$
where $\mathrm{RowSoftmax}$ applies the softmax to each row of its input. For convenience of analysis, we represent $W_Q W_K^\top$ by a single matrix $W_{QK} \in \mathbb{R}^{d \times d}$. Here we focus on the scalar output of the Transformer.
We construct a special extra input in the length-$T$ sequence with index $[\mathrm{CLS}]$, which is a fixed or trainable vector $x_{[\mathrm{CLS}]} \in \mathbb{R}^d$. Finally, the output at index $[\mathrm{CLS}]$ is a linear function $\langle w, y_{[\mathrm{CLS}]} \rangle$ for a trainable parameter $w \in \mathbb{R}^d$. This setup for scalar output is used in BERT (Devlin et al., 2019) and all of its derivatives. The scalar one-layer Transformer is $\langle w, y_{[\mathrm{CLS}]} \rangle$, where
$$y_{[\mathrm{CLS}]} = W_C^\top \sigma\big(W_V^\top X^\top \mathrm{softmax}(X W_{QK}\, x_{[\mathrm{CLS}]})\big).$$
Then, we have the following bound on the Rademacher complexity of the Transformer:

Theorem 5.6. Let $\mathcal{F}$ be a function class of Transformers. Assume that the parameter matrices in the Transformer are bounded, i.e., $\|w\| \le B_w$, $\|W_C\| \le B_{W_C}$, $\|W_V\| \le B_{W_V}$, $\|W_{QK}\| \le B_{W_{QK}}$, and $\|x_t\| \le R$ for any $t \le T$, and let $D$ be a dataset of size $n$. Then, the Rademacher complexity of the scalar one-layer Transformer can be bounded as:
$$\hat{\mathfrak{R}}_D(\mathcal{F}) \le \frac{4 B_w B_{W_C} B_{W_V} B_{W_{QK}} L_\sigma R^3}{\sqrt{n}}.$$

Remark 5.7. Theorem 5.6 implies that the order of the bound for the Transformer is $O(\frac{1}{\sqrt{n}})$ with no dependence on the sequence length $T$, which is tight w.r.t. the dependence on $T$ and matches the recent theoretical bound for Transformers (Trauger & Tewari, 2024). The key to removing the dependence on $T$ is to avoid summation in the norm, where the Lipschitz property of the softmax is used, i.e., Corollary A.7 in (Edelman et al., 2022). The symbol $\|\cdot\|$ denotes the Frobenius norm. (Edelman et al., 2022) mainly derives bounds for Transformers through covering numbers: it uses the Lipschitz property of functions to convert the covering number of the Transformer function class into those of vector-valued linear classes, and uses the upper bounds of covering numbers for vector-valued linear classes to obtain the result for Transformers. The $\|\cdot\|_{2,1}$ norm involved therein originates from the upper bounds of covering numbers for vector-valued linear classes.
Specifically, when upper bounding the covering number of the vector-valued linear class, the summation of covering numbers for the scalar-valued linear classes corresponding to each component of the vector-valued linear function induces the $\ell_1$ part of the $\|\cdot\|_{2,1}$ norm. Our result is derived using the standard Rademacher analysis, following the conventional neural-network analysis method, which naturally induces the Frobenius norm. We do not emphasize tightness regarding norm constraints here. Our main purpose here is to remove the dependence on $T$, so we focus on analyzing a single-layer Transformer as an example. For the bound in the deep case, similarly to other deep models, the peeling argument is used for the analysis, and the dependence of the induced bound on depth is similar to existing theoretical results (Edelman et al., 2022; Trauger & Tewari, 2024).

5.3. Generalization Bounds for Specific Controllable Learning Methods

Recently, the controllability of AI has become a key issue in many applications, particularly in information access applications such as recommender systems (Joachims, 2024; Shen et al., 2024). This subsection uses state-of-the-art controllable learning methods in recommender systems as examples to concretize the nonlinear mapping $\zeta$ and the control function $\psi$ of the controllable learner in (1), and further provides generalization bounds for embedding-based and hypernetwork-based controllable learning as flexible combinations of the above modular theoretical results.

Bounds for Embedding-based Controllable Learning In embedding-based controllable recommendation models, the task description $z$ is often expressed in natural language (e.g., editable user profiles). After being processed by a Transformer language-model encoder, the obtained embeddings are used for the subsequent recommendation task (i.e., the control function $\psi$ is typically implemented as a Transformer) (Gao et al., 2024; Mysore et al., 2023; Kong et al., 2024).
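A minimal sketch of this embedding-based pipeline may help make the construction concrete: a control function $\psi$ encodes the task description $z$ into an embedding $a = \psi(z)$, which is concatenated with the input features as in Subsection 5.1 and fed to $c$ linear scorers. All shapes, the shared (rather than per-target) control embedding, and the tiny tanh "encoder" standing in for the Transformer are illustrative assumptions, not the cited systems' implementations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ctrl, c, z_dim = 8, 4, 3, 5   # input dim d, control dim d', task targets c, task-description dim

# Illustrative fixed parameters (stand-ins for learned weights).
W_psi = rng.standard_normal((d_ctrl, z_dim))   # control function psi (the text: a Transformer encoder)
W_cls = rng.standard_normal((c, d + d_ctrl))   # c classifiers on the concatenated representation

def concate(x, a):
    """Concate(x, a) from Subsection 5.1: [x_1..x_d, a_1..a_{d'}]."""
    return np.concatenate([x, a])

def f(x, z):
    """Embedding-based controllable learner: adapt to z only through the input."""
    a = np.tanh(W_psi @ z)          # a = psi(z); tanh plays the rho-Lipschitz nonlinearity
    return W_cls @ concate(x, a)    # scores for the c task targets

x = rng.standard_normal(d)
z = rng.standard_normal(z_dim)      # task requirement (e.g., an encoded user profile)
scores = f(x, z)
assert scores.shape == (c,)
```

Controllability is exercised purely through the input: changing $z$ changes $a$ and hence the scores, while all trained weights stay fixed.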
The nonlinear mapping $\zeta$ is often an FNN (Shen et al., 2021; Chang et al., 2023; Penaloza et al., 2024), which receives the embeddings generated by $\phi$ and aligns them with the user's preference patterns. Then, we have:

Theorem 5.8. Let $\mathcal{F}$ be a function class of embedding-based controllable learning methods defined by (1), where the control function is a Transformer and the classifier is induced by a three-layer FNN. Let Assumptions 4.1 and 4.2 hold, and let $D$ be a dataset of size $n$. Then, for the surrogate weighted Hamming loss, for any $0 < \delta < 1$, with probability at least $1-\delta$, the following holds for any $f^z \in \mathcal{F}$:
$$R(f^z) \le \hat{R}_D(f^z) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}} + \frac{24M}{\sqrt{n}} + \frac{192\mu_V\big(1 + \log_2(4en^2c^2\mu^2)\ln\frac{M\sqrt{n}}{\mu E}\big)\Lambda\, 2^3 B_F^3 \big(R + 4B_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3\big)}{\sqrt{n}},$$
where $B_F, B_{W_C}, B_{W_V}, B_{W_{QK}}, L_\sigma > 0$ are constants.

Proof. According to Theorems 5.4 and 5.6, $2\rho$ and $C$ in Theorem 5.1 correspond to $2^3 B_F^3$ and $4B_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3$, respectively. This is straightforward since the processes of upper bounding these Rademacher complexities are similar; we only need to find the upper bounds for $\zeta$ and $\phi$ corresponding to the specific models in Theorem 5.1 and substitute them. Then, according to Theorem 5.1, $A := 2\rho(R+C)$. Finally, the desired bound is obtained by replacing $A$ in Theorem 4.7 with $2^3 B_F^3 (R + 4B_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3)$.

Bounds for Hypernetwork-based Controllable Learning In hypernetwork-based controllable recommendation models, the task description $z$ is often represented as a task indicator or a preference weight vector over various metrics. In such cases, the hypernetwork is typically chosen to be an FNN (Chen et al., 2023; Shen et al., 2023; He et al., 2022; Li et al., 2023; Yan et al., 2022). The hypernetwork takes $z$ as input and outputs model parameters, which replace certain parameters in the main recommendation model (i.e., the nonlinear mapping $\zeta$).
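A minimal sketch of this in-processing paradigm, assuming a two-layer ReLU FNN hypernetwork that emits the weight matrix of a linear prediction head (the shapes and architecture are illustrative, not those of any cited system):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, z_dim, hidden = 8, 3, 5, 16   # input dim, task targets, task-description dim, hypernet width

# Hypernetwork psi: a small FNN taking z and emitting the main model's parameters.
W1 = rng.standard_normal((hidden, z_dim)) / np.sqrt(z_dim)
W2 = rng.standard_normal((c * d, hidden)) / np.sqrt(hidden)

def hypernet(z):
    """w_psi = psi(z): generated parameter vector, reshaped into a (c, d) weight matrix."""
    h = np.maximum(0.0, W1 @ z)     # ReLU hidden layer
    return (W2 @ h).reshape(c, d)

def f(x, z):
    """Main prediction model whose parameters are produced by the hypernetwork."""
    W_gen = hypernet(z)             # parameters change with the task requirement z ...
    return W_gen @ x                # ... with no retraining of the main model

x = rng.standard_normal(d)
scores_a = f(x, np.array([1., 0., 0., 0., 0.]))   # task requirement A
scores_b = f(x, np.array([0., 1., 0., 0., 0.]))   # task requirement B
assert scores_a.shape == (c,) and not np.allclose(scores_a, scores_b)
```

Here adaptation happens in parameter space rather than input space: the main model's weights `W_gen` are regenerated from the task requirement $z$ at test time.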
In this hypernetwork-based paradigm, the method is model-agnostic: the main recommendation model can be any representative architecture, including Transformer-based models for modeling user behavior sequences (Kang & McAuley, 2018; Zhou et al., 2020; Wu et al., 2020; de Souza Pereira Moreira et al., 2021). Then, we have the following bound for hypernetwork-based models:

Theorem 5.9. Let $\mathcal{F}$ be a function class of hypernetwork-based controllable learning methods defined by (1), where the control function is a three-layer FNN and the classifier is induced by a Transformer. Let Assumptions 4.1 and 4.2 hold, and let $D$ be a dataset of size $n$. Then, for the surrogate weighted Hamming loss, for any $0 < \delta < 1$, with probability at least $1-\delta$, the following holds for any $f^z \in \mathcal{F}$:
$$R(f^z) \le \hat{R}_D(f^z) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}} + \frac{24M}{\sqrt{n}} + \frac{192\mu_V\big(1 + \log_2(4en^2c^2\mu^2)\ln\frac{M\sqrt{n}}{\mu E}\big)\big(4B_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3\Lambda + 2^4\mu_{\mathrm{ctrl}}\Lambda B_F^3 R\big)}{\sqrt{n}},$$
where $B_F, B_{W_C}, B_{W_V}, B_{W_{QK}}, L_\sigma > 0$ are constants.

Proof. The proof idea is the same as for Theorem 5.8. According to Theorems 5.4 and 5.6, $2\rho B$ and $C$ in Theorem 5.2 correspond to $4B_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3$ and $2^3\Lambda B_F^3 R$, respectively. Then, according to Theorem 5.2, $A := 2\rho B$. However, the introduction of the hypernetwork class leads to the additional last term in Theorem 5.2; hence, for hypernetwork-based methods, $\Lambda A$ in Theorem 4.7 should actually be $2\rho\Lambda B + 2\mu_{\mathrm{ctrl}} C$. Finally, the desired bound is obtained by replacing $\Lambda A$ in Theorem 4.7 with $4B_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3\Lambda + 2^4\mu_{\mathrm{ctrl}}\Lambda B_F^3 R$.

Remark 5.10. Theorems 5.8 and 5.9 show that the order of the bounds for specific embedding-based and hypernetwork-based methods is $\tilde{O}(1/\sqrt{n})$. However, their constants differ: the more complex constant term in Theorem 5.8 corresponds to $C$ in Theorem 5.1, while the more complex constant term in Theorem 5.9 corresponds to the first term in Theorem 5.2.
They show that, to improve model capacity, embedding-based methods often choose higher-complexity models for the control function, while for hypernetwork-based methods, since the main model is used for learning, the complexity of the model selected for the main model is higher than that of the hypernetwork. This improves the representation ability of the models, making it easier to learn hypotheses with better generalization in the class. These theoretical results are consistent with the actual methods and can explain the model choices in the methods cited in Subsection 5.3.

6. Discussion

To understand controllable learning more clearly and intuitively, we provide two real-world examples of controllable learning scenarios: 1) A news aggregator adjusts article rankings in real time using user-specified rules (e.g., "exclude gaming content today") to promote diverse topics. The control function modifies inputs/parameters to enforce filters while keeping recommendations relevant. 2) A trading algorithm adapts portfolio strategies based on real-time goals (e.g., "protect capital during market drops") to minimize losses and maximize risk-adjusted returns. The control function tunes parameters dynamically to balance objectives without retraining. These examples show how our framework enables systems to adapt inputs/parameters at test time to meet evolving task requirements (e.g., content preferences, market conditions) while ensuring generalization guarantees for targets like diversity and risk management.

Controllable learning and multi-label learning are different learning settings, but there are connections between them. Specifically, from the perspective of the model output, the outputs of both controllable learning and multi-label learning can be expressed as vector-valued outputs. From the perspective of the model itself and the input, they are different.
Controllable learning places more emphasis on task requirements, i.e., the learner can adaptively adjust to dynamically respond to the requirements of different task targets, while multi-label learning does not involve task requirements corresponding to different task targets; hence controllable learning is not a special case of multi-label learning. However, when the control function $\psi$ in controllable learning is degenerate (i.e., it ignores the task requirement $z$), controllable learning degenerates into a specific multi-label learning method, namely multi-label learning based on the label-specific representation learning strategy (Zhang & Zhang, 2024b). In this case, the relevant theoretical results can be generalized to multi-label learning scenarios.

7. Conclusion

In this paper, we first establish a unified theoretical framework for controllable learning. Then, we propose a novel vector-contraction inequality and derive a tight bound, with no dependence on $c$, for general controllable learning classes. In addition, we derive general bounds for two typical controllable learning methods and develop modular theoretical results for commonly used control and prediction functions. In future work, we will extend the analysis to more controllable learning methods, e.g., controllable generation models, and derive theoretical results for a broader range of models to enrich our modular theoretical toolset.

Acknowledgements

The authors wish to thank the anonymous reviewers and the area chair for their helpful comments and suggestions, especially the area chair for his dedicated efforts and selfless assistance in clarifying all ambiguities, ensuring the soundness and enhancing the quality of the paper. This work was supported by the National Science Foundation of China (62225602, 62376275).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning.
There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

A16z. Big ideas in tech 2025: Knowledge work gets personalized. https://a16z.com/big-ideas-in-tech-2025/, 2025.

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

Artificial Intelligence Safety Summit. The Bletchley declaration. https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023, 2023.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30 (NIPS 2017):6240-6249, 2017.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1-17, 2019.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Chang, J., Zhang, C., Hui, Y., Leng, D., Niu, Y., Song, Y., and Gai, K. PEPNet: Parameter and embedding personalized network for infusing with personalized prior information. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3795-3804, 2023.

Chen, S., Wang, Y., Wen, Z., Li, Z., Zhang, C., Zhang, X., Lin, Q., Zhu, C., and Xu, J. Controllable multi-objective re-ranking with policy hypernetworks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3855-3864, 2023.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874, 2008.
de Souza Pereira Moreira, G., Rabhi, S., Lee, J. M., Ak, R., and Oldridge, E. Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems, pp. 143-153, 2021.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pp. 4171-4186, 2019.

Edelman, B. L., Goel, S., Kakade, S. M., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 5793-5831, 2022.

Foster, D. J. and Rakhlin, A. ℓ∞ vector contraction for Rademacher complexity. arXiv:1911.06468v1, 2019.

Galanti, T. and Wolf, L. On the modularity of hypernetworks. Advances in Neural Information Processing Systems, 33:10409-10419, 2020.

Gao, Z., Zhou, J., Dai, Y., and Joachims, T. End-to-end training for recommendation with language-based user profiles. arXiv:2410.18870, 2024.

Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. International Conference on Computational Learning Theory, 75 (COLT 2018):297-299, 2018.

Graf, F., Zeng, S., Rieck, B., Niethammer, M., and Kwitt, R. On measuring excess capacity in neural networks. Advances in Neural Information Processing Systems, 35, 2022.

He, Y., Zheng, S., Tay, Y., Gupta, J., Du, Y., Aribandi, V., Zhao, Z., Li, Y., Chen, Z., Metzler, D., et al. HyperPrompt: Prompt-based task-conditioning of transformers. In Proceedings of the 39th International Conference on Machine Learning, pp. 8678-8690, 2022.

Hieu, N. M., Ledent, A., Lei, Y., and Ku, C. Y. Generalization analysis for deep contrastive representation learning. arXiv:2412.12014v2, 2024.
Joachims, T. Keynote: Towards steerable AI systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024.

Kang, W.-C. and McAuley, J. Self-attentive sequential recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining, pp. 197-206, 2018.

Kong, L., Wang, H., Mu, W., Du, Y., Zhuang, Y., Zhou, Y., Song, Y., Zhang, R., Wang, K., and Zhang, C. Aligning large language models with representation editing: A control perspective. arXiv:2406.05954, 2024.

Ledent, A., Mustafa, W., Lei, Y., and Kloft, M. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, volume 35, pp. 8279-8287, 2021.

Lei, Y., Dogan, U., Zhou, D., and Kloft, M. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995-3021, 2019.

Lei, Y., Yang, T., Ying, Y., and Zhou, D. Generalization analysis for contrastive representation learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 19200-19227, 2023.

Li, X., Yan, F., Zhao, X., Wang, Y., Chen, B., Guo, H., and Tang, R. HAMUR: Hyper adapter for multi-domain recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 1268-1277, 2023.

Lust-Piquard, F. and Pisier, G. Non commutative Khintchine and Paley inequalities. Arkiv för Matematik, 29:241-260, 1991.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, volume 9925, pp. 3-17, 2016.

McDiarmid, C. et al. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.

Mustafa, W., Lei, Y., Ledent, A., and Kloft, M. Fine-grained generalization analysis of structured output prediction.
In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI 2021), pp. 2841-2847, 2021.

Mysore, S., Jasim, M., McCallum, A., and Zamani, H. Editable user profiles for controllable text recommendations. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 993-1003, 2023.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. International Conference on Computational Learning Theory, 40 (COLT 2015):1376-1401, 2015.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.

Penaloza, E., Gouvert, O., Wu, H., and Charlin, L. TEARS: Textual representations for scrutable recommendations. arXiv:2410.19302, 2024.

Shen, C., Zhang, X., Wei, W., and Xu, J. HyperBandit: Contextual bandit with hypernetwork for time-varying user preferences in streaming recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 2239-2248, 2023.

Shen, C., Zhao, J., Zhang, X., Yu, W., He, M., and Fan, J. Generating model parameters for controlling: Parameter diffusion for controllable multi-task recommendation. arXiv:2410.10639, 2024.

Shen, C., Zhang, X., Shi, T., Zhang, C., Xie, G., Xu, J., He, M., and Fan, J. A survey of controllable learning: Methods and applications in information retrieval. Frontiers of Computer Science (FCS), 2025. URL https://doi.org/10.1007/s11704-025-41366-5.

Shen, Q., Tao, W., Zhang, J., Wen, H., Chen, Z., and Lu, Q. SAR-Net: A scenario-aware ranking network for personalized fair recommendation in hundreds of travel scenarios. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4094-4103, 2021.

Srebro, N., Sridharan, K., and Tewari, A.
Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, volume 23, pp. 2199-2207, 2010.

Trauger, J. and Tewari, A. Sequence length independent norm-based generalization bounds for transformers. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, volume 238, pp. 1405-1413, 2024.

Wu, L., Li, S., Hsieh, C.-J., and Sharpnack, J. SSE-PT: Sequential recommendation via personalized transformer. In Proceedings of the 14th ACM Conference on Recommender Systems, pp. 328-337, 2020.

Wu, L., Ledent, A., Lei, Y., and Kloft, M. Fine-grained generalization analysis of vector-valued learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), pp. 10338-10346, 2021.

Xie, G., Zhang, X., Yao, T., and Shi, Y. Bone soups: A seek-and-soup model merging approach for controllable multi-objective generation. arXiv:2502.10762, 2025.

Yampolskiy, R. On controllability of artificial intelligence. In IJCAI-21 Workshop on Artificial Intelligence Safety (AISafety 2021), 2020.

Yan, B., Wang, P., Zhang, K., Li, F., Deng, H., Xu, J., and Zheng, B. APG: Adaptive parameter generation network for click-through rate prediction. Advances in Neural Information Processing Systems, 35, pp. 24740-24752, 2022.

Zhang, Y. and Zhang, M. Nearly-tight bounds for deep kernel learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 41861-41879, 2023.

Zhang, Y. and Zhang, M. Generalization analysis for multi-label learning. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), 2024a.

Zhang, Y.-F. and Zhang, M.-L. Generalization analysis for label-specific representation learning. Advances in Neural Information Processing Systems, 37:104904-104933, 2024b.

Zhou, K., Wang, H., Zhao, W. X., Zhu, Y., Wang, S., Zhang, F., Wang, Z., and Wen, J.-R.
S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1893-1902, 2020.

A. Appendix

A.1. Appendix Outline

In the appendix, we give the detailed proofs of our theoretical results in the main paper. Our main proofs include:
- The $\ell_\infty$ Lipschitz continuity of the commonly used losses in controllable learning (Proposition 4.3).
- The novel vector-contraction inequality for the $\ell_\infty$ Lipschitz surrogate loss (Lemma 4.4).
- The tight bound for the general controllable learning class under the surrogate weighted Hamming loss (Theorem 4.7).
- The tight bound for the general controllable learning class under the surrogate multi-target bipartite ranking loss (Theorem 4.9).
- The upper bound of the Rademacher complexity for embedding-based controllable learning methods (Theorem 5.1).
- The upper bound of the Rademacher complexity for hypernetwork-based controllable learning methods (Theorem 5.2).
- The upper bound of the Rademacher complexity for the FNN (Theorem 5.4).
- The upper bound of the Rademacher complexity for the Transformer (Theorem 5.6).

A.2. Preliminaries

A.2.1. DEFINITIONS OF THE CORRESPONDING COMPLEXITY MEASURES

Definition A.1 ($\ell_\infty$ norm covering number). Let $\mathcal{F}$ be a class of real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$, and let $D = \{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. samples. For any $\epsilon > 0$, the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, \mathcal{F}, D)$ w.r.t. $D$ is defined as the minimal number $m$ of a collection of vectors $v^1, \ldots, v^m \in \mathbb{R}^n$ such that for every $f \in \mathcal{F}$ there exists $j \in [m]$ with $\max_{i \in [n]} |f(x_i) - v^j_i| \le \epsilon$ ($v^j_i$ is the $i$-th component of the vector $v^j$). In this case, we call $v^1, \ldots, v^m$ an $(\epsilon, \ell_\infty)$-cover of $\mathcal{F}$ with respect to $D$. We also define $\mathcal{N}_\infty(\epsilon, \mathcal{F}, n) = \sup_D \mathcal{N}_\infty(\epsilon, \mathcal{F}, D)$.

Definition A.2 (Fat-shattering dimension). Let $\mathcal{F}$ be a class of real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$.
We define the fat-shattering dimension $\mathrm{fat}_\epsilon(\mathcal{F})$ at scale $\epsilon > 0$ as the largest $p \in \mathbb{N}$ such that there exist $p$ points $x_1, \ldots, x_p \in \mathcal{X}$ and witnesses $s_1, \ldots, s_p \in \mathbb{R}$ satisfying: for any $\delta_1, \ldots, \delta_p \in \{-1, +1\}$ there exists $f \in \mathcal{F}$ with $\delta_i (f(x_i) - s_i) \ge \epsilon$, $i = 1, \ldots, p$.

A.2.2. THE BOUND FOR THE LOSS FUNCTION SPACE

According to McDiarmid's inequality (McDiarmid et al., 1989) and the symmetrization technique, it is easy to obtain that for any training dataset $D = \{(x_i, y_i) : i \in [n]\}$ and any loss with $\ell(\cdot, \cdot) \le M$, with probability at least $1 - \delta$, the following holds:
$$R(\ell) \le \hat{R}_D(\ell) + 2\hat{\mathfrak{R}}_D(\mathcal{L}) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (3)$$

A.3. General Bounds for Controllable Learning

A.3.1. PROOF OF PROPOSITION 4.3

We first prove that the surrogate weighted Hamming loss is $\mu_V$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big|\ell_W(f^z(x), y) - \ell_W(f^{z\prime}(x), y)\big| &= \Big|\frac{1}{c}\sum_{j=1}^{c} v_j \ell_j(f^z_j(x), y_j) - \frac{1}{c}\sum_{j=1}^{c} v_j \ell_j(f^{z\prime}_j(x), y_j)\Big| \\
&\le \frac{1}{c}\sum_{j=1}^{c} v_j \big|\ell_j(f^z_j(x), y_j) - \ell_j(f^{z\prime}_j(x), y_j)\big| \\
&\le \frac{V}{c}\sum_{j=1}^{c} \mu \big|f^z_j(x) - f^{z\prime}_j(x)\big| \le \frac{\mu V c}{c} \max_{j \in [c]} \big|f^z_j(x) - f^{z\prime}_j(x)\big| \\
&= \mu_V \big\|f^z(x) - f^{z\prime}(x)\big\|_\infty.
\end{aligned}$$

Then, we prove that the surrogate multi-target bipartite ranking loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big|\ell_R(f^z(x_i, x'_i), y) - \ell_R(f^{z\prime}(x_i, x'_i), y)\big| &= \Big|\frac{1}{c}\sum_{j=1}^{c} \ell_j\big(f^z_j(x_i) - f^z_j(x'_i)\big) - \frac{1}{c}\sum_{j=1}^{c} \ell_j\big(f^{z\prime}_j(x_i) - f^{z\prime}_j(x'_i)\big)\Big| \\
&\le \frac{1}{c}\sum_{j=1}^{c} \big|\ell_j\big(f^z_j(x_i) - f^z_j(x'_i)\big) - \ell_j\big(f^{z\prime}_j(x_i) - f^{z\prime}_j(x'_i)\big)\big| \\
&\le \frac{1}{c}\sum_{j=1}^{c} \mu \big|\big(f^z_j(x_i) - f^z_j(x'_i)\big) - \big(f^{z\prime}_j(x_i) - f^{z\prime}_j(x'_i)\big)\big| \\
&\le \frac{\mu c}{c} \max_{j \in [c]} \big|\big(f^z_j(x_i) - f^z_j(x'_i)\big) - \big(f^{z\prime}_j(x_i) - f^{z\prime}_j(x'_i)\big)\big| \\
&= \mu \big\|f^z(x_i, x'_i) - f^{z\prime}(x_i, x'_i)\big\|_\infty.
\end{aligned}$$

A.3.2. PROOF OF LEMMA 4.4

Proof Sketch: First, with the refined Dudley entropy integral inequality (Ledent et al., 2021), the Rademacher complexity of the loss function space $\mathcal{L}$ can be bounded by the empirical $\ell_\infty$ norm covering number. Second, according to the Lipschitz continuity w.r.t. the $\ell_\infty$ norm, the empirical $\ell_\infty$ norm covering number of $\mathcal{F}$ can be bounded by that of $\mathcal{P}(\mathcal{F})$. Third, the empirical $\ell_\infty$ norm covering number of $\mathcal{P}(\mathcal{F})$ can be bounded by the worst-case Rademacher complexity of $\mathcal{P}(\mathcal{F})$ through the fat-shattering dimension.
Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of $\mathcal{P}(\mathcal{F})$, and, combined with the above steps, the Rademacher complexity of the loss function space $\mathcal{L}$ associated with $\mathcal{F}$ can be bounded.

We first introduce the following lemmas:

Lemma A.3 (Khintchine-Kahane inequality (Lust-Piquard & Pisier, 1991)). Let $v_1, \ldots, v_n \in H$, where $H$ is a Hilbert space with $\|\cdot\|$ the associated norm, and let $\epsilon_1, \ldots, \epsilon_n$ be a sequence of independent Rademacher variables. Then, for any $1 \le p \le 2$,
$$\frac{1}{\sqrt{2}}\Big(\sum_{i=1}^{n} \|v_i\|^2\Big)^{1/2} \le \Big(\mathbb{E}_\epsilon\Big\|\sum_{i=1}^{n} \epsilon_i v_i\Big\|^p\Big)^{1/p} \le \Big(\sum_{i=1}^{n} \|v_i\|^2\Big)^{1/2}.$$

Lemma A.4 (Lemma A.2 in (Srebro et al., 2010)). For any function class $\mathcal{F}$, any sample $S$ of size $n$, and any $\epsilon > \hat{\mathfrak{R}}_S(\mathcal{F})$, we have $\mathrm{fat}_\epsilon(\mathcal{F}) \le \frac{4n\,\hat{\mathfrak{R}}^2_S(\mathcal{F})}{\epsilon^2}$.

Lemma A.5 (Theorem 12.8 in (Anthony & Bartlett, 2009); (Lei et al., 2023)). If every function in the class $\mathcal{F}$ takes values in $[-B, B]$, then for any sample $S$ of size $n$ and any $\epsilon > 0$ with $\mathrm{fat}_\epsilon(\mathcal{F}) < n$, we have
$$\log \mathcal{N}_\infty(\epsilon, \mathcal{F}, S) \le 1 + d \log_2\Big(\frac{4eBn}{d\epsilon}\Big)\log_2\Big(\frac{4nB^2}{\epsilon^2}\Big),$$
where $d = \mathrm{fat}_{\epsilon/4}(\mathcal{F})$.

Lemma A.6 (Refined Dudley entropy integral inequality (Ledent et al., 2021)). Let $\mathcal{F}$ be a real-valued function class with $\|f\|_\infty \le B$ for all $f \in \mathcal{F}$, $B > 0$, and assume that $0 \in \mathcal{F}$. Let $S$ be a finite sample of size $n$. For any $2 \le p \le \infty$, the Rademacher complexity $\hat{\mathfrak{R}}_S(\mathcal{F})$ and the covering number $\mathcal{N}_p(\epsilon, \mathcal{F}, S)$ satisfy
$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \inf_{\alpha > 0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{B}\sqrt{\log \mathcal{N}_p(\epsilon, \mathcal{F}, S)}\, d\epsilon\Big).$$

Step 1: We first derive the relationship between the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, \mathcal{L}, D)$ and the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, \mathcal{P}(\mathcal{F}), [c] \times D)$. For the dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $n$ i.i.d. examples:
$$\begin{aligned}
\max_i \big|\ell(f^z(x_i), y_i) - \ell(f^{z\prime}(x_i), y_i)\big| &\le \mu \max_i \big\|f^z(x_i) - f^{z\prime}(x_i)\big\|_\infty \qquad \text{(Assumption 4.2)} \\
&= \mu \max_i \max_j \big|f^z_j(x_i) - f^{z\prime}_j(x_i)\big| \\
&= \mu \max_i \max_j \big|p_j(f^z(x_i)) - p_j(f^{z\prime}(x_i))\big|.
\end{aligned}$$
(The last equality uses the definition of the projection function class $\mathcal{P}(\mathcal{F})$.) Then, according to the definition of the empirical $\ell_\infty$ covering number, an empirical $\ell_\infty$ cover of $\mathcal{P}(\mathcal{F})$ at radius $\epsilon/\mu$ is also an empirical $\ell_\infty$ cover of the loss function space $\mathcal{L}$ at radius $\epsilon$, and we can conclude that:
$$\mathcal{N}_\infty(\epsilon, \mathcal{L}, D) \le \mathcal{N}_\infty\Big(\frac{\epsilon}{\mu}, \mathcal{P}(\mathcal{F}), [c] \times D\Big). \qquad (4)$$

Step 2: We show that the empirical $\ell_\infty$ norm covering number of $\mathcal{P}(\mathcal{F})$ can be bounded by the fat-shattering dimension, which in turn can be bounded by the worst-case Rademacher complexity of $\mathcal{P}(\mathcal{F})$. According to Lemma A.4, for any $\epsilon > 2\hat{\mathfrak{R}}_{[c] \times D}(\mathcal{P}(\mathcal{F}))$, we have
$$\mathrm{fat}_\epsilon(\mathcal{P}(\mathcal{F})) \le \frac{4nc\,\hat{\mathfrak{R}}^2_{[c] \times D}(\mathcal{P}(\mathcal{F}))}{\epsilon^2}.$$
Then, combining with Lemma A.5, for any $\epsilon \in (0, 2E]$, we have
$$\begin{aligned}
\log \mathcal{N}_\infty(\epsilon, \mathcal{P}(\mathcal{F}), [c] \times D) &\le 1 + \mathrm{fat}_{\epsilon/4}(\mathcal{P}(\mathcal{F}))\,\log_2^2\Big(\frac{8eE^2nc}{\epsilon^2}\Big) \\
&\le 1 + \frac{64nc\,\hat{\mathfrak{R}}^2_{[c] \times D}(\mathcal{P}(\mathcal{F}))}{\epsilon^2}\,\log_2^2\Big(\frac{8eE^2nc}{\epsilon^2}\Big) \\
&\le 1 + \frac{64nc\,\widetilde{\mathfrak{R}}^2_{nc}(\mathcal{P}(\mathcal{F}))}{\epsilon^2}\,\log_2^2\Big(\frac{8eE^2nc}{\epsilon^2}\Big). \qquad (5)
\end{aligned}$$

Step 3: According to Assumption 4.1 in the main paper, we can obtain a lower bound on the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(\mathcal{P}(\mathcal{F}))$ by the Khintchine-Kahane inequality with $p = 1$:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(\mathcal{P}(\mathcal{F})) &= \sup_{[c] \times D \in [c] \times \mathcal{X}^n} \hat{\mathfrak{R}}_{[c] \times D}(\mathcal{P}(\mathcal{F})) \\
&= \sup_{[c] \times D \in [c] \times \mathcal{X}^n} \frac{1}{nc}\,\mathbb{E}_\epsilon \sup_{p_j(f^z(x_i)) \in \mathcal{P}(\mathcal{F})} \sum_{i=1}^{n}\sum_{j=1}^{c} \epsilon_{ij}\, p_j(f^z(x_i)) \\
&= \sup_{[c] \times D \in [c] \times \mathcal{X}^n} \frac{1}{nc}\,\mathbb{E}_\epsilon \sup_{f^z_j \in \mathcal{F}_j} \sum_{i=1}^{n}\sum_{j=1}^{c} \epsilon_{ij}\, f^z_j(x_i) \\
&= \sup_{\|\zeta_j(\phi_j(x_i, \psi_j(z)))\|_2 \le A:\, i \in [n],\, j \in [c]} \frac{1}{nc}\,\mathbb{E}_\epsilon \sup \sum_{i=1}^{n}\sum_{j=1}^{c} \epsilon_{ij}\, \langle w_j, \zeta_j(\phi_j(x_i, \psi_j(z))) \rangle \\
&= \sup_{\|\zeta_j(\phi_j(x_i, \psi_j(z)))\|_2 \le A:\, i \in [n],\, j \in [c]} \frac{\Lambda}{nc}\,\mathbb{E}_\epsilon \sum_{j=1}^{c}\Big\|\sum_{i=1}^{n} \epsilon_{ij}\, \zeta_j(\phi_j(x_i, \psi_j(z)))\Big\|_2 \\
&\ge \sup_{\|\zeta_j(\phi_j(x_i, \psi_j(z)))\|_2 \le A:\, i \in [n],\, j \in [c]} \frac{\Lambda}{\sqrt{2}\,nc}\Big[\sum_{i=1}^{n}\sum_{j=1}^{c} \big\|\zeta_j(\phi_j(x_i, \psi_j(z)))\big\|_2^2\Big]^{1/2}.
\end{aligned}$$
(Use Lemma A.3.) Since $\|\zeta_j(\phi_j(x_i, \psi_j(z)))\|_2 \le A$, the supremum is attained by taking $\|\zeta_j(\phi_j(x_i, \psi_j(z)))\|_2 = A$ for all $i \in [n], j \in [c]$, so that
$$\sup_{\|\zeta_j(\phi_j(x_i,\psi_j(z)))\|_2 \le A:\, i\in[n], j\in[c]} \frac{1}{nc}\Big[\sum_{i=1}^n\sum_{j=1}^c \|\zeta_j(\phi_j(x_i, \psi_j(z)))\|_2^2\Big]^{\frac12} = \frac{A}{\sqrt{nc}},$$
and hence
$$\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F})) \ge \frac{\Lambda A}{\sqrt{2nc}}.$$

Step 4: According to Lemma A.6 and combined with the above steps, we have
$$\hat\Re_D(\mathcal{L}) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt n}\int_\alpha^M \sqrt{\log_2 \mathcal{N}_\infty(\epsilon, \mathcal{L}, D)}\,\mathrm{d}\epsilon\Big) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt n}\int_\alpha^M \sqrt{\log_2 \mathcal{N}_\infty\big(\tfrac\epsilon\mu, \mathcal{P}(\mathcal{F}), D^{[c]}\big)}\,\mathrm{d}\epsilon\Big) \quad (\text{use inequality (4)})$$
$$\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt n}\int_\alpha^M \sqrt{1 + \frac{64nc\mu^2\widetilde\Re^2_{nc}(\mathcal{P}(\mathcal{F}))}{\epsilon^2}\log_2^2\Big(\frac{8eE^2nc\mu^2}{\epsilon^2}\Big)}\,\mathrm{d}\epsilon\Big) \quad (\text{use inequality (5)})$$
$$\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt n}\int_\alpha^M \Big(1 + \frac{8\mu\sqrt{nc}\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))}{\epsilon}\log_2\big(4en^2c^2\mu^2\big)\Big)\mathrm{d}\epsilon\Big)$$
$$\le \frac{12M}{\sqrt n} + \inf_{\alpha>0}\Big(4\alpha + 96\mu\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))\log_2(4en^2c^2\mu^2)\int_\alpha^M \frac{\mathrm d\epsilon}{\epsilon}\Big)$$
$$= \frac{12M}{\sqrt n} + \inf_{\alpha>0}\Big(4\alpha + 96\mu\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))\log_2(4en^2c^2\mu^2)\ln\frac{M}{\alpha}\Big)$$
$$\le \frac{12M}{\sqrt n} + 96\mu\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))\Big(1 + \log_2(4en^2c^2\mu^2)\ln\frac{M}{24\mu\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))}\Big) \quad \big(\text{choose } \alpha = 24\mu\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))\big)$$
$$\le \frac{12M}{\sqrt n} + 96\mu\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))\Big(1 + \log_2(4en^2c^2\mu^2)\ln\frac{M\sqrt n}{\mu E}\Big),$$
where the last step uses the lower bound $\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F})) \ge \frac{\Lambda A}{\sqrt{2nc}}$ from Step 3.

A.3.3. PROOF OF THEOREM 4.7

Proof Sketch: We upper bound the worst-case Rademacher complexity $\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))$, and then, combined with Lemma 4.4 and Proposition 4.3 (i.e., substituting $\mu_V$ for $\mu$ in Lemma 4.4), the desired bound can be derived. We upper bound $\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))$ as follows:
$$\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F})) = \sup_{D^{[c]}} \hat\Re_{D^{[c]}}(\mathcal{P}(\mathcal{F})) = \sup_{D^{[c]}} \mathbb{E}_\epsilon \sup_{p_j(f^z)\in\mathcal{P}(\mathcal{F})} \frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c \epsilon_{ij}\, p_j(f^z(x_i)) = \sup_{D^{[c]}} \mathbb{E}_\epsilon \sup_{f_j^z\in\mathcal{F}_j} \frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\, f_j^z(x_i)$$
$$= \sup_{\|\zeta_j(\phi_j(x_i,\psi_j(z)))\|_2\le A:\, i\in[n],j\in[c]} \mathbb{E}_\epsilon \sup_{\|w\|_2\le\Lambda} \frac{1}{nc}\sum_{i,j}\epsilon_{ij}\big\langle w_j,\, \zeta_j(\phi_j(x_i,\psi_j(z)))\big\rangle = \sup_{\|\zeta_j(\phi_j(x_i,\psi_j(z)))\|_2\le A:\, i\in[n],j\in[c]}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\zeta_j(\phi_j(x_i,\psi_j(z)))\Big\|_2$$
$$\le \sup \frac{\Lambda}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\zeta_j(\phi_j(x_i,\psi_j(z)))\Big\|_2^2\Big]^{\frac12} \quad (\text{use Jensen's inequality})$$
$$= \sup \frac{\Lambda}{nc}\Big[\sum_{i,j}\|\zeta_j(\phi_j(x_i,\psi_j(z)))\|_2^2\Big]^{\frac12} \le \frac{\Lambda A}{\sqrt{nc}}. \quad (\text{use Lemma A.3}) \quad (7)$$

Then, combining with Proposition 4.3, we have
$$\hat\Re_D(\mathcal{L}) \le \frac{12M}{\sqrt n} + 96\mu_V\sqrt c\,\widetilde\Re_{nc}(\mathcal{P}(\mathcal{F}))\Big(1+\log_2(4en^2c^2\mu_V^2)\ln\frac{M\sqrt n}{\mu_V E}\Big) \le \frac{12M}{\sqrt n} + \frac{96\mu_V\Lambda A}{\sqrt n}\Big(1+\log_2(4en^2c^2\mu_V^2)\ln\frac{M\sqrt n}{\mu_V E}\Big).$$
Combining with (3), then
$$R(f^z) \le \widehat R_D(f^z) + \frac{24M}{\sqrt n} + \frac{192\mu_V\Lambda A}{\sqrt n}\Big(1+\log_2(4en^2c^2\mu_V^2)\ln\frac{M\sqrt n}{\mu_V E}\Big) + 3M\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

A.3.4.
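The choice of $\alpha$ in Step 4 is, up to the logarithmic factor, the minimizer of the map $\alpha \mapsto 4\alpha + K\ln(M/\alpha)$: setting the derivative $4 - K/\alpha$ to zero gives $\alpha^* = K/4$. A small numeric check with hypothetical constants $K$, $M$ (these values are illustrative, not from the paper):

```python
import math

def g(alpha, K, M):
    # The objective minimized when choosing alpha in Step 4.
    return 4 * alpha + K * math.log(M / alpha)

K, M = 0.3, 10.0            # hypothetical constants
alpha_star = K / 4          # stationary point of the convex objective g
for k in range(1, 2000):
    alpha = k * M / 2000    # grid over (0, M)
    assert g(alpha_star, K, M) <= g(alpha, K, M) + 1e-12
```

Since $g''(\alpha) = K/\alpha^2 > 0$, the objective is convex and $\alpha^* = K/4$ is the global minimizer, which matches the constant $24 = 96/4$ in the chosen $\alpha$.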
PROOF OF THEOREM 4.9

Proof Sketch: First, for the surrogate multi-target bipartite ranking loss, using the U-process technique, we define the empirical Rademacher complexity of the loss function space associated with the controllable learning class $\mathcal F$ over the set of i.i.d. disjoint positive-negative sample pairs for each $j$-th label; then, with the two-sided multiplicative Chernoff bound, the generalization error can be bounded by
$$\hat\Re_{D_0}(\mathcal{L}) = \mathbb{E}_\epsilon\Big[\sup_{f^z\in\mathcal F}\frac{1}{s_0}\sum_{i=1}^{s_0}\frac1c\sum_{j=1}^c \epsilon_i\, \ell_j\big(f_j^z(x_i) - f_j^z(x_i')\big)\Big],$$
where $|D_0| = s_0$ is the number of positive-negative sample pairs that can be constructed for any label. Second, combining with Lemma 4.4 and Proposition 4.3, we have
$$\hat\Re_{D_0}(\mathcal{L}) \le \frac{12M}{\sqrt{s_0}} + 96\mu\sqrt c\,\widetilde\Re_{s_0c}(\mathcal{P}(\mathcal{F}))\Big(1+\log_2(4es_0^2c^2\mu^2)\ln\frac{M\sqrt{s_0}}{\mu E}\Big).$$
Finally, we upper bound the worst-case Rademacher complexity $\widetilde\Re_{s_0c}(\mathcal{P}(\mathcal{F})) \le \frac{2\Lambda A}{\sqrt{s_0 c}}$, and the desired bound can be derived.

According to the definitions of bipartite ranking problems in controllable learning in Subsection 3.2, for the surrogate multi-target bipartite ranking loss, we have
$$R(f^z) - \widehat R_D(f^z) = \sum_j p_j R(f^z|j) - \sum_j \frac{t_j}{n}\widehat R_D(f^z|j) = \sum_j p_j R(f^z|j) - \sum_j p_j \widehat R_D(f^z|j) + \sum_j p_j \widehat R_D(f^z|j) - \sum_j \frac{t_j}{n}\widehat R_D(f^z|j)$$
$$= \sum_j p_j\big(R(f^z|j) - \widehat R_D(f^z|j)\big) + \sum_j \Big(p_j - \frac{t_j}{n}\Big)\widehat R_D(f^z|j) \le \sum_j p_j\big(R(f^z|j) - \widehat R_D(f^z|j)\big) + M\sum_j \Big|p_j - \frac{t_j}{n}\Big|. \quad (8)$$

Rademacher complexity has proved to be a powerful data-dependent measure of hypothesis space complexity. However, since the surrogate multi-target bipartite ranking loss involves pairwise functions, a sequence of pairs of i.i.d. individual observations in (2) is no longer independent, which makes the standard techniques for traditional Rademacher complexity in the i.i.d. case inapplicable. We convert the non-sum-of-i.i.d. pairwise function into a sum-of-i.i.d. form by using permutations in the U-process (Clémençon et al., 2008).
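The proof sketch pairs positives with negatives per label, yielding $s_j = \min\{t_j, u_j\}$ disjoint pairs for the $j$-th label. A minimal sketch of this construction from a hypothetical binary label matrix (the data here are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
n, c = 200, 4
Y = rng.random((n, c)) < 0.4        # hypothetical n x c binary label matrix

# For each label j, pair positives with negatives to get s_j = min(t_j, u_j)
# disjoint (positive, negative) index pairs, as used to define R_hat_{D_j}(L).
pairs = {}
for j in range(c):
    pos = np.flatnonzero(Y[:, j])   # t_j positive indices
    neg = np.flatnonzero(~Y[:, j])  # u_j negative indices
    s_j = min(len(pos), len(neg))
    pairs[j] = list(zip(pos[:s_j], neg[:s_j]))

for j, pj in pairs.items():
    plus = [a for a, _ in pj]
    minus = [b for _, b in pj]
    # Each sample index appears in at most one pair on its side: disjointness.
    assert len(set(plus)) == len(plus) and len(set(minus)) == len(minus)
    assert len(pj) == min(Y[:, j].sum(), (~Y[:, j]).sum())
```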
For each task target, we define the empirical Rademacher complexity of the loss function space associated with the controllable learning class $\mathcal F$ over the set of i.i.d. disjoint positive-negative sample pairs for the $j$-th label as follows:
$$\hat\Re_{D_j}(\mathcal L) = \mathbb E_{\epsilon|j}\Big[\sup_{f^z\in\mathcal F}\frac{1}{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big] = \mathbb E_{\epsilon|j}\Big[\sup_{f^z\in\mathcal F}\frac{1}{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_j\big(f_j^z(x_i^{j+}) - f_j^z(x_i^{j-})\big)\Big],$$
where each $\epsilon_i$ is an independent Rademacher random variable. The corresponding expected Rademacher complexity is defined as $\Re_{s_j}(\mathcal L) = \mathbb E_{D|j}\,\hat\Re_{D_j}(\mathcal L)$. We then prove the following lemma:

Lemma A.7. Let $q_\tau: \mathcal X \times \mathcal X \to \mathbb R$ be real-valued functions indexed by $\tau \in T$, where $T$ is some set. If $x_1,\ldots,x_t$ and $x_1',\ldots,x_u'$ are i.i.d. and $s = \min\{t, u\}$, then for any convex non-decreasing function $\psi$,
$$\mathbb E\,\psi\Big(\sup_{\tau\in T}\frac{1}{tu}\sum_{i=1}^t\sum_{j=1}^u q_\tau(x_i, x_j')\Big) \le \mathbb E\,\psi\Big(\sup_{\tau\in T}\frac1s\sum_{i=1}^s q_\tau(x_i, x_i')\Big).$$

Proof. The proof of this lemma is inspired by (Clémençon et al., 2008). Writing the two-sample average as an average over permutation pairs $(\pi, \pi')$ of $[t]$ and $[u]$,
$$\mathbb E\,\psi\Big(\sup_\tau \frac{1}{tu}\sum_{i=1}^t\sum_{j=1}^u q_\tau(x_i,x_j')\Big) = \mathbb E\,\psi\Big(\sup_\tau \frac{1}{t!\,u!}\sum_{\pi,\pi'}\frac1s\sum_{i=1}^s q_\tau\big(x_{\pi(i)}, x_{\pi'(i)}'\big)\Big)$$
$$\le \mathbb E\,\psi\Big(\frac{1}{t!\,u!}\sum_{\pi,\pi'}\sup_\tau\frac1s\sum_{i=1}^s q_\tau\big(x_{\pi(i)}, x_{\pi'(i)}'\big)\Big) \quad (\psi\ \text{is non-decreasing})$$
$$\le \frac{1}{t!\,u!}\sum_{\pi,\pi'}\mathbb E\,\psi\Big(\sup_\tau\frac1s\sum_{i=1}^s q_\tau\big(x_{\pi(i)}, x_{\pi'(i)}'\big)\Big) \quad (\text{Jensen's inequality})$$
$$= \mathbb E\,\psi\Big(\sup_\tau\frac1s\sum_{i=1}^s q_\tau(x_i, x_i')\Big),$$
where the last equality holds because the samples are i.i.d.

According to Lemma A.7 and the symmetrization technique, we can obtain
$$\mathbb E_{D|j}\,\psi\Big(\sup_{f^z\in\mathcal F}\big(R(f^z|j) - \widehat R_D(f^z|j)\big)\Big) \le \mathbb E_{D|j,\epsilon|j}\,\psi\Big(2\sup_{f^z\in\mathcal F}\frac{1}{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big). \quad (10)$$
We denote $\sup_{f^z\in\mathcal F}\frac{1}{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_R(f^z(x_i^{j+},x_i^{j-}))$ by $H_{D_j}^{\mathcal L}$.
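The first equality in the proof of Lemma A.7 — rewriting the full two-sample average as an average over permutation pairs of i.i.d.-block sums — is an exact identity, and can be verified by brute force for tiny $t, u$ (the pairwise function below is an arbitrary ranking-style placeholder):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=3)        # t = 3 "positive" samples
xp = rng.normal(size=2)       # u = 2 "negative" samples
t, u, s = len(x), len(xp), min(len(x), len(xp))

def q(a, b):                  # an arbitrary pairwise (ranking-style) function
    return float(a - b > 0)

# Left side: the full two-sample U-statistic over all t*u pairs.
full = sum(q(a, b) for a in x for b in xp) / (t * u)

# Right side: average over all permutation pairs of the i.i.d.-block average.
vals = [sum(q(x[p[i]], xp[pp[i]]) for i in range(s)) / s
        for p in itertools.permutations(range(t))
        for pp in itertools.permutations(range(u))]
assert abs(full - sum(vals) / len(vals)) < 1e-12
```

Each pair $(i, j)$ appears in position $k \le s$ for the same number of permutation pairs, which is why the two averages coincide exactly.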
The details of inequality (10) are as follows:
$$\mathbb E_{D|j}\,\psi\Big(\sup_{f^z}\big(R(f^z|j) - \widehat R_D(f^z|j)\big)\Big) = \mathbb E_{D|j}\,\psi\Big(\sup_{f^z}\Big(R(f^z|j) - \frac{1}{|X_j^+||X_j^-|}\sum_{x_i^{j+}\in X_j^+}\sum_{x_k^{j-}\in X_j^-}\ell_R\big(f^z(x_i^{j+}, x_k^{j-})\big)\Big)\Big)$$
$$= \mathbb E_{D|j}\,\psi\Big(\sup_{f^z}\frac{1}{t_j!\,u_j!}\sum_{\pi^{j+},\pi^{j-}}\frac{1}{s_j}\sum_{i=1}^{s_j}\Big(R(f^z|j) - \ell_R\big(f^z(x_{\pi^{j+}(i)}^{j+}, x_{\pi^{j-}(i)}^{j-})\big)\Big)\Big)$$
$$\le \mathbb E_{D|j}\,\psi\Big(\frac{1}{t_j!\,u_j!}\sum_{\pi^{j+},\pi^{j-}}\sup_{f^z}\frac{1}{s_j}\sum_{i=1}^{s_j}\Big(R(f^z|j) - \ell_R\big(f^z(x_{\pi^{j+}(i)}^{j+}, x_{\pi^{j-}(i)}^{j-})\big)\Big)\Big) \quad (\psi\ \text{is non-decreasing})$$
$$\le \frac{1}{t_j!\,u_j!}\sum_{\pi^{j+},\pi^{j-}}\mathbb E_{D|j}\,\psi\Big(\sup_{f^z}\frac{1}{s_j}\sum_{i=1}^{s_j}\Big(R(f^z|j) - \ell_R\big(f^z(x_{\pi^{j+}(i)}^{j+}, x_{\pi^{j-}(i)}^{j-})\big)\Big)\Big) \quad (\text{Jensen's inequality})$$
$$= \mathbb E_{D|j}\,\psi\Big(\sup_{f^z}\frac{1}{s_j}\sum_{i=1}^{s_j}\Big(\mathbb E_{D'|j}\,\ell_R\big(f^z(x_i'^{j+}, x_i'^{j-})\big) - \ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big)\Big) \quad (D'|j\ \text{is a sample set with the same distribution as } D|j)$$
$$\le \mathbb E_{D|j,D'|j}\,\psi\Big(\sup_{f^z}\frac{1}{s_j}\sum_{i=1}^{s_j}\Big(\ell_R\big(f^z(x_i'^{j+}, x_i'^{j-})\big) - \ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big)\Big) \quad (\text{Jensen's inequality})$$
$$= \mathbb E_{D|j,D'|j,\epsilon|j}\,\psi\Big(\sup_{f^z}\frac{1}{s_j}\sum_{i=1}^{s_j}\epsilon_i\Big(\ell_R\big(f^z(x_i'^{j+}, x_i'^{j-})\big) - \ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big)\Big)$$
$$\le \frac12\,\mathbb E_{D'|j,\epsilon|j}\,\psi\Big(2\sup_{f^z}\frac1{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_R\big(f^z(x_i'^{j+}, x_i'^{j-})\big)\Big) + \frac12\,\mathbb E_{D|j,\epsilon|j}\,\psi\Big(2\sup_{f^z}\frac1{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big)$$
$$= \mathbb E_{D|j,\epsilon|j}\,\psi\Big(2\sup_{f^z}\frac1{s_j}\sum_{i=1}^{s_j}\epsilon_i\,\ell_R\big(f^z(x_i^{j+}, x_i^{j-})\big)\Big) \quad (D'|j\ \text{has the same distribution as } D|j).$$

First, since the maximum difference caused by replacing one element in $D_j$ or one $\epsilon_i|j$ is $\frac{2M}{s_j}$, according to McDiarmid's inequality, we have
$$\mathbb P\big(|H_{D_j}^{\mathcal L} - \Re_{s_j}(\mathcal L)| \ge \varepsilon\big) \le 2e^{-\frac{\varepsilon^2 s_j}{2M^2}}.$$
Then, according to the tail bound for sub-Gaussian random variables and Theorem 2.1 in (Boucheron et al., 2013), $H_{D_j}^{\mathcal L}$ is a sub-Gaussian random variable with variance proxy $\frac{16M^2}{s_j}$. By the definition of a sub-Gaussian random variable, we have
$$\mathbb E_{D|j,\epsilon|j}\,e^{t H_{D_j}^{\mathcal L}} \le e^{t\Re_{s_j}(\mathcal L) + \frac{8t^2M^2}{s_j}}, \quad \forall t > 0. \quad (11)$$
Then, for any $\varepsilon > 0$, we have
$$\mathbb P\Big(\sup_{f^z}\big(R(f^z|j) - \widehat R_D(f^z|j)\big) \ge \varepsilon\Big) = \mathbb P\Big(e^{t\sup_{f^z}(R(f^z|j) - \widehat R_D(f^z|j))} \ge e^{t\varepsilon}\Big) \le \frac{\mathbb E_{D|j}\,e^{t\sup_{f^z}(R(f^z|j)-\widehat R_D(f^z|j))}}{e^{t\varepsilon}} \quad (\text{use Markov's inequality})$$
$$\le \frac{\mathbb E_{D|j,\epsilon|j}\,e^{2tH_{D_j}^{\mathcal L}}}{e^{t\varepsilon}} \quad (\text{use inequality (10) with } \psi(x) = e^{tx}) \le \frac{e^{2t\Re_{s_j}(\mathcal L) + \frac{32t^2M^2}{s_j}}}{e^{t\varepsilon}} \quad (\text{use inequality (11)}).$$
We set $e^{2t\Re_{s_j}(\mathcal L) + \frac{32t^2M^2}{s_j} - t\varepsilon} = \delta$; then we have $\varepsilon = 2\Re_{s_j}(\mathcal L) + \frac{32tM^2}{s_j} + \frac{\ln(1/\delta)}{t}$. Optimizing over $t$, the term is bounded with probability at least $1-\delta$:
$$\sup_{f^z}\big(R(f^z|j) - \widehat R_D(f^z|j)\big) \le 2\Re_{s_j}(\mathcal L) + \frac{32tM^2}{s_j} + \frac{\ln(1/\delta)}{t} \le 2\Re_{s_j}(\mathcal L) + 8M\sqrt{\frac{2\ln(1/\delta)}{s_j}}. \quad (12)$$
Next, we upper bound the term $\sum_j |p_j - \frac{t_j}{n}|$. First, since $t_j \sim \mathrm{Binomial}(n, p_j)$, we have $\mathbb E[t_j] = np_j$.
With the two-sided multiplicative Chernoff bound, we have
$$\mathbb P\big(|t_j - np_j| \ge r\,np_j\big) \le 2e^{-\frac{r^2np_j}{3}}, \quad r\in(0,1).$$
Then, applying the union bound over the $c$ labels, we have
$$\mathbb P\Big(\exists\, j\in[c]: |t_j - np_j| \ge r\,np_j\Big) \le 2c\,e^{-\frac{r^2 n\min_j p_j}{3}}.$$
We set $2ce^{-\frac{r^2n\min_jp_j}{3}} = \delta$; then we have $r = \sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}$. Hence, the following holds with probability at least $1-\delta$:
$$\sum_j\Big|\frac{t_j}{n} - p_j\Big| \le \sum_j p_j\,r \le \sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}. \quad (13)$$
Since $\mathbb P\big(|\frac{t_j}{n} - p_j| \ge rp_j\big) \le \mathbb P\big(\bigcup_j\{|\frac{t_j}{n} - p_j| \ge rp_j\}\big)$, according to the proof process of inequality (13), the following also holds with probability at least $1-\delta$:
$$\Big|\frac{t_j}{n} - p_j\Big| \le p_j\sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}.$$
Solving the above inequality yields $t_j \ge np_j\big(1 - \sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}\big)$. Similarly, we have $|\frac{u_j}{n} - (1-p_j)| \le (1-p_j)\sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}$, and solving this inequality yields $u_j \ge n(1-p_j)\big(1 - \sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}\big)$. Hence, the following holds with probability at least $1-\delta$:
$$s_j = \min\{t_j, u_j\} \ge \min\Big\{np_j\Big(1-\sqrt{\tfrac{3\ln(2c/\delta)}{n\min_jp_j}}\Big),\; n(1-p_j)\Big(1-\sqrt{\tfrac{3\ln(2c/\delta)}{n\min_jp_j}}\Big)\Big\}. \quad (14)$$
In order to ensure that disjoint positive-negative sample pairs can be constructed for every $j$-th label, we need a lower bound on $\min_j\{s_j\}$, i.e., on the number of disjoint positive-negative sample pairs that can be constructed for every label. Since
$$s_j \ge \min_j\{s_j\} = \min_j\{\min\{t_j, u_j\}\} \ge \min\Big\{\min_j np_j\Big(1-\sqrt{\tfrac{3\ln(2c/\delta)}{n\min_jp_j}}\Big),\; \min_j n(1-p_j)\Big(1-\sqrt{\tfrac{3\ln(2c/\delta)}{n\min_jp_j}}\Big)\Big\}$$
$$\ge \min\Big\{\frac12 n\min_j p_j,\; \frac12 n\min_j(1-p_j)\Big\} \quad \Big(\text{assume that } n \ge \frac{12\ln\frac{2c}{\delta}}{\min_jp_j}\Big)$$
$$= \frac12 n\min\Big\{\min_j p_j,\; \min_j(1-p_j)\Big\} = \frac12 s_0 \quad \Big(\text{define } s_0 = n\min\{\min_j p_j, \min_j(1-p_j)\}\Big),$$
we have $s_j \ge \frac12 s_0$ with probability at least $1-\delta$.
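The two-sided multiplicative Chernoff bound used above can be checked exactly against the binomial distribution for small $n$ (an illustrative check; the $n$, $p$, $r$ values are arbitrary):

```python
import math

def binom_tail_two_sided(n, p, r):
    """Exact P(|X - n p| >= r n p) for X ~ Binomial(n, p)."""
    lo, hi = n * p * (1 - r), n * p * (1 + r)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1)
               if k <= lo or k >= hi)

# Two-sided multiplicative Chernoff: P(|X - np| >= r np) <= 2 exp(-r^2 n p / 3)
n, p = 60, 0.3
for r in (0.2, 0.5, 0.9):
    assert binom_tail_two_sided(n, p, r) <= 2 * math.exp(-r * r * n * p / 3)
```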
According to inequality (12) and the union bound, the following holds with probability at least $1-\delta$:
$$\sum_j p_j\big(R(f^z|j) - \widehat R_D(f^z|j)\big) \le \sum_j p_j\Big(2\Re_{s_j}(\mathcal L) + 8M\sqrt{\frac{2\ln(2c/\delta)}{s_j}}\Big). \quad (15)$$
Combining inequalities (8), (15), (13) and the union bound, the following holds with probability at least $1-\delta$:
$$R(f^z) - \widehat R_D(f^z) \le \sum_j p_j\big(R(f^z|j) - \widehat R_D(f^z|j)\big) + M\sum_j\Big|p_j - \frac{t_j}{n}\Big| \le 2\sum_j p_j\Re_{s_j}(\mathcal L) + 8M\sum_j p_j\sqrt{\frac{2\ln(2c/\delta)}{s_j}} + M\sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}. \quad (16)$$
Next, we transform $\Re_{s_j}(\mathcal L)$ in inequality (16) into $\hat\Re_{D_j}(\mathcal L)$. According to McDiarmid's inequality, it is easy to obtain that the following holds with probability at least $1-\delta$:
$$\Re_{s_j}(\mathcal L) \le \hat\Re_{D_j}(\mathcal L) + M\sqrt{\frac{\ln(1/\delta)}{2s_j}}.$$
With the union bound, the following holds with probability at least $1-\delta$:
$$\sum_j p_j\Re_{s_j}(\mathcal L) \le \sum_j p_j\hat\Re_{D_j}(\mathcal L) + M\sum_j p_j\sqrt{\frac{\ln(c/\delta)}{2s_j}}. \quad (17)$$
Combining inequalities (16), (17), and the union bound, the following holds with probability at least $1-\delta$:
$$R(f^z) - \widehat R_D(f^z) \le 2\sum_j p_j\hat\Re_{D_j}(\mathcal L) + 2M\sum_j p_j\sqrt{\frac{\ln(2c/\delta)}{2s_j}} + 8M\sum_j p_j\sqrt{\frac{2\ln(2c/\delta)}{s_j}} + M\sqrt{\frac{3\ln(2c/\delta)}{n\min_jp_j}}$$
$$\le 2\sum_j p_j\hat\Re_{D_0}(\mathcal L) + 22M\sum_j p_j\sqrt{\frac{\ln(2c/\delta)}{s_0}} \quad \Big(\text{use } s_j \ge \frac12 s_0 \text{ and } n\min_jp_j \ge s_0\Big), \quad (18)$$
where
$$\hat\Re_{D_0}(\mathcal L) = \mathbb E_\epsilon\Big[\sup_{f^z\in\mathcal F}\frac1{s_0}\sum_{i=1}^{s_0}\epsilon_i\,\ell_R\big(f^z(x_i, x_i')\big)\Big] = \mathbb E_\epsilon\Big[\sup_{f^z\in\mathcal F}\frac1{s_0}\sum_{i=1}^{s_0}\frac1c\sum_{j=1}^c\epsilon_i\,\ell_j\big(f_j^z(x_i) - f_j^z(x_i')\big)\Big].$$
Then, according to Lemma 4.4 and Proposition 4.3, we have
$$\hat\Re_{D_0}(\mathcal L) \le \frac{12M}{\sqrt{s_0}} + 96\mu\sqrt c\,\widetilde\Re_{s_0c}(\mathcal P(\mathcal F))\Big(1 + \log_2(4es_0^2c^2\mu^2)\ln\frac{M\sqrt{s_0}}{\mu E}\Big).$$
We then upper bound the worst-case Rademacher complexity $\widetilde\Re_{s_0c}(\mathcal P(\mathcal F))$ as follows:
$$\widetilde\Re_{s_0c}(\mathcal P(\mathcal F)) = \sup_{D_0^{[c]}}\hat\Re_{D_0^{[c]}}(\mathcal P(\mathcal F)) = \sup_{D_0^{[c]}}\mathbb E_\epsilon\sup_{p_j(f^z)\in\mathcal P(\mathcal F)}\frac1{s_0c}\sum_{i=1}^{s_0}\sum_{j=1}^c\epsilon_{ij}\,p_j\big(f^z(x_i,x_i')\big) = \sup_{D_0^{[c]}}\mathbb E_\epsilon\sup_{f_j^z\in\mathcal F_j}\frac1{s_0c}\sum_{i=1}^{s_0}\sum_{j=1}^c\epsilon_{ij}\big(f_j^z(x_i) - f_j^z(x_i')\big)$$
$$= \sup_{\|\zeta_j(\phi_j(x_i,\psi_j(z)))\|_2\le A:\,i\in[s_0],j\in[c]}\mathbb E_\epsilon\sup_{\|w\|_2\le\Lambda}\frac1{s_0c}\sum_{i,j}\epsilon_{ij}\big\langle w_j,\,\zeta_j(\phi_j(x_i,\psi_j(z))) - \zeta_j(\phi_j(x_i',\psi_j(z)))\big\rangle$$
$$= \sup\frac{\Lambda}{s_0c}\,\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\big(\zeta_j(\phi_j(x_i,\psi_j(z))) - \zeta_j(\phi_j(x_i',\psi_j(z)))\big)\Big\|_2 \le 2\sup\frac{\Lambda}{s_0c}\,\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\zeta_j(\phi_j(x_i,\psi_j(z)))\Big\|_2 \quad (\text{triangle inequality})$$
$$\le 2\sup\frac{\Lambda}{s_0c}\Big[\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\zeta_j(\phi_j(x_i,\psi_j(z)))\Big\|_2^2\Big]^{\frac12} \quad (\text{use Jensen's inequality})$$
$$\le 2\sup\frac{\Lambda}{s_0c}\Big[\sum_{i,j}\big\|\zeta_j(\phi_j(x_i,\psi_j(z)))\big\|_2^2\Big]^{\frac12} \le \frac{2\Lambda A}{\sqrt{s_0c}}.$$
(Use Lemma A.3.) This gives inequality (19). Then, we have
$$\hat\Re_{D_0}(\mathcal L) \le \frac{12M}{\sqrt{s_0}} + \frac{192\mu\Lambda A}{\sqrt{s_0}}\Big(1 + \log_2(4es_0^2c^2\mu^2)\ln\frac{M\sqrt{s_0}}{\mu E}\Big).$$
Combining with (18), then
$$R(f^z) \le \widehat R_D(f^z) + \frac{24\sqrt2\,Mq}{\sqrt{s_0}} + \frac{384\sqrt2\,q\mu\Lambda A}{\sqrt{s_0}}\Big(1 + \log_2(4es_0^2c^2\mu^2)\ln\frac{M\sqrt{s_0}}{\mu E}\Big) + 22Mq\sqrt{\frac{\ln(2c/\delta)}{s_0}},$$
where $q = \sum_{j=1}^c p_j$.

B. General Bounds for Typical Controllable Learning Methods

B.1. Proof of Theorem 5.1

Proof Sketch: First, according to the definition of the $\ell_2$ norm, the upper bound of $\|\phi_j(x_i, \psi_j(z))\|_2$ is $R + C$ for any $i\in[n], j\in[c]$. Then, using Jensen's inequality and the Khintchine-Kahane inequality, the desired bound can be derived.

First, according to the definitions in Subsection 5.1 and the $\ell_2$ norm, we have
$$\phi_j(x, \psi_j(z)) := \mathrm{Concate}(x, \psi_j(z)) := \mathrm{Concate}(x, a_j) = [x_1, \ldots, x_d, a_1^j, \ldots, a_{d'}^j]^\top,$$
so that
$$\|\phi_j(x, \psi_j(z))\|_2 = \sqrt{\sum_{i=1}^d|x_i|^2 + \sum_{i'=1}^{d'}|a_{i'}^j|^2} \le \sqrt{\|x\|_2^2 + \|\psi_j(z)\|_2^2} \le \|x\|_2 + \|\psi_j(z)\|_2.$$
Since $\|x_i\|_2 \le R$ and $\|\psi_j(z)\|_2 \le C$ for any $i\in[n], j\in[c]$, we have that for any $i\in[n], j\in[c]$:
$$\|\phi_j(x_i, \psi_j(z))\|_2 \le R + C. \quad (20)$$
According to the proof of Theorem 4.7 (i.e., the first five equations in (7)), we have
$$\widetilde\Re_{nc}(\mathcal P(\mathcal F)) = \sup_{\|\zeta_j(\phi_j(x_i,\psi_j(z)))\|_2\le A:\,i\in[n],j\in[c]}\frac{\Lambda}{nc}\,\mathbb E_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\phi_j(x_i,\psi_j(z)))\Big\|_2. \quad (21)$$
Then, we upper bound the worst-case Rademacher complexity $\widetilde\Re_{nc}(\mathcal P(\mathcal F))$ as follows:
$$\widetilde\Re_{nc}(\mathcal P(\mathcal F)) \le \sqrt2\rho\sup_{\|\phi_j(x_i,\psi_j(z))\|_2\le R+C:\,i\in[n],j\in[c]}\frac{\Lambda}{nc}\,\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\phi_j(x_i,\psi_j(z))\Big\|_2 \quad (\text{use that }\zeta_j\text{ is }\rho\text{-Lipschitz})$$
$$\le \sqrt2\rho\sup\frac{\Lambda}{nc}\Big[\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\phi_j(x_i,\psi_j(z))\Big\|_2^2\Big]^{\frac12} \quad (\text{use Jensen's inequality})$$
$$= \sqrt2\rho\sup\frac{\Lambda}{nc}\Big[\sum_{i,j}\|\phi_j(x_i,\psi_j(z))\|_2^2\Big]^{\frac12} \quad (\text{use Lemma A.3}) \le \frac{\sqrt2\rho\Lambda(R+C)}{\sqrt{nc}}. \quad (\text{use inequality (20)}) \quad (22)$$

B.2. Proof of Theorem 5.2

Proof Sketch: The hypernetwork-based controllable learning method is two-stage: in the first stage, the hypernetwork generates a vector of weights for each task target, and these generated parameter vectors are then used in the second-stage learning. Therefore, the corresponding whole function class is denoted as $\mathcal H = \mathcal F + \mathcal L_{\mathrm{ctrl}}\circ\mathcal G^{N_c}$.
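The norm computation behind inequality (20) — the squared norm of a concatenation is the sum of the squared norms of its parts — can be sanity-checked numerically (an illustrative sketch with arbitrary radii $R$, $C$):

```python
import numpy as np

rng = np.random.default_rng(4)
R, C = 2.0, 1.5
for _ in range(500):
    x = rng.normal(size=6); x *= R / max(R, np.linalg.norm(x))   # ||x||_2 <= R
    a = rng.normal(size=3); a *= C / max(C, np.linalg.norm(a))   # ||a||_2 <= C
    phi = np.concatenate([x, a])                                 # Concate(x, a_j)
    # ||phi||^2 = ||x||^2 + ||a||^2, hence ||phi|| <= ||x|| + ||a|| <= R + C
    assert abs(np.linalg.norm(phi)**2
               - (np.linalg.norm(x)**2 + np.linalg.norm(a)**2)) < 1e-9
    assert np.linalg.norm(phi) <= R + C + 1e-12
```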
Since the parameter vectors generated in the first stage are used as fixed parameters rather than as inputs in the second stage, in order to fully account for the capacity of the hypernetwork in the first stage, it is necessary and appropriate to define the whole function class as the sum $\mathcal F + \mathcal L_{\mathrm{ctrl}}\circ\mathcal G^{N_c}$ of the function classes corresponding to the models of the two stages. Upper bounding the worst-case Rademacher complexity of the whole function class is then transformed into upper bounding the worst-case Rademacher complexities $\widetilde\Re_{nc}(\mathcal P(\mathcal F))$ and $\widetilde\Re_{nc}(\mathcal L_{\mathrm{ctrl}}\circ\mathcal P(\mathcal G^{N_c}))$ respectively. Finally, using Jensen's inequality and the Khintchine-Kahane inequality, the desired upper bounds can be derived.

For the whole function class $\mathcal H = \mathcal F + \mathcal L_{\mathrm{ctrl}}\circ\mathcal G^{N_c}$, we have
$$\widetilde\Re_{nc}(\mathcal P(\mathcal H)) = \widetilde\Re_{nc}\big(\mathcal P(\mathcal F + \mathcal L_{\mathrm{ctrl}}\circ\mathcal G^{N_c})\big) = \widetilde\Re_{nc}\big(\mathcal P(\mathcal F) + \mathcal P(\mathcal L_{\mathrm{ctrl}}\circ\mathcal G^{N_c})\big) = \widetilde\Re_{nc}\big(\mathcal P(\mathcal F) + \mathcal L_{\mathrm{ctrl}}\circ\mathcal P(\mathcal G^{N_c})\big)$$
$$\le \widetilde\Re_{nc}(\mathcal P(\mathcal F)) + \widetilde\Re_{nc}\big(\mathcal L_{\mathrm{ctrl}}\circ\mathcal P(\mathcal G^{N_c})\big). \quad (\text{use Theorem 12 in (Bartlett \& Mendelson, 2002)})$$
Hence, upper bounding the worst-case Rademacher complexity for hypernetwork-based controllable learning methods reduces to upper bounding $\widetilde\Re_{nc}(\mathcal P(\mathcal F))$ and $\widetilde\Re_{nc}(\mathcal L_{\mathrm{ctrl}}\circ\mathcal P(\mathcal G^{N_c}))$ respectively. We first upper bound $\widetilde\Re_{nc}(\mathcal P(\mathcal F))$ as follows:
$$\widetilde\Re_{nc}(\mathcal P(\mathcal F)) = \sup_{\|\zeta_j(\phi_j(x_i; w_{\psi_j}))\|_2\le A:\,i\in[n],j\in[c]}\frac{\Lambda}{nc}\,\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\zeta_j(\phi_j(x_i; w_{\psi_j}))\Big\|_2 \quad (\text{use inequality (21)})$$
$$\le \sqrt2\rho\sup_{\|\phi_j(x_i;w_{\psi_j})\|_2\le B:\,i\in[n],j\in[c]}\frac{\Lambda}{nc}\,\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\phi_j(x_i; w_{\psi_j})\Big\|_2 \quad (\text{use that }\zeta_j\text{ is }\rho\text{-Lipschitz})$$
$$\le \sqrt2\rho\sup\frac{\Lambda}{nc}\Big[\mathbb E_\epsilon\Big\|\sum_{i,j}\epsilon_{ij}\,\phi_j(x_i;w_{\psi_j})\Big\|_2^2\Big]^{\frac12} \quad (\text{use Jensen's inequality})$$
$$= \sqrt2\rho\sup\frac{\Lambda}{nc}\Big[\sum_{i,j}\|\phi_j(x_i;w_{\psi_j})\|_2^2\Big]^{\frac12} \le \frac{\sqrt2\rho\Lambda B}{\sqrt{nc}}.$$
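The decomposition above rests on the subadditivity of the (empirical) Rademacher complexity over sum classes (Bartlett & Mendelson, Theorem 12). For finite function classes this can be verified exactly by enumerating all sign patterns (an illustrative check with arbitrary synthetic classes):

```python
import itertools
import numpy as np

def emp_rademacher(F):
    """Exact empirical Rademacher complexity of a finite class F,
    given as an array of shape (num_functions, n) of function values."""
    n = F.shape[1]
    total = 0.0
    for eps in itertools.product([-1.0, 1.0], repeat=n):
        total += max(float(np.dot(eps, f)) for f in F) / n
    return total / 2 ** n

rng = np.random.default_rng(5)
F = rng.normal(size=(4, 6))    # 4 functions evaluated on n = 6 points
G = rng.normal(size=(3, 6))
FG = np.array([f + g for f in F for g in G])   # the sum class F + G
# Subadditivity: R_hat(F + G) <= R_hat(F) + R_hat(G)
assert emp_rademacher(FG) <= emp_rademacher(F) + emp_rademacher(G) + 1e-12
```

The inequality holds pattern-by-pattern, since the supremum of a sum is at most the sum of the suprema.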
(Use Lemma A.3.) This gives inequality (23). We then upper bound the worst-case Rademacher complexity $\widetilde\Re_{nc}(\mathcal L_{\mathrm{ctrl}}\circ\mathcal P(\mathcal G^{N_c}))$ as follows:
$$\widetilde\Re_{nc}\big(\mathcal L_{\mathrm{ctrl}}\circ\mathcal P(\mathcal G^{N_c})\big) = \sup_{Z^{[c]}\subseteq\mathcal Z^{[c]\times n}}\mathbb E_\epsilon\sup_{\ell_{\mathrm{ctrl}}(\psi_j(z))\in\mathcal L_{\mathrm{ctrl}}\circ\mathcal G_j}\frac1{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\ell_{\mathrm{ctrl}}(\psi_j(z_i))$$
$$\le \sqrt2\mu_{\mathrm{ctrl}}\sup_{\psi_j\in\mathcal G_j}\mathbb E_\epsilon\,\frac1{nc}\Big|\sum_{i,j}\epsilon_{ij}\,\psi_j(z_i)\Big| \quad (\text{use that }\ell_{\mathrm{ctrl}}\text{ is }\mu_{\mathrm{ctrl}}\text{-Lipschitz})$$
$$\le \sqrt2\mu_{\mathrm{ctrl}}\sup_{|\psi_j(z_i)|\le C:\,i\in[n],j\in[c]}\frac1{nc}\Big[\sum_{i,j}\psi_j(z_i)^2\Big]^{\frac12} \quad (\text{use Lemma A.3}) \le \frac{\sqrt2\mu_{\mathrm{ctrl}}C}{\sqrt{nc}}.$$
Hence, we can derive the upper bound of the worst-case Rademacher complexity for hypernetwork-based controllable learning methods:
$$\widetilde\Re_{nc}(\mathcal P(\mathcal H)) \le \frac{\sqrt2\rho\Lambda B}{\sqrt{nc}} + \frac{\sqrt2\mu_{\mathrm{ctrl}}C}{\sqrt{nc}}.$$

B.3. Proof of Theorem 5.4

We upper bound the Rademacher complexity $\hat\Re_D(\mathcal F)$ of the 5-layer FNN as follows:
$$\hat\Re_D(\mathcal F) = \mathbb E_\epsilon\sup\frac1n\sum_{i=1}^n\epsilon_i f(x_i) = \mathbb E_\epsilon\sup\frac1n\sum_{i=1}^n\epsilon_i\big\langle w,\, \sigma(W_5\sigma(W_4\sigma(W_3\sigma(W_2\sigma(W_1x_i)))))\big\rangle$$
$$\le \Lambda\,\mathbb E_\epsilon\sup_\sigma\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,\sigma(W_5\sigma(W_4\cdots\sigma(W_1x_i)))\Big\| \le 2\Lambda\,\mathbb E_\epsilon\sup_{\|W_5\|_F\le B_F,\sigma}\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,W_5\sigma(W_4\cdots\sigma(W_1x_i))\Big\| \quad (\text{ReLU activation is 1-Lipschitz})$$
$$\le 2\Lambda B_F\,\mathbb E_\epsilon\sup_\sigma\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,\sigma(W_4\cdots\sigma(W_1x_i))\Big\| \le 2^2\Lambda B_F\,\mathbb E_\epsilon\sup_{\|W_4\|_F\le B_F,\sigma}\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,W_4\sigma(W_3\cdots\sigma(W_1x_i))\Big\|$$
$$\le 2^2\Lambda B_F^2\,\mathbb E_\epsilon\sup_\sigma\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,\sigma(W_3\sigma(W_2\sigma(W_1x_i)))\Big\| \le 2^3\Lambda B_F^2\,\mathbb E_\epsilon\sup_{\|W_3\|_F\le B_F,\sigma}\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,W_3\sigma(W_2\sigma(W_1x_i))\Big\|$$
$$\le 2^3\Lambda B_F^3\,\mathbb E_\epsilon\sup_\sigma\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,\sigma(W_2\sigma(W_1x_i))\Big\| \le 2^4\Lambda B_F^3\,\mathbb E_\epsilon\sup_{\|W_2\|_F\le B_F,\sigma}\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,W_2\sigma(W_1x_i)\Big\|$$
$$\le 2^4\Lambda B_F^4\,\mathbb E_\epsilon\sup_\sigma\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,\sigma(W_1x_i)\Big\| \le 2^5\Lambda B_F^4\,\mathbb E_\epsilon\sup_{\|W_1\|_F\le B_F}\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,W_1x_i\Big\| \le 2^5\Lambda B_F^5\,\mathbb E_\epsilon\Big\|\frac1n\sum_{i=1}^n\epsilon_i x_i\Big\|$$
$$\le 2^5\Lambda B_F^5\,\frac1n\Big[\mathbb E_\epsilon\Big\|\sum_{i=1}^n\epsilon_i x_i\Big\|^2\Big]^{\frac12} \quad (\text{use Jensen's inequality}) = 2^5\Lambda B_F^5\,\frac1n\Big[\sum_{i=1}^n\|x_i\|^2\Big]^{\frac12} \quad (\text{use Lemma A.3}) \le \frac{2^5\Lambda B_F^5R}{\sqrt n}.$$

B.4. Proof of Theorem 5.6

We first introduce the following lemma:

Lemma B.1 (Corollary A.7 in (Edelman et al., 2022)). For vectors $\theta_1, \theta_2\in\mathbb R^p$,
$$\|\mathrm{softmax}(\theta_1) - \mathrm{softmax}(\theta_2)\|_1 \le 2\|\theta_1 - \theta_2\|_\infty.$$
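As a sanity check on the 5-layer FNN bound (which is a loose worst-case bound), one can lower-estimate the empirical Rademacher complexity by maximizing over randomly sampled norm-constrained networks, computing the expectation over $\epsilon$ exactly for small $n$; any such estimate must fall below $2^5\Lambda B_F^5 R/\sqrt n$. All constants below are illustrative placeholders:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, d, Lam, BF, R = 4, 3, 1.0, 1.0, 2.0
X = rng.normal(size=(n, d))
X *= R / np.maximum(R, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i|| <= R

def sample_net():
    # Random 5-layer ReLU net with ||W_k||_F <= BF and ||w||_2 <= Lam.
    Ws = []
    for _ in range(5):
        W = rng.normal(size=(d, d))
        Ws.append(W * BF / np.linalg.norm(W))
    w = rng.normal(size=d)
    return Ws, w * Lam / np.linalg.norm(w)

def predict(net, x):
    Ws, w = net
    h = x
    for W in Ws:
        h = np.maximum(W @ h, 0.0)     # ReLU, 1-Lipschitz
    return float(w @ h)

nets = [sample_net() for _ in range(200)]
vals = np.array([[predict(net, x) for x in X] for net in nets])  # (200, n)
# Exact E_eps of the max over sampled nets: a lower estimate of R_hat_D(F).
est = np.mean([(vals @ np.array(e)).max() / n
               for e in itertools.product([-1.0, 1.0], repeat=n)])
assert est <= 2**5 * Lam * BF**5 * R / np.sqrt(n)
```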
Then, we upper bound the Rademacher complexity $\hat\Re_D(\mathcal F)$ of the scalar one-layer Transformer as follows:
$$\hat\Re_D(\mathcal F) = \mathbb E_\epsilon\sup_{w, y^{[\mathrm{CLS}]}\in\mathcal F}\frac1n\sum_{i=1}^n\epsilon_i\Big\langle w,\; W_C\,\sigma\big(W_VX_i^\top\mathrm{softmax}(X_iW_{QK}x^{[\mathrm{CLS}]})\big)\Big\rangle$$
$$\le B_w\,\mathbb E_\epsilon\sup_{\|W_C\|\le B_{W_C},\sigma}\Big\|\frac1n\sum_{i=1}^n\epsilon_i\,W_C\,\sigma\big(W_VX_i^\top\mathrm{softmax}(X_iW_{QK}x^{[\mathrm{CLS}]})\big)\Big\|$$
$$\le \frac{2B_wB_{W_C}B_{W_V}L_\sigma}{n}\,\mathbb E_\epsilon\sup_{W_{QK}}\Big\|\sum_{i=1}^n\epsilon_i\,X_i^\top\mathrm{softmax}(X_iW_{QK}x^{[\mathrm{CLS}]})\Big\| \quad (\sigma\text{ is }L_\sigma\text{-Lipschitz})$$
$$\le \frac{2B_wB_{W_C}B_{W_V}L_\sigma}{n}\sup_{i,W_{QK}}\big\|\mathrm{softmax}(X_iW_{QK}x^{[\mathrm{CLS}]})\big\|_1\,\mathbb E_\epsilon\Big|\sum_{i=1}^n\epsilon_i\,\|X_i^\top\|_{2,\infty}\Big| \quad (\text{use }\|Px\|_2\le\|P\|_{2,\infty}\|x\|_1)$$
$$\le \frac{4B_wB_{W_C}B_{W_V}L_\sigma}{n}\sup_{i,W_{QK}}\big\|X_iW_{QK}x^{[\mathrm{CLS}]}\big\|_\infty\,\mathbb E_\epsilon\Big|\sum_{i=1}^n\epsilon_i\,\|X_i^\top\|_{2,\infty}\Big| \quad (\text{use Lemma B.1})$$
$$= \frac{4B_wB_{W_C}B_{W_V}L_\sigma}{n}\sup_{i,W_{QK}}\max_t\big|\langle x_i^t,\, W_{QK}x^{[\mathrm{CLS}]}\rangle\big|\,\mathbb E_\epsilon\Big|\sum_{i=1}^n\epsilon_i\,\|X_i^\top\|_{2,\infty}\Big|$$
$$\le \frac{4B_wB_{W_C}B_{W_V}L_\sigma}{n}\sup_{i,W_{QK}}\max_t\|W_{QK}x_i^t\|\,\|x^{[\mathrm{CLS}]}\|\,\mathbb E_\epsilon\Big|\sum_{i=1}^n\epsilon_i\,\|X_i^\top\|_{2,\infty}\Big|$$
$$\le \frac{4B_wB_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^2}{n}\,\mathbb E_\epsilon\Big|\sum_{i=1}^n\epsilon_i\,\|X_i^\top\|_{2,\infty}\Big| \le \frac{4B_wB_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^2}{n}\Big[\sum_{i=1}^n\|X_i^\top\|_{2,\infty}^2\Big]^{\frac12} \quad (\text{use Jensen's inequality and Lemma A.3})$$
$$\le \frac{4B_wB_{W_C}B_{W_V}B_{W_{QK}}L_\sigma R^3}{\sqrt n}.$$
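Lemma B.1, the Lipschitz property of the softmax from the $\ell_\infty$ norm to the $\ell_1$ norm used in the Transformer bound, can be checked numerically (an illustrative sketch over random inputs):

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())        # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(7)
# Lemma B.1: || softmax(a) - softmax(b) ||_1 <= 2 || a - b ||_inf
for _ in range(1000):
    a = rng.normal(size=5) * 3
    b = rng.normal(size=5) * 3
    lhs = np.abs(softmax(a) - softmax(b)).sum()
    rhs = 2 * np.max(np.abs(a - b))
    assert lhs <= rhs + 1e-12
```

The factor 2 is consistent with bounding the entrywise sum of the softmax Jacobian $\mathrm{diag}(p) - pp^\top$, which is at most $2\sum_i p_i(1-p_i) \le 2$.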