Journal of Machine Learning Research 25 (2024) 1-54. Submitted 6/23; Revised 6/24; Published 10/24.

A Characterization of Multioutput Learnability

Vinod Raman vkraman@umich.edu
Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA

Unique Subedi subedi@umich.edu
Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA

Ambuj Tewari tewaria@umich.edu
Department of Statistics and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

Editor: Gergely Neu

Abstract

We consider the problem of learning multioutput function classes in the batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classification and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a characterization similar to that of the full-feedback setting.

Keywords: Learnability, Online Learning, Multilabel Classification, Multioutput Regression

1. Introduction

Multioutput learning is a problem where an instance is labeled by a vector-valued target. It generalizes scalar-valued-target learning settings such as multiclass classification and regression. Multioutput learning has enjoyed a wide range of practical applications, including image tagging, document categorization, recommender systems, and weather forecasting, to name a few. This widespread applicability has motivated the development of several practical methods (Kapoor et al., 2012; Borchani et al., 2015; Yang et al., 2020; Xu et al., 2013; Nam et al., 2017), as well as theoretical analyses (Koyejo et al., 2015b; Liu and Tsang, 2015).
* Equal contribution.
© 2024 Vinod Raman, Unique Subedi, and Ambuj Tewari. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0787.html.

However, the most fundamental question of learnability in a multioutput setting remains unanswered. Characterizing learnability is the foundational step toward understanding any statistical learning problem. The fundamental theorem of statistical learning characterizes the learnability of a binary function class in terms of the finiteness of a combinatorial quantity called the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1971, 1974). Extending VC theory, Natarajan (1989) proposed and studied the Natarajan dimension, which was later shown by Ben-David et al. (1995) to characterize learnability in multiclass settings with a finite number of labels. In fact, Ben-David et al. (1995) observed that the learnability of a multiclass function class F ⊆ {e1, . . . , eK}^X can also be characterized in terms of the learnability of each component-wise binary function class Fk = {x ↦ ⟨f(x), ek⟩ : f ∈ F}, where {e1, . . . , eK} is the standard basis of R^K and ⟨·, ·⟩ is the Euclidean inner product. Furthermore, they define an equivalence class of loss functions for the 0-1 loss and characterize the learnability of multiclass problems with respect to all losses in the equivalence class. We take a similar approach in this paper. Closing the question of learnability for multiclass problems, Brukhim et al. (2022) show that the Daniely-Shwartz (DS) dimension, originally proposed by Daniely and Shalev-Shwartz (2014), characterizes multiclass learnability in the infinite-label setting.
Similarly, in the online setting, the Littlestone dimension (Littlestone, 1987) characterizes the online learnability of a binary function class, and a generalization of the Littlestone dimension (Daniely et al., 2011) characterizes online learnability in the multiclass setting with finitely many labels. As for scalar-valued regression, the fat-shattering (Bartlett et al., 1996) and sequential fat-shattering (Rakhlin et al., 2015a) dimensions characterize batch and online learnability, respectively. Surprisingly, to the best of our knowledge, no such characterization of the learnability of multioutput function classes exists in the literature. In this paper, we close this gap by characterizing the learnability of function classes F ⊆ Y^X, where Y ⊆ R^K is a vector-valued target space for some K ∈ N. Let us define scalar-valued function classes Fk = {x ↦ ⟨f(x), ek⟩ : f ∈ F} for each k ∈ [K], where {e1, . . . , eK} is the standard basis of R^K. Similarly, define Yk := {⟨y, ek⟩ : y ∈ Y}. Our main result, informally stated below, asserts that F is learnable if and only if each coordinate restriction Fk is learnable.

Theorem (Informal). A multioutput function class F ⊆ Y^X is learnable if and only if each restriction Fk ⊆ Yk^X is learnable.

We prove a version of this result in four canonical settings: batch classification, online classification, batch regression, and online regression. For the batch settings, we consider the PAC framework, and for the online settings, we consider the fully adversarial model. In addition, our result holds for a wide family of loss functions. A unifying theme throughout all four learning settings is our ability to constructively convert a learning algorithm A for F into a learning algorithm Ak for Fk for each k ∈ {1, ..., K}, and vice versa. We show that even for multioutput losses that tightly couple the K coordinates of a function class, learnability still depends only on the learnability of each coordinate.
For the batch setting, our algorithmic techniques use the realizable-to-agnostic conversion introduced by Hopkins et al. (2022). In the online setting, we provide a new realizable-to-agnostic conversion similar in spirit to that of Hopkins et al. (2022). In principle, both our conversion and that of Hopkins et al. (2022) are based on the idea of using algorithms to construct a cover of a function class, originally introduced in the seminal work of Ben-David et al. (2009). Our proof techniques, however, do not extend naturally to the case when K is infinite. So, characterizing learnability for an infinite-dimensional target space is an interesting open problem. Moreover, our reductions are computationally inefficient and lead to sub-optimal sample complexities and regret bounds. As such, we leave the construction of efficient and optimal multioutput learning algorithms as an interesting direction for future work.

1.1 Related Works

Multilabel classification has been extensively studied in the batch setting. We review a few works here and refer the reader to the references therein. Dembczyński et al. (2010) quantify the 0-1 risk of multilabel classifiers trained by minimizing the Hamming loss and vice versa. Dembczyński et al. (2012) and Chekina et al. (2013) study how exploiting dependencies between labels can improve the predictive performance of multilabel classifiers and how such exploitation interacts with loss minimization. Jain et al. (2016) consider the case where the label set is extremely large and design new loss functions to handle these settings. Busa-Fekete et al. (2022) derive upper and lower bounds on the excess risk for non-parametric and parametric function classes for various loss functions assuming label sparsity. Gao and Zhou (2011) and Koyejo et al. (2015a) study the consistency of surrogate loss functions for multilabel classification.
Finally, Gentile and Orabona (2012) consider online multilabel classification under partial feedback and present a novel algorithm based on second-order descent methods. There is also a long history of studying least squares estimators for multioutput linear models in the statistical literature; see (Rao, 1965; Brown and Zidek, 1980) and references therein. The topic received widespread attention in learning theory following the seminal work of Micchelli and Pontil (2005) on RKHS methods for vector-valued regression. We refer the reader to the comprehensive review of kernel methods for vector-valued regression by Alvarez et al. (2012). An early work of Gnecco and Sanguineti (2008) provides estimation and approximation error bounds for vector-valued functions using Rademacher complexity. Following the influential work of Maurer (2016) on Rademacher contraction inequalities for vector-valued functions, there have been further works on the Rademacher analysis of vector-valued functions (see Cortes et al. (2016); Reeve and Kaban (2020); Yousefi et al. (2018); Foster and Rakhlin (2019)). Finally, we point out a recent work by Park and Muandet (2023) toward developing an empirical process theory for vector-valued functions.

2. Preliminaries

Let X denote the instance space and Y ⊆ R^K be the target space for some K ∈ N. For a space Z, we let Z* be the set of all finite sequences of elements of Z. Consider a vector-valued function class F ⊆ Y^X, where Y^X denotes the set of all functions from X to Y. For an unlabeled sample S_U ∈ X*, let F|_{S_U} denote the projection of F onto S_U. Let ⟨·, ·⟩ denote the Euclidean inner product on R^K. Define scalar-valued function classes Fk = {x ↦ ⟨f(x), ek⟩ : f ∈ F} for each k ∈ [K], where {e1, . . . , eK} is the standard basis of R^K. Here, each Fk ⊆ Yk^X, where Yk = {⟨y, ek⟩ : y ∈ Y} denotes the restriction of the target space to its kth component. For a function f ∈ F, we use fk(x) := ⟨f(x), ek⟩ to denote the kth coordinate of the output f(x).
On the other hand, we use yk := ⟨y, ek⟩ to denote the kth coordinate of y ∈ Y. Additionally, it is useful to distinguish between the range space Y and the image of functions f ∈ F. We define the image of the function class F as im(F) := ∪_{f∈F} im(f), where im(f) = {f(x) : x ∈ X}. Finally, we take [N] := {1, 2, . . . , N}. In this work, we only consider bounded, non-negative loss functions ℓ : Y × Y → R≥0 that satisfy the identity of indiscernibles. For the remainder of the paper, we drop the adjectives bounded and non-negative when referring to loss functions.

Definition 1 (Identity of Indiscernibles). A loss function ℓ : Y × Y → R≥0 satisfies the identity of indiscernibles whenever ℓ(y1, y2) = 0 if and only if y1 = y2.

Note that if ℓ1 and ℓ2 are two losses defined on Y × Y that satisfy the identity of indiscernibles, then ℓ1(y1, y2) = 0 if and only if ℓ2(y1, y2) = 0. We also define a notion of approximate subadditivity, although not all the loss functions we consider have this property.

Definition 2 (c-subadditive). A loss function ℓ is c-subadditive if there exists a constant c > 0 that depends only on the loss function ℓ such that ℓ(y1, y2) ≤ c ℓ(y1, y) + ℓ(y, y2) for all y, y1, y2 ∈ Y.

If |Y| < ∞, ℓ being c-subadditive is an immediate consequence of ℓ satisfying the identity of indiscernibles. In fact, the value of c in this case is max_{r≠t} ℓ(r, t) / min_{r≠t} ℓ(r, t). To see why this is true, it suffices to consider only the case when ℓ(y1, y2) > ℓ(y, y2), because the inequality is trivially true otherwise. Since the loss values are distinct, we must have y ≠ y1. Using the identity of indiscernibles, we obtain ℓ(y1, y) ≥ min_{r≠t} ℓ(r, t), thus implying that c ℓ(y1, y) ≥ max_{r≠t} ℓ(r, t) ≥ ℓ(y1, y2). The case when |Y| = ∞ is a bit delicate because min_{r≠t} ℓ(r, t) may not exist. So, one needs extra structure in the loss function to infer c-subadditivity. For instance, if ℓ is a distance metric, then it is trivially 1-subadditive due to the triangle inequality.
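As a concrete finite-Y check of the argument above (this sketch is our own illustration, not part of the paper), the following brute-forces the constant c = max_{r≠t} ℓ(r, t) / min_{r≠t} ℓ(r, t) for the Hamming loss on {−1, 1}^K and verifies the c-subadditivity inequality over all triples:

```python
from itertools import product

def hamming(y1, y2):
    """Hamming loss: number of coordinates where the labels disagree."""
    return sum(a != b for a, b in zip(y1, y2))

def subadditivity_constant(Y, loss):
    """Return c = max/min of the nonzero loss values and verify that
    loss(y1, y2) <= c*loss(y1, y) + loss(y, y2) for all triples
    (brute force; finite Y only)."""
    vals = [loss(r, t) for r in Y for t in Y if r != t]
    c = max(vals) / min(vals)
    for y1, y2, y in product(Y, repeat=3):
        assert loss(y1, y2) <= c * loss(y1, y) + loss(y, y2) + 1e-9
    return c

K = 2
Y = list(product([-1, 1], repeat=2))
print(subadditivity_constant(Y, hamming))  # max/min = K/1 = 2.0
```

For the Hamming loss on K coordinates, the resulting constant is K, since the loss takes values in {1, . . . , K} off the diagonal.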
2.1 Batch Setting

In the batch setting, we are interested in characterizing the learnability of F under the classical PAC models: both the original realizable formulation (Valiant, 1984) and the agnostic extension (Kearns et al., 1994).

Definition 3 (Agnostic Multioutput Learnability). A function class F is agnostic learnable with respect to a loss ℓ : Y × Y → R≥0 if there exists a function m : (0, 1)² → N and a learning algorithm A : (X × Y)* → Y^X with the following property: for every ϵ, δ ∈ (0, 1) and for every distribution D on X × Y, running algorithm A on n ≥ m(ϵ, δ) iid samples from D outputs a predictor g = A(S) such that, with probability at least 1 − δ over S ∼ D^n,

E_D[ℓ(g(x), y)] ≤ inf_{f∈F} E_D[ℓ(f(x), y)] + ϵ.

Note that we do not require the output predictor A(S) to be in F; we only require A(S) to compete with the best predictor in F. If we restrict D to a class of distributions such that inf_{f∈F} E_D[ℓ(f(x), y)] = 0, then we get realizable learnability. The learnability of a function class is generally characterized in terms of a complexity measure of the function class. As stated in the introduction, the VC dimension characterizes the learnability of binary function classes (Vapnik and Chervonenkis, 1974), and so does the Natarajan dimension for multiclass classification (Ben-David et al., 1995). In a scalar-valued regression problem, the fat-shattering dimension of the function class characterizes learnability with respect to the absolute-value and squared losses (Bartlett et al., 1996; Alon et al., 1997). In particular, a real-valued function class G ⊆ [0, B]^X is learnable if and only if its fat-shattering dimension, denoted fat_γ(G), is finite for every scale γ > 0. We extend this characterization to a wide range of loss functions in Lemma 11.
Finally, for a real-valued class G, we also use a more general complexity measure called the Rademacher complexity, denoted R_n(G), which provides a sufficient condition for learnability (Bartlett and Mendelson, 2003). Precise definitions of all these complexity measures are provided in Appendix A.1. One recurring theme in this work is to first construct a realizable multioutput learner and then convert it into an agnostic multioutput learner. It is well known that realizable learnability and agnostic learnability are equivalent for multiclass classification problems with respect to the 0-1 loss (see Ben-David et al. (1995) and (Shalev-Shwartz and Ben-David, 2014, Theorem 6.7)). Lemma 4, which is an immediate consequence of (Hopkins et al., 2022, Theorem 18), extends this equivalence between realizable and agnostic learning to general loss functions and target spaces.

Lemma 4 (Hopkins et al. (2022)). Consider a function class F ⊆ Y^X such that |im(F)| < ∞ and a general loss function ℓ : Y × Y → R≥0 that is c-subadditive. Then, F is realizable PAC learnable with respect to ℓ if and only if F is agnostic PAC learnable with respect to ℓ.

The result of (Hopkins et al., 2022, Theorem 18) is stated for the case when |Y| < ∞. However, in the regression setting, we need a slightly more general version of their result to handle α-discretized function classes F^α ⊆ Y^X (see the proof of Theorem 12), where |im(F^α)| < ∞ but |Y| = |[0, 1]^K| = ∞. Nevertheless, the proof of Lemma 4 requires only a minor modification of that of (Hopkins et al., 2022, Theorem 18). Given the central role of this result in our characterization, we provide a full proof of Lemma 4 in Section 3.1.2. Finally, we note that realizable-to-agnostic conversions in the batch setting are also possible via boosting and compression-based arguments (Montasser et al., 2019; Attias and Hanneke, 2023). We use the conversion of Hopkins et al. (2022) due to its generality and simplicity.
2.2 Online Setting

In the online setting, an adversary plays a sequential game with the learner over T rounds. In each round t ∈ [T], the adversary selects a labeled instance (xt, yt) ∈ X × Y and reveals xt to the learner. The learner makes a (potentially randomized) prediction ŷt ∈ Y. Finally, the adversary reveals the true label yt, and the learner suffers the loss ℓ(yt, ŷt), where ℓ is some pre-specified loss function. Given a function class F ⊆ Y^X, the goal of the learner is to output predictions ŷt such that its cumulative loss is close to the best possible cumulative loss over functions in F. A function class is online learnable if there exists an algorithm such that, for any sequence of labeled examples (x1, y1), ..., (xT, yT), the difference in cumulative loss between its predictions and the predictions of the best function in F is small.

Definition 5 (Online Multioutput Learnability). A multioutput function class F is online learnable with respect to a loss ℓ if there exists a (potentially randomized) algorithm A such that, for any adaptively chosen sequence of labeled examples (xt, yt) ∈ X × Y, the algorithm outputs A(xt) ∈ Y at every round t ∈ [T] such that

E[ Σ_{t=1}^T ℓ(A(xt), yt) − inf_{f∈F} Σ_{t=1}^T ℓ(f(xt), yt) ] ≤ R(T),

where the expectation is taken with respect to the randomness of A and that of the possibly adaptive adversary, and R(T) : N → R+ is the additive regret: a non-decreasing, sub-linear function of T.

If it is guaranteed that the learner always observes a sequence of examples labeled by some function f ∈ F, then we say we are in the realizable setting. On the other hand, if the true label yt is not revealed to the learner in each round t ∈ [T] and the adversary only reveals the learner's loss ℓ(A(xt), yt), then we say we are in the bandit setting. The online learnability of scalar-valued function classes H ⊆ Y^X has been characterized. For example, when Y is finite (i.e.
Y = [K] for some K ∈ N), the Multiclass Littlestone dimension (MCLdim) of H ⊆ Y^X characterizes online learnability with respect to the 0-1 loss. A function class H is online learnable with respect to the 0-1 loss if and only if MCLdim(H) is finite (Daniely et al., 2011). Moreover, MCLdim(H) tightly captures the best achievable regret in both the realizable and agnostic settings (Daniely et al., 2011). When the label space Y is a bounded subset of R, the sequential fat-shattering dimension of a real-valued function class H ⊆ R^X, denoted fat^seq_γ(H), characterizes the online learnability of H with respect to the absolute value loss (Rakhlin et al., 2015a). Unlike the Littlestone dimension, the sequential fat-shattering dimension is scale-sensitive. That is, fat^seq_γ(H) is defined at every scale γ > 0. Accordingly, a real-valued function class H ⊆ R^X is online learnable with respect to the absolute value loss if and only if its sequential fat-shattering dimension is finite at every scale γ > 0 (Rakhlin et al., 2015a). We extend this characterization to a wide range of loss functions in Lemma 22. Beyond scalar-valued learnability, for any label space Y, function class H ⊆ Y^X, and loss function ℓ : Y × Y → R≥0, the sequential Rademacher complexity of the loss class ℓ ∘ H, denoted R^seq_T(ℓ ∘ H), is a useful tool for providing sufficient conditions for online learnability (Rakhlin et al., 2015a). See Appendix A.2 for complete definitions. As in the batch setting, a key technique we use to prove online learnability is to first construct a realizable online learner and then convert it into an agnostic online learner. However, unlike the batch setting, there is no known generic algorithm that converts a (potentially randomized) realizable online learner into an agnostic online learner. Thus, one of the contributions of this work is Theorem 13, informally stated below, which provides an online analog of the realizable-to-agnostic conversion of Hopkins et al. (2022).

Theorem.
(Informal) Let F ⊆ Y^X be a multioutput function class such that |im(F)| < ∞ and let ℓ : Y × Y → R≥0 be any bounded, c-subadditive loss function. If F is online learnable with respect to ℓ in the realizable setting, then F is online learnable with respect to ℓ in the agnostic setting.

3. Batch Multioutput Learnability

3.1 Batch Multilabel Classification

In this section, we study the learnability of batch multilabel classification. Accordingly, let Y = {−1, 1}^K. First, we consider the learnability of a natural decomposable loss. Then, we extend the result to more general non-decomposable losses that satisfy the identity of indiscernibles. We want to point out that multilabel classification with K labels can be viewed as multiclass classification with 2^K labels. With this viewpoint, the Natarajan dimension of F continues to characterize batch multilabel learnability for any loss satisfying the identity of indiscernibles (see (Ben-David et al., 1995, Section 4)). For the sake of completeness, we also provide a proof of this characterization in Appendix B. However, it is more natural to view multilabel classification as K different binary classification problems rather than as a multiclass classification problem with 2^K labels. Exploiting this natural decomposability of a multilabel function class, we relate the learnability of F to the learnability of each component Fk.

3.1.1 Characterizing Batch Learnability for the Hamming Loss

A canonical and natural loss function for multilabel classification is the Hamming loss, defined as ℓH(f(x), y) := Σ_{k=1}^K 1{fk(x) ≠ yk}, where f(x) = (f1(x), . . . , fK(x)) and y = (y1, . . . , yK). The following result establishes an equivalence between the learnability of F with respect to the Hamming loss and the learnability of each Fk with respect to the 0-1 loss.

Theorem 6. A function class F ⊆ Y^X is agnostic PAC learnable with respect to ℓH if and only if each Fk ⊆ Yk^X is agnostic PAC learnable with respect to the 0-1 loss.
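The identity driving Theorem 6 is that the Hamming loss decomposes exactly into the K coordinate-wise 0-1 losses. A minimal numerical check (our own illustration):

```python
def hamming(yhat, y):
    """Hamming loss: sum over coordinates of the 0-1 loss."""
    return sum(a != b for a, b in zip(yhat, y))

def zero_one(a, b):
    """Scalar 0-1 loss on a single coordinate."""
    return int(a != b)

yhat, y = (1, -1, 1), (1, 1, -1)
per_coordinate = [zero_one(a, b) for a, b in zip(yhat, y)]
print(hamming(yhat, y), per_coordinate)  # 2 [0, 1, 1]
assert hamming(yhat, y) == sum(per_coordinate)
```

By linearity of expectation, the same decomposition holds for the Hamming risk, which is what the proof exploits in each direction.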
Proof We first prove that learnability of each component is sufficient, followed by the proof of necessity.

Part 1: Sufficiency. Here our goal is to prove that agnostic PAC learnability of each Fk is sufficient for agnostic PAC learnability of F. Our proof is constructive: given oracle access to agnostic PAC learners Ak for each Fk with respect to the 0-1 loss, we construct an agnostic PAC learner A for F with respect to ℓH. Let D be an arbitrary distribution on X × Y and let S = {(xi, yi)}_{i=1}^n ∼ D^n be iid samples from D. Denote by Dk the marginal distribution of D restricted to X × Yk. Then, for all k ∈ [K], the marginal samples Sk = {(xi, y_i^k)}_{i=1}^n with scalar-valued targets are iid samples from Dk. For each k ∈ [K], define hk = Ak(Sk) to be the hypothesis returned by algorithm Ak when trained on Sk. Let mk(ϵ, δ) denote the sample complexity of Ak. Since Ak is an agnostic PAC learner for Fk, for n ≥ max_k mk(ϵ/K, δ/K), with probability at least 1 − δ/K over samples Sk ∼ Dk^n,

E_{Dk}[1{hk(x) ≠ y^k}] ≤ inf_{fk∈Fk} E_{Dk}[1{fk(x) ≠ y^k}] + ϵ/K.

Summing these risk bounds over all coordinates k and taking a union bound over the failure probabilities, we get that with probability at least 1 − δ over samples S ∼ D^n,

Σ_{k=1}^K E_{Dk}[1{hk(x) ≠ y^k}] ≤ Σ_{k=1}^K inf_{fk∈Fk} E_{Dk}[1{fk(x) ≠ y^k}] + ϵ.

Now using the fact that F ⊆ F1 × . . . × FK followed by the linearity of expectation gives

E_D[ Σ_{k=1}^K 1{hk(x) ≠ y^k} ] ≤ inf_{f∈F} E_D[ Σ_{k=1}^K 1{fk(x) ≠ y^k} ] + ϵ.

This completes our proof, as it shows that the learning rule that concatenates the predictors returned by each Ak on the marginalized samples Sk is an agnostic PAC learner for F with respect to ℓH with sample complexity at most max_k mk(ϵ/K, δ/K).

Part 2: Necessity. Next, we show that if F is learnable with respect to ℓH, then each Fk is agnostic PAC learnable with respect to the 0-1 loss. Our proof is again based on a reduction: given oracle access to an agnostic PAC learner A for F, we construct an agnostic PAC learner A1 for F1.
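The sufficiency reduction in Part 1 can be sketched for a finite class, with an ERM oracle standing in for each agnostic learner Ak (the dict encoding of functions and all names below are our own illustration):

```python
from itertools import product

def erm_coordinate(S_k, F_k):
    """ERM over a finite scalar class F_k (functions encoded as dicts
    x -> label) with respect to the 0-1 loss on the marginal sample S_k."""
    return min(F_k, key=lambda f: sum(f[x] != y for x, y in S_k))

def concatenated_learner(S, F_components):
    """Train one learner per coordinate on its marginal sample and
    concatenate the returned predictors, as in Part 1 of the proof."""
    hs = []
    for k in range(len(F_components)):
        S_k = [(x, y[k]) for x, y in S]  # marginal sample for coordinate k
        hs.append(erm_coordinate(S_k, F_components[k]))
    return lambda x: tuple(h[x] for h in hs)

# toy example: X = {0, 1}, K = 2, each F_k = all functions X -> {-1, 1}
F_k = [{0: a, 1: b} for a, b in product([-1, 1], repeat=2)]
S = [(0, (-1, 1)), (1, (1, 1))]
h = concatenated_learner(S, [F_k, F_k])
print(h(0), h(1))  # (-1, 1) (1, 1): zero Hamming loss on the sample
```

The sketch mirrors the proof: each coordinate problem is solved on its marginal sample, and the Hamming risk of the concatenation is the sum of the coordinate 0-1 risks.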
A similar construction can be used for all other Fk's. Let D1 be an arbitrary distribution on X × Y1 and let S = {(xi, y_i^1)}_{i=1}^n be iid samples from D1. In order to use the algorithm A, we first augment the samples S to create samples with a K-variate target. Define an augmented sample S̃ = {(xi, (y_i^1, . . . , y_i^K))}_{i=1}^n such that y_i^k ∈ {−1, 1} each with probability 1/2 for all i ∈ [n] and k ∈ {2, . . . , K}. Next, we run A on S̃ and obtain the hypothesis h = (h1, . . . , hK) = A(S̃). We now show that h1 attains agnostic PAC bounds. Consider a distribution D̃ on X × Y such that a sample (x, (y1, . . . , yK)) from D̃ is obtained by first sampling (x, y1) ∼ D1 and appending yk's sampled independently from the uniform distribution on {−1, 1} for each k ∈ {2, . . . , K}. Let m(ϵ, δ, K) denote the sample complexity of A. Since A is an agnostic PAC learner for F, for n ≥ m(ϵ, δ, K), with probability at least 1 − δ, we have

E_{D̃}[ Σ_{k=1}^K 1{hk(x) ≠ yk} ] ≤ inf_{f∈F} E_{D̃}[ Σ_{k=1}^K 1{fk(x) ≠ yk} ] + ϵ.

For k ≥ 2, since the target is chosen uniformly at random from {−1, 1}, the 0-1 risk of any predictor is 1/2. Therefore, the expression above can be written as

E_{D1}[1{h1(x) ≠ y1}] + Σ_{k=2}^K 1/2 ≤ inf_{f∈F} E_{D1}[1{f1(x) ≠ y1}] + Σ_{k=2}^K 1/2 + ϵ,

which reduces to E_{D1}[1{h1(x) ≠ y1}] ≤ inf_{f1∈F1} E_{D1}[1{f1(x) ≠ y1}] + ϵ. Therefore, F1 is agnostic PAC learnable with respect to the 0-1 loss with sample complexity at most m(ϵ, δ, K).

3.1.2 Characterizing Batch Learnability for General Losses

In this section, we characterize learnability for general multilabel losses. Our main technical tool in characterizing learnability for general loss functions is the equivalence between realizable and agnostic learning guaranteed by Lemma 4. Thus, we first provide the proof of that lemma before proceeding further.

Proof (of Lemma 4) Note that agnostic learnability implies realizable learnability by definition. So, it suffices to show that realizable learnability of F with respect to ℓ implies agnostic learnability.
Our proof here is constructive. That is, given a realizable algorithm A for F, we provide an algorithm, stated as Algorithm 1, that constructs an agnostic learner for F.

Algorithm 1: Agnostic learner for F with respect to ℓ
Input: Realizable learner A for F with respect to ℓ, unlabeled samples S_U ∼ D_X, and labeled samples S_L ∼ D independent of S_U.
1. Run A over all possible labelings of S_U by F to construct a concept class C(S_U) := {A(S_U, f(S_U)) | f ∈ F|_{S_U}}.
2. Return ĝ ∈ C(S_U) with the lowest empirical error over S_L with respect to ℓ.

Let D be an arbitrary distribution over X × Y. Define f* := argmin_{f∈F} E_D[ℓ(f(x), y)] to be the optimal predictor in F. Now consider the predictor g = A(S_U, f*(S_U)) ∈ C(S_U) returned by A when trained on the samples S_U labeled by f*. Note that g exists in C(S_U) because we consider all possible labelings of S_U by F, and one of them is the labeling by f*. Let m_A(ϵ, δ, K) be the sample complexity of the algorithm A. Since A is a realizable learner for F, for |S_U| ≥ m_A(ϵ/(2c), δ/2, K), with probability at least 1 − δ/2 over the sample S_U ∼ D_X, we have

E_{D_X}[ℓ(g(x), f*(x))] ≤ ϵ/(2c),

where D_X is the marginal distribution of D restricted to X and c is the subadditivity constant of ℓ. Since the loss function is c-subadditive, for any (x, y) ∈ X × Y, we have ℓ(g(x), y) ≤ ℓ(f*(x), y) + c ℓ(g(x), f*(x)) pointwise. Taking expectation with respect to (x, y) ∼ D, we obtain

E_D[ℓ(g(x), y)] ≤ c E_{D_X}[ℓ(g(x), f*(x))] + E_D[ℓ(f*(x), y)] ≤ E_D[ℓ(f*(x), y)] + ϵ/2,

where the last inequality holds with probability at least 1 − δ/2. Thus, we have shown that there exists a predictor g ∈ C(S_U) that achieves agnostic PAC bounds for F with respect to ℓ. Let b be an upper bound on ℓ(·, ·). Recall that by Hoeffding's inequality and a union bound, with probability at least 1 − δ/2 over the sample S_L ∼ D, the empirical risk of every hypothesis in C(S_U), on a sample of size (8b²/ϵ²) log(4|C(S_U)|/δ), is at most ϵ/4 away from its population risk.
So, if |S_L| ≥ (8b²/ϵ²) log(4|C(S_U)|/δ), then with probability at least 1 − δ/2 over the sample S_L ∼ D, we have

(1/|S_L|) Σ_{(x,y)∈S_L} ℓ(g(x), y) ≤ E_D[ℓ(g(x), y)] + ϵ/4 ≤ E_D[ℓ(f*(x), y)] + 3ϵ/4.

Next, consider the predictor ĝ returned by Algorithm 1. Since it is an empirical risk minimizer over C(S_U), its empirical risk is at most the empirical risk of g. Given that the population risk of ĝ is at most ϵ/4 away from its empirical risk, we have

E_D[ℓ(ĝ(x), y)] ≤ E_D[ℓ(f*(x), y)] + 3ϵ/4 + ϵ/4 = inf_{f∈F} E_D[ℓ(f(x), y)] + ϵ,

where the last equality uses the definition of f*. Note that this inequality holds with probability at least 1 − δ, where the probability is taken over both samples S_U and S_L. Thus, we have shown that Algorithm 1 is an agnostic PAC learner for F with respect to ℓ. We now upper bound the sample complexity of Algorithm 1, denoted m(ϵ, δ, K) hereinafter. Note that m(ϵ, δ, K) is at most the number of unlabeled samples required for the realizable algorithm A to succeed plus the number of labeled samples required for the ERM step to succeed. Thus,

m(ϵ, δ, K) ≤ m_A(ϵ/(2c), δ/2, K) + (8b²/ϵ²) log(4|C(S_U)|/δ) ≤ m_A(ϵ/(2c), δ/2, K) + (8b²/ϵ²)( m_A(ϵ/(2c), δ/2, K) log(|im(F)|) + log(4/δ) ),

where the second inequality follows because |C(S_U)| ≤ |im(F)|^{|S_U|} and we need |S_U| to be of size m_A(ϵ/(2c), δ/2, K).

With Lemma 4 in hand, we can now relate the learnability of F with respect to any ℓ satisfying the identity of indiscernibles to the learnability of F with respect to ℓH. To that end, we prove a result establishing the equivalence of learnability between any two loss functions satisfying the identity of indiscernibles.

Lemma 7. Let ℓ and ℓ′ be any two loss functions satisfying the identity of indiscernibles. Then, a function class F ⊆ Y^X is agnostic PAC learnable with respect to ℓ if and only if F ⊆ Y^X is agnostic PAC learnable with respect to ℓ′.

The key idea behind the proof of Lemma 7 is to use a realizable learner for ℓ to construct a realizable learner for ℓ′. This is possible because ℓ′(y1, y2) = 0 if and only if ℓ(y1, y2) = 0 for any y1, y2 ∈ Y.
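For intuition, Algorithm 1 can be sketched for a finite class. The toy realizable learner below simply returns any hypothesis consistent with its sample; the dict encoding of functions and all names are our own illustration, not the paper's implementation:

```python
def algorithm1(A, F, S_U, S_L, loss):
    """Sketch of Algorithm 1 for a finite class F (functions as dicts
    x -> y). A is a realizable learner; loss is the per-example loss."""
    # Step 1: run A on every labeling of the unlabeled sample S_U by F
    labelings = {tuple(f[x] for x in S_U) for f in F}  # F restricted to S_U
    C = [A(list(zip(S_U, lab))) for lab in labelings]
    # Step 2: empirical risk minimization over the finite class C(S_U)
    return min(C, key=lambda g: sum(loss(g[x], y) for x, y in S_L))

# toy realizable learner: return any hypothesis in F consistent with the sample
F = [{0: -1, 1: -1}, {0: -1, 1: 1}, {0: 1, 1: 1}]
A = lambda sample: next(f for f in F if all(f[x] == y for x, y in sample))
zero_one = lambda a, b: int(a != b)

g_hat = algorithm1(A, F, S_U=[0, 1], S_L=[(0, -1), (1, 1), (1, 1)], loss=zero_one)
print(g_hat)  # {0: -1, 1: 1}: the empirical risk minimizer over C(S_U)
```

Note that |C(S_U)| ≤ |im(F)|^{|S_U|}, which is exactly the quantity controlling the labeled sample size in the bound above.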
Given such a realizable learner for ℓ′, Lemma 4 guarantees the existence of an agnostic learner for ℓ′.

Proof Since ℓ and ℓ′ are arbitrary, it suffices to prove only one direction. So, let us assume that F is learnable with respect to ℓ. We will now show that F is learnable with respect to ℓ′ as well. First, we show this for any realizable distribution D with respect to ℓ′. Since, for any y1, y2 ∈ Y, we have ℓ′(y1, y2) = 0 if and only if ℓ(y1, y2) = 0, the distribution D is also realizable with respect to ℓ. Furthermore, as there are at most 2^{2K} distinct possible inputs to ℓ′(·, ·), the loss function can only take finitely many values. So, we can always find universal constants a > 0 and b > 0 (that depend only on ℓ and ℓ′) such that a ℓ ≤ ℓ′ ≤ b ℓ. Given that F is learnable with respect to ℓ, there exists a learning algorithm A with the following property: for any ϵ, δ > 0 and S ∼ D^n with n = m_A(ϵ/b, δ, K), the algorithm outputs a predictor h = A(S) such that, with probability at least 1 − δ over S ∼ D^n, we have E_D[ℓ(h(x), y)] ≤ ϵ/b. Using the fact that ℓ′(h(x), y) ≤ b ℓ(h(x), y) pointwise, this inequality reduces to E_D[ℓ′(h(x), y)] ≤ ϵ. Therefore, any realizable learner A for ℓ is also a realizable learner for ℓ′. Finally, as ℓ′ satisfies the identity of indiscernibles and is thus c-subadditive, Lemma 4 guarantees the existence of an agnostic PAC learner B for F with respect to ℓ′. In particular, one such agnostic PAC learner B is Algorithm 1, whose sample complexity is at most

m_B(ϵ, δ, K) ≤ m_A(ϵ/(2cb), δ/2, K) + O( (1/ϵ²)( m_A(ϵ/(2cb), δ/2, K) K + log(1/δ) ) ),

where c > 1 is the subadditivity constant of ℓ′ and m_A(·, ·, K) is the sample complexity of the realizable algorithm A.

An immediate consequence of Lemma 7 is that learnability with respect to any loss ℓ satisfying the identity of indiscernibles is equivalent to learnability with respect to the Hamming loss ℓH. Thus, given Theorem 6, we can deduce the following result.

Theorem 8. Let ℓ be any multilabel loss function satisfying the identity of indiscernibles.
A function class F ⊆ Y^X is agnostic PAC learnable with respect to ℓ if and only if each restriction Fk ⊆ Yk^X is agnostic PAC learnable with respect to the 0-1 loss.

Remark. Since the learnability of a binary function class with respect to the 0-1 loss is characterized by its VC dimension (Shalev-Shwartz and Ben-David, 2014, Theorem 6.7), Theorem 8 implies that F is learnable with respect to any ℓ satisfying the identity of indiscernibles if and only if VC(Fk) < ∞ for each k ∈ [K].

3.2 Batch Multioutput Regression

In this section, we consider the case when Y = [0, 1]^K ⊂ R^K for K ∈ N. For bounded targets (with a known bound), this target space is without loss of generality because one can always normalize each Yk to [0, 1] by subtracting the lower bound and dividing by the length of the range of Yk. As usual, we consider an arbitrary multioutput function class F ⊆ Y^X. Following our outline for classification, we first study the learnability of F under decomposable losses and then under non-decomposable losses.

3.2.1 Characterizing Learnability for Decomposable Losses

A canonical loss for scalar-valued regression is the absolute value metric, d1(fk(x), yk) := |fk(x) − yk|. Analogously, dp(fk(x), yk) := |fk(x) − yk|^p for p > 1 are other natural scalar-valued losses. For multioutput regression, we consider decomposable losses that are natural multivariate extensions of the d1 metric. In particular, we consider loss functions with the following properties.

Assumption 1. The loss can be written as ℓ(f(x), y) = Σ_{k=1}^K ψk ∘ d1(fk(x), yk), where for each k ∈ [K], the function ψk : R≥0 → R≥0 is L-Lipschitz and satisfies ψk(0) = 0. Here, ψk ∘ d1 is the composition defined as ψk ∘ d1(fk(x), yk) := ψk(d1(fk(x), yk)).

Note that ψk ∘ d1 describes a large family of loss functions that effectively contains all natural decomposable multioutput regression losses. For instance, taking ψk(z) = |z|^p for p ≥ 1 gives the ℓp norms raised to their pth power.
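As a concrete instance of Assumption 1 (this construction is our own illustration), the following builds ℓ = Σk ψk ∘ d1 with ψ1(z) = z² and ψ2(z) = z, i.e., squared loss on the first coordinate and absolute loss on the second:

```python
def d1(a, b):
    """Absolute value metric on a single coordinate."""
    return abs(a - b)

def decomposable_loss(psis):
    """ell(yhat, y) = sum_k psi_k(d1(yhat_k, y_k)), as in Assumption 1."""
    return lambda yhat, y: sum(psi(d1(a, b)) for psi, a, b in zip(psis, yhat, y))

# K = 2: psi_1(z) = z^2 (squared loss), psi_2(z) = z (absolute loss);
# both are Lipschitz on [0, 1] and vanish at 0
ell = decomposable_loss([lambda z: z ** 2, lambda z: z])
print(ell((0.0, 0.0), (0.5, 1.0)))  # 0.5^2 + 1.0 = 1.25
```

Mixing different ψk across coordinates, as here, is exactly the flexibility Assumption 1 allows.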
Considering ψk(z) = (z²/2)·1[|z| ≤ δ] + δ(|z| − δ/2)·1[|z| > δ] for some 0 < δ < 1 yields a multivariate extension of the Huber loss used for robust regression. One may also construct a multioutput loss by considering different scalar-valued losses for each coordinate output. Next, we establish an equivalence between the learnability of F ⊆ Y^X with respect to ℓ and the learnability of each Fk with respect to the loss ψk ∘ d1.

Theorem 9. Let ℓ be any loss function that satisfies Assumption 1. Then, a function class F ⊆ Y^X is agnostic learnable with respect to ℓ if and only if each Fk ⊆ Y_k^X is agnostic learnable with respect to ψk ∘ d1.

Proof The proof of the sufficiency direction is similar to that of Theorem 6, so we defer it to Appendix C.1. We now focus on the necessity direction. To that end, we show that if F is agnostic learnable with respect to ℓ, then each Fk is agnostic learnable with respect to ψk ∘ d1. In particular, given oracle access to an agnostic learner A for F, we construct an agnostic learner A1 for F1. By symmetry, a similar reduction can then be used to construct an agnostic learner for each component Fk. Since we are given a sample with a single, univariate target, the main problem is to find the right way to augment samples to a K-variate target. In the proof of Theorem 6, we showed that randomly choosing y_{ik} ∼ Uniform({−1, 1}) for k ≥ 2 results in every predictor incurring a constant risk of 1/2 on those components, leaving only the risk of the first component on both sides. Unfortunately, in regression under general losses, no single augmentation works for every distribution on X. Thus, we augment the samples by considering all possible behaviors of (F2, . . . , FK) on the sample. Since the function class maps to a potentially uncountably infinite space, we first discretize each component of the function class and consider all possible labelings over the discretized space. Fix 1 > α > 0.
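The family in Assumption 1 can be instantiated concretely. A minimal sketch (the function names below are ours, not the paper's) that builds the p-th-power and Huber components and numerically spot-checks ψk(0) = 0 and L-Lipschitzness on [0, 1]:

```python
def psi_power(p):
    # psi_k(z) = |z|**p; on [0, 1] this is Lipschitz with constant L = p
    return lambda z: abs(z) ** p

def psi_huber(delta):
    # Huber component: quadratic near 0, linear in the tail; Lipschitz with L = delta on [0, 1]
    return lambda z: z * z / 2 if abs(z) <= delta else delta * (abs(z) - delta / 2)

def decomposable_loss(psis, fx, y):
    # l(f(x), y) = sum_k psi_k(d1(f_k(x), y_k)), as in Assumption 1
    return sum(psi(abs(fk - yk)) for psi, fk, yk in zip(psis, fx, y))

psis = [psi_power(2), psi_power(1), psi_huber(0.5)]
# every component satisfies psi_k(0) = 0, so the loss vanishes on exact predictions
assert decomposable_loss(psis, (0.3, 0.7, 0.9), (0.3, 0.7, 0.9)) == 0.0

# crude numerical check of L-Lipschitzness on a grid over [0, 1]
grid = [i / 100 for i in range(101)]
for psi, L in [(psi_power(2), 2.0), (psi_huber(0.5), 0.5)]:
    assert all(abs(psi(z1) - psi(z2)) <= L * abs(z1 - z2) + 1e-12
               for z1 in grid for z2 in grid)
```

Mixing different ψk per coordinate, as the text notes, is just a different choice of the `psis` list.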
For k ≥ 2, define the discretization f^α_k(x) = α⌊fk(x)/α⌋ for every fk ∈ Fk and the discretized component class F^α_k = {f^α_k | fk ∈ Fk}. Note that a function in F^α_k maps into the grid {0, α, 2α, . . . , ⌊1/α⌋α}, so the size of the range of the discretized function class is at most 1 + 1/α ≤ (α + 1)/α ≤ 2/α. For convenience of exposition, let us define F^α_{2:K} to be F^α without the first component, and we denote by f^α_{2:K} an element of F^α_{2:K}.

Algorithm 2 Agnostic learner for F1 with respect to ψ1 ∘ d1
Input: Agnostic learner A for F, a sample S = (x_{1:n}, y_{1,1:n}) ∼ D1^n, and another independent sample S̃ from D1
1 Define S_aug = {(x_{1:n}, y_{1,1:n}, f^α_{2:K}(x_{1:n})) | f^α_{2:K} ∈ F^α_{2:K}}, the set of all possible augmentations of S by F^α_{2:K}.
2 Run A over all possible augmentations to get C(S) := {A(S_a) | S_a ∈ S_aug}.
3 Define C1(S) = {g1 | (g1, . . . , gK) ∈ C(S)}, the restriction of C(S) to its first coordinate output.
4 Return ĝ1, the predictor in C1(S) with the lowest empirical error over S̃ with respect to ψ1 ∘ d1.

We now show that Algorithm 2 is an agnostic learner for F1. First, let us define

f*_1 := argmin_{f1 ∈ F1} E_{D1}[(ψ1 ∘ d1)(f1(x), y1)]

to be the optimal predictor in F1 with respect to D1. By definition of F1, there must exist f*_{2:K} ∈ F_{2:K} such that (f*_1, f*_{2:K}) ∈ F. We note that f*_k need not be an optimal predictor in Fk for k ≥ 2; we use the notation only to associate these component functions with the first component function f*_1. Define f*^α_{2:K} ∈ F^α_{2:K} to be the corresponding discretization of f*_{2:K}. At a high level, the key idea of this proof is to show that the algorithm A, when run on the sample (x_{1:n}, y_{1,1:n}, f*^α_{2:K}(x_{1:n})), produces a predictor g = A(x_{1:n}, y_{1,1:n}, f*^α_{2:K}(x_{1:n})) whose restriction g1 is a valid agnostic learner for F1. Although this single augmentation is enough to produce an agnostic learner for F1, all possible augmentations are required in step 2 of Algorithm 2 because f*_1 is not known to the learner a priori. We now make this argument precise.
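The discretization step and the size of the augmentation set built in step 1 of Algorithm 2 can be illustrated directly. A hedged sketch, assuming the grid scheme f^α(x) = α⌊f(x)/α⌋ described above (helper names are ours):

```python
import math
from itertools import product

def discretize(value, alpha):
    # snap a value in [0, 1] to the grid {0, alpha, 2*alpha, ...}
    return alpha * math.floor(value / alpha)

alpha = 0.25
grid = sorted({discretize(i / 1000, alpha) for i in range(1000)})
# the range of a discretized component has at most 2/alpha values
assert len(grid) <= 2 / alpha
# discretization moves any value by strictly less than alpha
assert all(0 <= v - discretize(v, alpha) < alpha for v in [0.0, 0.3, 0.49, 0.99])

# size of the augmentation set S_aug: every possible labeling of n points by the
# K - 1 discretized components, i.e. at most |grid|**(n * (K - 1)) augmented samples
n, K = 2, 3
augmentations = list(product(grid, repeat=n * (K - 1)))
assert len(augmentations) == len(grid) ** (n * (K - 1))
```

The exponential size of `augmentations` is why the analysis only needs the one augmentation by f*^α_{2:K} to succeed; the others are enumerated solely because f*_1 is unknown.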
Suppose g = A(x_{1:n}, y_{1,1:n}, f*^α_{2:K}(x_{1:n})) is the predictor obtained by running A on the sample augmented by f*^α_{2:K}. Note that g ∈ C(S) by definition. Let m_A(ϵ, δ, K) be the sample complexity of A. Since A is an agnostic learner for F with respect to ℓ, we have that for n ≥ m_A(ϵ/4, δ/2, K), with probability at least 1 − δ/2,

E_{D1}[(ψ1 ∘ d1)(g1(x), y1)] + Σ_{k=2}^K E_{D_X}[(ψk ∘ d1)(gk(x), f*^α_k(x))] ≤ inf_{f ∈ F} ( E_{D1}[(ψ1 ∘ d1)(f1(x), y1)] + Σ_{k=2}^K E_{D_X}[(ψk ∘ d1)(fk(x), f*^α_k(x))] ) + ϵ/4.

Note that the quantity on the left is trivially lower bounded by the risk of the first component. To handle the right-hand side, we first note that the optimal risk is trivially upper bounded by the risk of (f*_1, f*_{2:K}), yielding

E_{D1}[(ψ1 ∘ d1)(g1(x), y1)] ≤ E_{D1}[(ψ1 ∘ d1)(f*_1(x), y1)] + Σ_{k=2}^K E_{D_X}[(ψk ∘ d1)(f*_k(x), f*^α_k(x))] + ϵ/4.

Next, the L-Lipschitzness of ψk and the fact that ψk(0) = 0 imply (ψk ∘ d1)(f*_k(x), f*^α_k(x)) ≤ L·d1(f*_k(x), f*^α_k(x)) ≤ Lα for all k ≥ 2. Thus, picking α = ϵ/(4LK) and using the definition of f*_1, we obtain

E_{D1}[(ψ1 ∘ d1)(g1(x), y1)] ≤ inf_{f1 ∈ F1} E_{D1}[(ψ1 ∘ d1)(f1(x), y1)] + ϵ/2.

Therefore, we have shown the existence of one predictor g ∈ C(S) whose restriction to the first component, g1, attains the agnostic bound. Note that since ψ1 is L-Lipschitz and satisfies ψ1(0) = 0, we have (ψ1 ∘ d1)(·, ·) ≤ L; this upper bound also uses the fact that |f1(x) − y1| ≤ 1. Now recall that by Hoeffding's inequality and a union bound, with probability at least 1 − δ/2, the empirical risk of every hypothesis in C1(S) on a sample of size (8L²/ϵ²) log(4|C1(S)|/δ) is at most ϵ/4 away from its true risk. So, if |S̃| ≥ (8L²/ϵ²) log(4|C1(S)|/δ), then with probability at least 1 − δ/2, the empirical risk of the predictor g1 satisfies

(1/|S̃|) Σ_{(x,y1) ∈ S̃} (ψ1 ∘ d1)(g1(x), y1) ≤ E_{D1}[(ψ1 ∘ d1)(g1(x), y1)] + ϵ/4 ≤ inf_{f1 ∈ F1} E_{D1}[(ψ1 ∘ d1)(f1(x), y1)] + 3ϵ/4.

Since ĝ1, the output of Algorithm 2, is the ERM on S̃ over C1(S), its empirical risk is at most the empirical risk of g1, which is at most inf_{f1 ∈ F1} E_{D1}[(ψ1 ∘ d1)(f1(x), y1)] + 3ϵ/4.
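The Hoeffding-plus-union-bound sample size used in the ERM step can be computed directly. A minimal sketch (the function name is ours) of the |S̃| ≥ (8L²/ϵ²) log(4|C1(S)|/δ) requirement:

```python
import math

def erm_sample_size(eps, delta, num_hypotheses, L):
    # |S~| >= (8 L^2 / eps^2) * log(4 |C1(S)| / delta): with probability at least
    # 1 - delta/2, every hypothesis in the finite class C1(S) has empirical risk
    # within eps/4 of its true risk, simultaneously
    return math.ceil(8 * L ** 2 / eps ** 2 * math.log(4 * num_hypotheses / delta))

n = erm_sample_size(eps=0.1, delta=0.05, num_hypotheses=1000, L=1.0)
assert n == math.ceil(800 * math.log(80000))

# the dependence on the class size is only logarithmic, so even the exponentially
# large set of augmentations costs only a polynomial number of extra samples
assert erm_sample_size(0.1, 0.05, 10 ** 6, 1.0) < 2 * n
```

This logarithmic dependence on |C1(S)| is what keeps the sample complexity of Algorithm 2 finite despite step 2 enumerating all augmentations.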
Given that the population risk of ĝ1 is at most ϵ/4 away from its empirical risk, we can conclude that the population risk of ĝ1 satisfies

E_{D1}[(ψ1 ∘ d1)(ĝ1(x), y1)] ≤ inf_{f1 ∈ F1} E_{D1}[(ψ1 ∘ d1)(f1(x), y1)] + ϵ.

Applying a union bound, the entire process, running algorithm A on the dataset augmented by f*^α_{2:K} and the ERM in step 4, succeeds with probability at least 1 − δ. The sample complexity of Algorithm 2 is the sample complexity of algorithm A plus the sample complexity of the ERM in step 4, which is

m_A(ϵ/4, δ/2, K) + (8L²/ϵ²) log(4|C1(S)|/δ) ≤ m_A(ϵ/4, δ/2, K) + (8KL²/ϵ²) m_A(ϵ/4, δ/2, K) log(4KL/(ϵδ)),

where the second inequality follows because |C1(S)| ≤ (2/α)^{K·m_A(ϵ/4, δ/2, K)} is the required size of C1(S) under our choice of α = ϵ/(4KL). This completes the proof, as we have shown that Algorithm 2 is an agnostic learner for F1 with respect to ψ1 ∘ d1.

3.2.2 A More General Characterization of Learnability for Decomposable Losses

Unlike Theorems 6 and 8 in the classification setting, where we connected the learnability of F to the learnability of the Fk's with respect to the 0-1 loss, Theorem 9 relates the learnability of F to the learnability of each Fk with respect to ψk ∘ d1 instead of the more canonical loss d1. In this section, we complete that final step, characterizing learnability in terms of d1, under an additional assumption on ψk.

Assumption 2. For all k ∈ [K], the function ψk : R≥0 → R≥0 is monotonic.

Under these assumptions, Theorem 10 provides a more general characterization than Theorem 9.

Theorem 10. Let ℓ be any loss function that satisfies Assumptions 1 and 2. Then, a multioutput function class F ⊆ ([0, 1]^K)^X is agnostic learnable with respect to ℓ if and only if each Fk ⊆ [0, 1]^X is agnostic learnable with respect to d1.

Since the fat-shattering dimension of a real-valued function class characterizes its learnability with respect to the d1 loss, Theorem 10 implies that a multioutput function class F is learnable with respect to any ℓ satisfying Assumptions 1 and 2 if and only if fat_γ(Fk) < ∞ for each k ∈ [K] at every fixed scale γ > 0.
Theorem 10 is an immediate consequence of Theorem 9 and the following lemma, the proof of which is deferred to Appendix C.2.

Lemma 11. Let Y = [0, 1] be the label space and ψ : R≥0 → R≥0 be any Lipschitz and monotonic function that satisfies ψ(0) = 0. A scalar-valued function class G ⊆ [0, 1]^X is agnostic learnable with respect to ψ ∘ d1 if and only if G is agnostic learnable with respect to d1.

The part of the lemma showing that d1 learnability implies ψ ∘ d1 learnability is straightforward using a Rademacher-complexity argument and Talagrand's contraction lemma. However, proving that ψ ∘ d1 learnability implies d1 learnability is non-trivial. The case ψ(z) = |z|² is considered in (Anthony and Bartlett, 1999, Theorem 19.5), but their proof requires a mismatch between the label space and the prediction space, namely Y = [−1, 2] but G ⊆ [0, 1]^X. In this work, we improve their result by showing the equivalence between ψ ∘ d1 learnability and d1 learnability without requiring an extended label space. An application of Lemma 11 is that the learnability of a real-valued function class G with respect to the losses d1 and dp is equivalent for any p > 1.

3.2.3 Characterizing Learnability for Non-Decomposable Losses

Next, we study the learnability of the function class F with respect to non-decomposable losses. In the regression setting, the natural non-decomposable loss to consider is the ℓp norm, defined as ℓp(f(x), y) := (Σ_{k=1}^K |fk(x) − yk|^p)^{1/p} for 1 ≤ p < ∞. For p = ∞, the norm is defined as ℓ∞(f(x), y) := max_k |fk(x) − yk|. One might prefer ℓp norms over their decomposable counterparts, the ℓp^p losses discussed in the previous section, mainly for robustness to outliers. The following result characterizes the agnostic learnability of F with respect to ℓp norms.

Theorem 12. Fix p ≥ 1. The function class F ⊆ Y^X is agnostic learnable with respect to the ℓp norm if and only if each Fk ⊆ Y_k^X is agnostic learnable with respect to the absolute-value loss d1.
Using ψk(z) = |z| in Theorem 9 implies that F is learnable with respect to ℓ1 if and only if each Fk is learnable with respect to d1. Thus, to prove Theorem 12, it suffices to show that for all p > 1, F is learnable with respect to ℓp if and only if F is learnable with respect to the ℓ1 norm.

Proof We begin by proving the sufficiency direction. As discussed above, the learnability of each Fk with respect to d1 implies that F is learnable with respect to ℓ1. The high-level idea of the proof is to use an agnostic learner for F with respect to ℓ1 to construct a realizable learner for Fα with respect to ℓ1. Using the fact that ℓ1(y1, y2) = 0 if and only if ℓp(y1, y2) = 0 for any y1, y2 ∈ Y, the realizable learner for ℓ1 is also a realizable learner for ℓp. Finally, as |im(Fα)| < ∞ for every α > 0, Lemma 4 guarantees the existence of an agnostic learner for Fα with respect to ℓp. Then, a simple application of the triangle inequality shows that an agnostic learner for Fα is also an agnostic learner for F with respect to ℓp. We now proceed with the formal proof.

Part 1: Sufficiency. Fix p > 1. Recall that the learnability of each Fk with respect to d1 implies that F is learnable with respect to ℓ1. Let D be an arbitrary distribution on X × Y and A be an agnostic PAC learner for F with respect to ℓ1 with sample complexity m(ϵ, δ, K). For any ϵ, δ > 0 and n ≥ m(ϵ/2, δ, K), with probability at least 1 − δ over S ∼ D^n, we have

E_D[ℓ1(g(x), y)] ≤ inf_{f ∈ F} E_D[ℓ1(f(x), y)] + ϵ/2,

where g = A(S). Define Fα to be the discretized function class obtained by discretizing F component-wise using the scheme (1). Consider D to be a realizable distribution with respect to Fα. Note that the triangle inequality implies ℓ1(f(x), y) ≤ ℓ1(fα(x), y) + ℓ1(f(x), fα(x)) ≤ ℓ1(fα(x), y) + Kα. Taking α = ϵ/(2K) and using the fact that inf_{f ∈ F} E[ℓ1(fα(x), y)] = 0, we obtain E_D[ℓ1(g(x), y)] ≤ ϵ. Next, using the pointwise inequality ℓp(g(x), y) ≤ ℓ1(g(x), y) yields E_D[ℓp(g(x), y)] ≤ ϵ.
Therefore, A is a realizable learner for Fα with respect to ℓp with sample complexity m(ϵ/2, δ, K). Since |im(Fα)| < ∞ and ℓp(·, ·) is 1-subadditive, Lemma 4 implies that Fα is agnostic learnable with respect to ℓp via Algorithm 1, referred to as algorithm B henceforth. Thus, for any ϵ, δ > 0, any n ≥ m_B(ϵ/2, δ/2, K), and any distribution D̃ on X × Y, running B on S ∼ D̃^n outputs a predictor g̃ ∈ Y^X such that, with probability at least 1 − δ over S ∼ D̃^n, we have

E_{D̃}[ℓp(g̃(x), y)] ≤ inf_{fα ∈ Fα} E_{D̃}[ℓp(fα(x), y)] + ϵ/2 = inf_{f ∈ F} E_{D̃}[ℓp(fα(x), y)] + ϵ/2.

Using the triangle inequality, we have ℓp(fα(x), y) ≤ ℓp(f(x), y) + ℓp(fα(x), f(x)) ≤ ℓp(f(x), y) + αK pointwise. Again taking α = ϵ/(2K), we obtain

E_{D̃}[ℓp(g̃(x), y)] ≤ inf_{f ∈ F} E_{D̃}[ℓp(f(x), y)] + ϵ.

Therefore, we have shown that F is agnostic PAC learnable with respect to ℓp. The sample complexity of the agnostic learner B can be made precise using the sample complexity of Algorithm 1. In particular, the sample complexity of B is the sample complexity of the realizable learner A plus the sample complexity of the ERM in step 2 of Algorithm 1. The proof of Lemma 4 shows that the sample complexity of B must be

m_B(ϵ, δ, K) ≤ m_A(ϵ/(2c), δ/2, K) + (8b²/ϵ²)(log|im(Fα)| + log(4/δ)),

where b is the upper bound on the loss and c is the subadditivity constant. Since b ≤ K and c = 1 for all ℓp norms with p ≥ 1, we obtain

m_B(ϵ, δ, K) ≤ m_A(ϵ/2, δ/2, K) + (8K³/ϵ²) log(4K/(ϵδ)),

where we also use the fact that |im(Fα)| ≤ (2/α)^K for our choice of α = ϵ/(2K).

Part 2: Necessity. Fix p > 1. We now prove that F being learnable with respect to ℓp implies that F is learnable with respect to ℓ1. The proof is identical to the proof of sufficiency, so we only provide a sketch of the argument here. Our proof strategy follows a similar route through realizable learnability of the discretized class Fα and then the use of Lemma 4. Recall that any agnostic learner A for F with respect to ℓp is a realizable learner for Fα with respect to ℓp.
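The two pointwise inequalities doing the work here, ℓp ≤ ℓ1 and the discretization bound ℓ1(fα(x), y) ≤ ℓ1(f(x), y) + Kα, are easy to spot-check numerically. A small sketch under the same [0, 1]^K setup (helper names are ours):

```python
import random

def lp(u, v, p):
    # lp(u, v) = (sum_k |u_k - v_k|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

K, alpha = 4, 0.1
random.seed(0)
for _ in range(1000):
    f = [random.random() for _ in range(K)]
    y = [random.random() for _ in range(K)]
    f_alpha = [alpha * int(v / alpha) for v in f]  # component-wise discretization
    # lp <= l1 pointwise for p >= 1
    assert lp(f, y, 2) <= lp(f, y, 1) + 1e-12
    # triangle inequality: l1(f_alpha, y) <= l1(f, y) + K * alpha
    assert lp(f_alpha, y, 1) <= lp(f, y, 1) + K * alpha + 1e-12
```

Both checks pass on random inputs, matching the role these inequalities play in transferring the agnostic bound from Fα back to F.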
Using the pointwise inequality ℓ1(·, ·) ≤ K ℓp(·, ·), we can deduce that A is also a realizable learner for Fα with respect to ℓ1. Since |im(Fα)| ≤ (2/α)^K < ∞, Lemma 4 guarantees the existence of an agnostic learner for Fα with respect to ℓ1. Using the triangle inequality, we obtain ℓ1(fα(x), y) ≤ ℓ1(f(x), y) + ℓ1(fα(x), f(x)) ≤ ℓ1(f(x), y) + αK pointwise, and choosing an appropriate discretization scale allows us to turn an agnostic bound for Fα into an agnostic bound for F.

Remark. We note that Theorem 12 holds for any norm on R^K, but we only focus on ℓp norms here due to their practical significance. As the fat-shattering dimension of a real-valued function class characterizes its learnability with respect to the d1 loss (Bartlett et al., 1996), Theorem 12 implies that a multioutput function class F is learnable with respect to ℓp for 1 ≤ p ≤ ∞ if and only if fat_γ(Fk) < ∞ for all k ∈ [K] at every fixed scale γ > 0. Since we are only concerned with the question of learnability in this work, our focus is not on optimal sample complexity rates. However, we point out that for any p ≥ 1, if each Fk is learnable with respect to d1, then F is learnable with respect to ℓp via ERM with a better sample complexity than Algorithm 1. The proof of this claim is based on Rademacher complexity and is provided in Appendix D.

4. Online Multioutput Learnability

Here, we study the online learnability of multioutput function classes. Throughout this section, we give regret bounds assuming an oblivious adversary. A standard reduction (Chapter 4 in Cesa-Bianchi and Lugosi (2006)) allows us to convert oblivious regret bounds to adaptive regret bounds in the full-information setting. A key requirement for an oblivious regret bound to generalize to an adaptive regret bound is that the learner's predictions on round t should not depend on any of its past predictions from previous rounds. This is true for all of the online learning algorithms in this section.
4.1 Online Agnostic-to-Realizable Reduction

Our strategy for constructively characterizing the learnability of general losses in both the batch classification and regression settings required the ability to convert a realizable learner into an agnostic learner in a black-box fashion. In this section, we provide an analog of this conversion for the online setting. More specifically, we focus on the setting where F ⊆ Y^X is a multioutput function class whose image im(F) is finite. Then, for any c-subadditive loss function ℓ, we constructively convert a (potentially randomized) realizable online learner for F with respect to ℓ into an agnostic online learner for F with respect to ℓ. Theorem 13 formalizes the main result of this subsection.

Theorem 13. Let F ⊆ Y^X be a multioutput function class such that |im(F)| < ∞ and ℓ : Y × Y → R≥0 be any c-subadditive loss function such that ℓ(·, ·) ≤ M. If A is a realizable online learner for F with respect to ℓ with sub-linear expected regret R(T, |im(F)|), then for every β ∈ (0, 1), there exists an agnostic online learner for F with respect to ℓ with expected regret

c (T/T^β) R̄(T^β, |im(F)|) + M √(2T^{1+β} ln(|im(F)|)),

where R̄(T, |im(F)|) is any concave, sublinear upper bound on R(T, |im(F)|).

Note that if R̄(T^β, |im(F)|) is sublinear in its first argument, then c (T/T^β) R̄(T^β, |im(F)|) + M √(2T^{1+β} ln(|im(F)|)) is sublinear in T for any β ∈ (0, 1). By Lemma 14, we are guaranteed the existence of R̄(T, |im(F)|).

Lemma 14. (Ceccherini-Silberstein et al., 2017, Lemma 5.17) Let g be a positive sublinear function. Then, g is bounded from above by a concave sublinear function.

Therefore, Theorem 13 and Lemma 14 show that for any function class F with finite image space and any c-subadditive loss function, realizable and agnostic online learnability are equivalent. We now begin the proof of Theorem 13.

Proof Let A be a (potentially randomized) online realizable learner for F with respect to ℓ.
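Before continuing the proof, the regret bound of Theorem 13 can be checked to be sublinear numerically. A hedged sketch with c = 1 and a hypothetical realizable bound R̄(T) = √T (function names are ours):

```python
import math

def agnostic_regret(T, beta, M, im_size, R_bar):
    # c * (T / T^beta) * R_bar(T^beta) + M * sqrt(2 * T^(1 + beta) * ln|im(F)|), with c = 1
    T_beta = T ** beta
    return (T / T_beta) * R_bar(T_beta) + M * math.sqrt(2 * T ** (1 + beta) * math.log(im_size))

R_bar = math.sqrt  # a concave, sublinear upper bound on the realizable regret
rates = [agnostic_regret(T, beta=0.5, M=1.0, im_size=8, R_bar=R_bar) / T
         for T in (10 ** 3, 10 ** 5, 10 ** 7)]
# average regret per round tends to zero: the bound is sublinear in T
assert rates[0] > rates[1] > rates[2]
```

With β = 1/2 and R̄(T) = √T, both terms scale as T^{3/4}, so the average regret decays like T^{-1/4}.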
By definition, this means that for any (realizable) sequence (x1, f(x1)), . . . , (x_T, f(x_T)) labeled by a function f ∈ F, we have

E[ Σ_{t=1}^T ℓ(A(xt), f(xt)) ] ≤ R(T, |im(F)|),

where R(T, |im(F)|) is a sub-linear function of T. We now use A to construct an agnostic online learner Q for F with respect to ℓ. Since we are assuming an oblivious adversary, let (x1, y1), . . . , (x_T, y_T) ∈ (X × Y)^T denote the stream of points to be observed by the online learner and f* = argmin_{f ∈ F} Σ_{t=1}^T ℓ(f(xt), yt) the optimal function in hindsight. Our high-level strategy is to construct a large set of Experts that approximately cover all possible labelings of the instances x1, . . . , x_T by functions in F. In particular, each Expert uses an independent copy of A to make predictions but updates A using a different sequence of labeled instances. Together, our set of Experts updates A using all possible sequences of labeled instances. To ensure that the number of Experts is not too large, we construct such a set of Experts over a sufficiently small sub-sample of the stream. Finally, we run the celebrated Randomized Exponential Weights Algorithm (REWA) (Cesa-Bianchi and Lugosi, 2006) using our set of Experts and the scaled loss function ℓ/M over the original stream of points (x1, y1), . . . , (x_T, y_T). We now formalize this idea below.

For any bitstring b ∈ {0, 1}^T, let φ : {t : bt = 1} → im(F) denote a function mapping time points where bt = 1 to elements of the image space im(F). Let Φb = (im(F))^{{t : bt = 1}} denote the set of all such functions φ. For every f ∈ F, let φ^f_b ∈ Φb be the mapping such that for all t ∈ {t : bt = 1}, φ^f_b(t) = f(xt). Let |b| = |{t : bt = 1}|. For every b ∈ {0, 1}^T and φ ∈ Φb, define an Expert E_{b,φ}. Expert E_{b,φ}, formally presented in Algorithm 3, uses A to make predictions in each round. However, E_{b,φ} only updates A on those rounds where bt = 1, using φ to compute a labeled instance (xt, φ(t)).
For every b ∈ {0, 1}^T, let Eb = ∪_{φ ∈ Φb} {E_{b,φ}} denote the set of all Experts parameterized by functions φ ∈ Φb. If b is the all-zeros bitstring, then Eb is empty. Therefore, we actually define Eb = {E0} ∪ (∪_{φ ∈ Φb} {E_{b,φ}}), where E0 is the Expert that never updates A and plays ŷt = A(xt) for all t ∈ [T]. Note that 1 ≤ |Eb| ≤ (|im(F)|)^{|b|} + 1. With this notation in hand, we are now ready to present Algorithm 4, our main agnostic online learner Q for F with respect to ℓ. Our goal is to show that Q enjoys sublinear expected regret. There are three main sources of randomness: the randomness involved in sampling B, the internal randomness of A, and the internal randomness of REWA. Let B, A, and P denote the random variables associated with each source of randomness, respectively. By construction, B, A, and P are independent.

Algorithm 3 Expert(b, φ)
Input: Independent copy of realizable online learner A for F with respect to ℓ
1 for t = 1, . . . , T do
2   Receive example xt
3   Predict ŷt = A(xt)
4   if bt = 1 then
5     Update A by passing (xt, φ(t))

Algorithm 4 Agnostic online learner Q for F with respect to ℓ
Input: Parameter 0 < β < 1
1 Let B ∈ {0, 1}^T such that Bt ∼ iid Bernoulli(T^{β−1})
2 Construct the set of Experts EB = {E0} ∪ (∪_{φ ∈ ΦB} {E_{B,φ}}) according to Algorithm 3
3 Run REWA P using EB and the loss function ℓ/M over the stream (x1, y1), . . . , (x_T, y_T)

Using Theorem 21.11 in Shalev-Shwartz and Ben-David (2014) and the fact that B, A, and P are independent, REWA guarantees almost surely that

Σ_{t=1}^T E[ℓ(P(xt), yt) | B, A] ≤ inf_{E ∈ EB} Σ_{t=1}^T ℓ(E(xt), yt) + M √(2T ln(|EB|)).

Taking an outer expectation gives

E[ Σ_{t=1}^T ℓ(P(xt), yt) ] ≤ E[ inf_{E ∈ EB} Σ_{t=1}^T ℓ(E(xt), yt) + M √(2T ln(|EB|)) ].

Therefore,

E[ Σ_{t=1}^T ℓ(Q(xt), yt) ] = E[ Σ_{t=1}^T ℓ(P(xt), yt) ] ≤ E[ inf_{E ∈ EB} Σ_{t=1}^T ℓ(E(xt), yt) + M √(2T ln(|EB|)) ] ≤ E[ Σ_{t=1}^T ℓ(E_{B,φ^{f*}_B}(xt), yt) + M √(2T ln(|EB|)) ],

where f* is the optimal function in hindsight. In the last step, we used the fact that for all b ∈ {0, 1}^T and f ∈ F, we have E_{b,φ^f_b} ∈ Eb. It now suffices to upper bound E[ Σ_{t=1}^T ℓ(E_{B,φ^{f*}_B}(xt), yt) ]. To do so, we need some additional notation.
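A minimal sketch of step 3 of Algorithm 4: randomized exponential weights over a finite expert set, with losses scaled by 1/M. This is a generic REWA implementation under our own naming and learning-rate choice, not the paper's exact algorithm:

```python
import math
import random

def rewa(expert_preds, ys, loss, M, seed=0):
    # Randomized Exponential Weights: at each round, sample an expert with
    # probability proportional to exp(-eta * cumulative scaled loss)
    rng = random.Random(seed)
    N, T = len(expert_preds), len(ys)
    eta = math.sqrt(8 * math.log(N) / T)
    cum = [0.0] * N
    total = 0.0
    for t in range(T):
        weights = [math.exp(-eta * c) for c in cum]
        i = rng.choices(range(N), weights=weights)[0]
        total += loss(expert_preds[i][t], ys[t])
        for j in range(N):
            cum[j] += loss(expert_preds[j][t], ys[t]) / M  # scale losses by 1/M
    return total

# toy stream: two constant Experts, labels mostly 1
T = 200
ys = [1] * 180 + [0] * 20
preds = [[0] * T, [1] * T]
zero_one = lambda a, b: float(a != b)
alg_loss = rewa(preds, ys, zero_one, M=1.0)
best = min(sum(zero_one(p[t], ys[t]) for t in range(T)) for p in preds)
# expected regret is O(M * sqrt(T ln N)); allow generous slack for a single run
assert alg_loss <= best + 40
```

In the proof above, the expert set is EB rather than two constant predictors, and ln|EB| is what drives the √(2T ln|EB|) term.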
Given the realizable online learner A, an instance x ∈ X, and an ordered finite sequence of labeled examples L ∈ (X × Y)*, let A(x|L) be the random variable denoting the prediction of A on the instance x after running and updating on L. For any b ∈ {0, 1}^T, f ∈ F, and t ∈ [T], let L^f_b … γ > 0 is a sufficient and necessary condition for online multioutput learnability.

4.4.3 Characterizing Learnability of Non-Decomposable Losses

In this section, we characterize the online learnability of multioutput function classes F for a natural family of non-decomposable losses, the ℓp norms for 1 ≤ p ≤ ∞. We prove a theorem analogous to Theorem 12, relating the online learnability of F with respect to ℓp to the online learnability of each Fk with respect to d1. We note that the proof only uses the fact that ℓp norms are equivalent (up to a K-dependent constant) to the ℓ1 norm. Since any two norms on a finite-dimensional space are equivalent, Theorem 23 actually holds for any norm on R^K, but we only consider ℓp norms here due to their practical importance.

Theorem 23. Let 1 ≤ p ≤ ∞. A function class F ⊆ Y^X is online learnable with respect to ℓp if and only if each Fk ⊆ Y_k^X is online learnable with respect to d1.

By Theorem 20, F is online learnable with respect to ℓ1 if and only if each restriction Fk is online learnable with respect to d1. Thus, to prove Theorem 23, it suffices to show that F is online learnable with respect to ℓp if and only if F is online learnable with respect to ℓ1 for p > 1. At a high level, the proof of Theorem 23 follows a similar route as the proof of Theorem 12: convert an agnostic learner for F into a realizable learner for Fα and then use the realizable-to-agnostic conversion for Fα.

Proof Fix p > 1. By the argument above, it suffices to show that F is online learnable with respect to ℓp if and only if F is online learnable with respect to ℓ1.
We begin by proving sufficiency: if F is online learnable with respect to ℓ1, then F is online learnable with respect to ℓp. Let A be an online learner for F with respect to ℓ1. Our goal is to construct an online learner Q for F with respect to ℓp. We assume an oblivious adversary, and therefore the stream of points to be observed by the online learner Q, denoted (x1, y1), . . . , (x_T, y_T) ∈ (X × [0, 1]^K)^T, is fixed beforehand. Let f* = argmin_{f ∈ F} Σ_{t=1}^T ℓp(f(xt), yt) also denote the optimal function in hindsight with respect to the ℓp loss. Our strategy follows three steps. First, we show that A is a realizable online learner for Fα with respect to ℓp. Then, since |im(Fα)| ≤ (2/α)^K < ∞ is finite and ℓp is 1-subadditive (by the triangle inequality), Theorem 13 allows us to convert the realizable online learner A for Fα with respect to ℓp into an agnostic online learner Q for Fα with respect to ℓp. Finally, for an appropriately selected discretization parameter α, we show that Q is also an agnostic online learner for F with respect to ℓp, which completes the proof.

To that end, let (x1, f^{*,α}(x1)), . . . , (x_T, f^{*,α}(x_T)) denote a (realizable) sequence of instances labeled by some function f^{*,α} ∈ Fα. Since ‖·‖q ≤ ‖·‖p for q ≥ p ≥ 1, we have that

E[ Σ_{t=1}^T ℓp(A(xt), f^{*,α}(xt)) ] ≤ E[ Σ_{t=1}^T ℓ1(A(xt), f^{*,α}(xt)) ].

Since A is an online learner for F with respect to ℓ1, we get

E[ Σ_{t=1}^T ℓ1(A(xt), f^{*,α}(xt)) ] ≤ inf_{f ∈ F} Σ_{t=1}^T ℓ1(f(xt), f^{*,α}(xt)) + R_A(T) ≤ αKT + R_A(T),

where R_A(T) is the regret of the online learner A and the last inequality takes f to be any function in F whose discretization is f^{*,α}. Combining things together, we have

E[ Σ_{t=1}^T ℓp(A(xt), f^{*,α}(xt)) ] ≤ αKT + R_A(T),

showing that A is a realizable online learner for Fα with respect to ℓp for a small enough α.
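The norm comparisons used in both directions of this proof, ‖v‖p ≤ ‖v‖1 ≤ K‖v‖p, are easy to spot-check numerically. A small sketch (helper name is ours):

```python
import random

def lp_norm(v, p):
    # ||v||_p = (sum_k |v_k|^p)^(1/p)
    return sum(abs(x) ** p for x in v) ** (1 / p)

K, p = 5, 3
random.seed(1)
for _ in range(1000):
    v = [random.uniform(-1, 1) for _ in range(K)]
    l1, lpv = lp_norm(v, 1), lp_norm(v, p)
    # equivalence used in Theorem 23: ||v||_p <= ||v||_1 <= K * ||v||_p
    assert lpv <= l1 + 1e-9 and l1 <= K * lpv + 1e-9
```

The factor K in the second inequality is exactly what surfaces as the K-dependent constants in the regret bounds below.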
Now, since |im(Fα)| ≤ (2/α)^K < ∞ is finite, ℓp is a 1-subadditive loss function, and A is a realizable online learner, for any β ∈ (0, 1), Theorem 13 gives an agnostic online learner Q for Fα with respect to ℓp with the following regret guarantee over the original stream (x1, y1), . . . , (x_T, y_T):

E[ Σ_{t=1}^T ℓp(Q(xt), yt) ] ≤ inf_{fα ∈ Fα} Σ_{t=1}^T ℓp(fα(xt), yt) + (T/T^β) R̄(T^β) + K √(2T^{β+1} K ln(2/α)),

where R̄(T^β) is any concave sublinear upper bound of αKT^β + R_A(T^β). We also use the fact that the function T ↦ αKT^β is a concave sublinear function of T and that the sum of two concave functions is itself concave. Noting that ℓp(fα(xt), yt) ≤ ℓp(f(xt), yt) + ℓp(fα(xt), f(xt)) ≤ ℓp(f(xt), yt) + αK gives

E[ Σ_{t=1}^T ℓp(Q(xt), yt) ] ≤ inf_{f ∈ F} Σ_{t=1}^T ℓp(f(xt), yt) + αKT + (T/T^β) R̄(T^β) + K √(2T^{β+1} K ln(2/α)).

Combining like terms, we have

E[ Σ_{t=1}^T ℓp(Q(xt), yt) ] ≤ inf_{f ∈ F} Σ_{t=1}^T ℓp(f(xt), yt) + 2αKT + (T/T^β) R̄_A(T^β) + K √(2T^{β+1} K ln(2/α)),

where R̄_A(T^β) is any concave sublinear upper bound of R_A(T^β). Finally, picking α = 1/(2KT) gives

E[ Σ_{t=1}^T ℓp(Q(xt), yt) ] ≤ inf_{f ∈ F} Σ_{t=1}^T ℓp(f(xt), yt) + 1 + (T/T^β) R̄_A(T^β) + K √(2T^{β+1} K ln(4KT)).

Since R̄_A(T^β) is sublinear in T^β and β ∈ (0, 1), Q enjoys sublinear expected regret. Thus, we have shown that Q is an agnostic online learner for F with respect to ℓp.

The reverse direction follows identically and uses the fact that for any p > 1, ‖·‖p ≤ ‖·‖1 ≤ K‖·‖p. In particular, using the exact same argument, we can show that if A is an online learner for F with respect to ℓp, then A is also a realizable online learner for Fα with respect to ℓ1 with the expected regret bound

E[ Σ_{t=1}^T ℓ1(A(xt), f^{*,α}(xt)) ] ≤ K E[ Σ_{t=1}^T ℓp(A(xt), f^{*,α}(xt)) ] ≤ αK²T + K R_A(T).

Using A as the realizable learner in Theorem 13, for any β ∈ (0, 1), picking α = 1/((K + K²)T) gives the regret bound

E[ Σ_{t=1}^T ℓ1(Q(xt), yt) ] ≤ inf_{f ∈ F} Σ_{t=1}^T ℓ1(f(xt), yt) + 1 + K(T/T^β) R̄_A(T^β) + K √(2T^{β+1} K ln(4K²T)),

where R̄_A(T^β) is any concave sublinear upper bound of R_A(T^β). Since β ∈ (0, 1), Q is an online learner for F with respect to ℓ1, as needed. This completes the proof of Theorem 23.

Remark.
As with decomposable losses, Theorem 23 also implies that for any ℓp norm loss, the finiteness of fat^{seq}_γ(Fk) for all k ∈ [K] and every fixed scale γ > 0 is a sufficient and necessary condition for online multioutput learnability.

5. Discussion

In this work, we give a characterization of multioutput learnability in four settings: batch classification, online classification, batch regression, and online regression. In all four settings, we show that a multioutput function class is learnable if and only if each restriction is learnable. All of our bounds in this paper scale with K, preventing our current techniques from extending to the case when K is infinite. Accordingly, we pose as an open question the characterization of multioutput learnability when K is infinite (e.g., function-space-valued regression). Furthermore, we also leave it open to find combinatorial dimensions that provide tight quantitative characterizations of batch and online multioutput learnability.

Acknowledgements

We acknowledge the support of NSF via grants IIS-2007055 (AT) and DMS-2413089 (AT, US). VR acknowledges the support of the NSF Graduate Research Fellowship.

References

Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4), 1997. ISSN 0004-5411. doi: 10.1145/263867.263927.

Mauricio A. Álvarez, Lorenzo Rosasco, Neil D. Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. doi: 10.1017/CBO9780511624216.

Idan Attias and Steve Hanneke. Adversarially robust PAC learnability of real-valued functions, 2023.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Peter L. Bartlett and Shahar Mendelson.
Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463-482, 2003.

Peter L. Bartlett, Philip M. Long, and Robert C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434-452, 1996. ISSN 0022-0000. doi: 10.1006/jcss.1996.0033.

Shai Ben-David, Nicolò Cesa-Bianchi, and Philip M. Long. Characterizations of learnability for classes of {0, ..., n}-valued functions. Journal of Computer and System Sciences, 50(1):74-86, 1995.

Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.

Hanen Borchani, Gherardo Varando, Concha Bielza, and Pedro Larrañaga. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216-233, 2015.

Philip J. Brown and James V. Zidek. Adaptive multivariate ridge regression. The Annals of Statistics, 8(1):64-74, 1980.

Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability, 2022. URL https://arxiv.org/abs/2203.01550.

Róbert Busa-Fekete, Heejin Choi, Krzysztof Dembczyński, Claudio Gentile, Henry Reeve, and Balázs Szörényi. Regret bounds for multilabel classification in sparse label regimes. Advances in Neural Information Processing Systems, 35:5404-5416, 2022.

T. Ceccherini-Silberstein, M. Salvatori, and E. Sava-Huss. Groups, Graphs and Random Walks. London Mathematical Society Lecture Note Series. Cambridge University Press, 2017. doi: 10.1017/9781316576571.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Lena Chekina, Dan Gutfreund, Aryeh Kontorovich, Lior Rokach, and Bracha Shapira. Exploiting label dependencies for improved sample complexity. Machine Learning, 91:1-42, 2013.

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang.
Structured prediction theory based on factor graph complexity. Advances in Neural Information Processing Systems, 29, 2016.

Amit Daniely and Tom Helbertal. The price of bandit information in multiclass online classification. In Conference on Learning Theory, pages 93-104. PMLR, 2013.

Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 287-316, Barcelona, Spain, 13-15 Jun 2014. PMLR.

Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. In Sham M. Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 207-232, Budapest, Hungary, 09-11 Jun 2011. PMLR.

Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. Regret analysis for performance metrics in multi-label classification: the case of hamming and subset zero-one loss. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part I 21, pages 280-295. Springer, 2010.

Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88:5-45, 2012.

Dylan J. Foster and Alexander Rakhlin. ℓ∞ vector contraction for Rademacher complexity, 2019.

Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 341-358. JMLR Workshop and Conference Proceedings, 2011.

Claudio Gentile and Francesco Orabona. On multilabel classification and ranking with partial feedback. Advances in Neural Information Processing Systems, 25, 2012.
Giorgio Gnecco and Marcello Sanguineti. Estimates of the approximation error using Rademacher complexity: learning vector-valued functions. Journal of Inequalities and Applications, 2008:1–16, 2008.
Max Hopkins, Daniel M. Kane, Shachar Lovett, and Gaurav Mahajan. Realizable learning is all you need. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 3015–3069. PMLR, 02–05 Jul 2022.
Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944, 2016.
Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using bayesian compressed sensing. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/e1d5be1c7f2f456670de3d53c7b54f4a-Paper.pdf.
Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.
Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent multilabel classification. Advances in Neural Information Processing Systems, 28, 2015a.
Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent multilabel classification. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015b. URL https://proceedings.neurips.cc/paper/2015/file/85f007f8c50dd25f5a45fca73cad64bd-Paper.pdf.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1987.
Weiwei Liu and Ivor Tsang. On the optimality of classifier chain for multi-label classification. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/854d9fca60b4bd07f9bb215d59ef5561-Paper.pdf.
Andreas Maurer. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19-21, 2016, Proceedings 27, pages 3–17. Springer, 2016.
Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.
Omar Montasser, Steve Hanneke, and Nathan Srebro. VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, pages 2512–2530. PMLR, 2019.
Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J. Kim, and Johannes Fürnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/2eb5657d37f474e4c4cf01e4882b8962-Paper.pdf.
B. K. Natarajan. On learning sets and functions. Mach. Learn., 4(1):67–97, October 1989. ISSN 0885-6125. doi: 10.1023/A:1022605311895. URL https://doi.org/10.1023/A:1022605311895.
Junhyung Park and Krikamol Muandet. Towards empirical process theory for vector-valued functions: Metric entropy of smooth function classes. In Proceedings of The 34th International Conference on Algorithmic Learning Theory, 2023.
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. J. Mach. Learn. Res., 16(1):155–186, 2015a.
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari.
Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161:111–153, 2015b.
C. Radhakrishna Rao. The theory of least squares when the parameters are stochastic and its application to the analysis of growth curves. Biometrika, 52(3/4):447–458, 1965.
Henry Reeve and Ata Kaban. Optimistic bounds for multi-output learning. In International Conference on Machine Learning, pages 8030–8040. PMLR, 2020.
Mark Rudelson and Roman Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, pages 603–648, 2006.
Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.
Leslie G. Valiant. A theory of the learnable. In Symposium on the Theory of Computing, 1984.
V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. 1974.
Vladimir Naumovich Vapnik and Aleksei Yakovlevich Chervonenkis. On uniform convergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei i ee Primeneniya, 16(2):264–279, 1971.
Shuo Xu, Xin An, Xiaodong Qiao, Lijun Zhu, and Lin Li. Multi-output least-squares support vector regression machines. Pattern Recognition Letters, 34(9):1078–1084, 2013.
Liang Yang, Xi-Zhu Wu, Yuan Jiang, and Zhi-Hua Zhou. Multi-label learning with deep forest. In ECAI, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 1634–1641. IOS Press, 2020.
Niloofar Yousefi, Yunwen Lei, Marius Kloft, Mansooreh Mollaghasemi, and Georgios C. Anagnostopoulos. Local Rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38):1–47, 2018.
Appendix A. Complexity Measures
A.1 Complexity Measures for Batch Learning
In binary classification, the Vapnik-Chervonenkis (VC) dimension of a function class characterizes its learnability.
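Before the formal definitions, it may help to see the shattering condition concretely: a set is shattered when the class realizes every possible sign pattern on it. The sketch below (our own illustration, not part of the paper) checks shattering by brute force for a finite class; the one-dimensional threshold class in the example is a hypothetical stand-in.

```python
from itertools import combinations

def is_shattered(S, H):
    # S is shattered by H exactly when H realizes all 2^|S| sign
    # patterns (sigma in {-1, 1}^|S|) on the points of S.
    patterns = {tuple(h(x) for x in S) for h in H}
    return len(patterns) == 2 ** len(S)

def vc_dimension(X, H):
    # Exhaustive search over subsets of a finite domain X; exponential
    # in |X|, so this is an illustration only.
    d = 0
    for k in range(1, len(X) + 1):
        if any(is_shattered(S, H) for S in combinations(X, k)):
            d = k
    return d

# One-dimensional thresholds x -> sign(x - t) realize only "prefix"
# sign patterns, so every single point is shattered but no pair is;
# vc_dimension(X, H) returns 1 here.
X = [0, 1, 2]
H = [lambda x, t=t: 1 if x >= t else -1 for t in (-0.5, 0.5, 1.5, 2.5)]
```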
Definition 24 (Vapnik-Chervonenkis Dimension). A set S = {x1, . . . , xd} ⊆ X is shattered by a binary function class H ⊆ {−1, 1}^X if for every σ ∈ {−1, 1}^d, there exists a hypothesis hσ ∈ H such that for all i ∈ [d], we have hσ(xi) = σi. The VC dimension of H, denoted VC(H), is the size of the largest shattered set S ⊆ X. If shattered sets can be arbitrarily large, we say that VC(H) = ∞.
The learnability of a multiclass function class is characterized by its Natarajan dimension.
Definition 25 (Natarajan Dimension). A set S = {x1, . . . , xd} ⊆ X is shattered by a multiclass function class H ⊆ Y^X if there exist two witness functions f, g : S → Y such that f(xi) ≠ g(xi) for all i ∈ [d], and for every σ ∈ {−1, 1}^d, there exists a function hσ ∈ H such that for all i ∈ [d], we have
hσ(xi) = f(xi) if σi = 1, and hσ(xi) = g(xi) if σi = −1.
The Natarajan dimension of H, denoted Ndim(H), is the size of the largest shattered set S ⊆ X. If shattered sets can be arbitrarily large, we say that Ndim(H) = ∞.
For real-valued regression problems, learnability is characterized in terms of the fat-shattering dimension of a function class.
Definition 26 (Fat-Shattering Dimension). A real-valued function class G ⊆ [0, 1]^X shatters a set of points S = {x1, x2, . . . , xd} ⊆ X at scale γ ∈ (0, 1) if there exists a witness function r : S → [0, 1] such that for every σ ∈ {−1, 1}^d, there exists gσ ∈ G such that for all i ∈ [d], σi(gσ(xi) − r(xi)) ≥ γ. The fat-shattering dimension of G at scale γ, denoted fatγ(G), is the size of the largest set that can be γ-shattered by G. If γ-shattered sets can be arbitrarily large, we say that fatγ(G) = ∞.
We also define a more general notion of complexity, the Rademacher complexity, which provides a necessary and sufficient condition for uniform convergence. Since uniform convergence implies learnability, we use Rademacher complexity to establish sufficient conditions for learnability.
Definition 27 (Empirical Rademacher Complexity). Let D be a distribution over X × Y.
For a bounded loss function ℓ, define the loss class to be ℓ ◦ F = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}. If S = {(x1, y1), . . . , (xn, yn)} is a set of i.i.d. samples drawn from D, then the empirical Rademacher complexity of ℓ ◦ F is defined as
Rn(ℓ ◦ F) = Eσ [ sup_{f∈F} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi) ],
where σ ∈ {−1, 1}^n is a sequence of n i.i.d. Rademacher random variables.
A.2 Complexity Measures for Online Learning
For the complexity measures below, it is useful to define a Z-valued binary tree (Rakhlin et al., 2015b). A binary tree T of depth d is Z-valued if each of its internal nodes is labelled by an element of Z. Such a tree can be identified with a sequence (T1, . . . , Td) of labelling functions Ti : {−1, 1}^{i−1} → Z which provide the labels for each internal node. A path of length d is given by a sequence σ = (σ1, . . . , σd) ∈ {−1, 1}^d. Then, Ti(σ1, . . . , σi−1) gives the label of the node reached by following the path (σ1, . . . , σi−1) starting from the root, going right if σj = +1 and left if σj = −1. Note that T1 ∈ Z is the label of the root node. For brevity, we slightly abuse notation by letting Ti(σ) = Ti(σ1, . . . , σi−1).
Since the target space is finite, there exists a constant a > 0 such that a · ℓ(h(x), y) ≤ 1{h(x) ≠ y} for any (x, y) ∈ X × Y and any function h ∈ Y^X. Let D be a distribution realizable with respect to F under ℓ. Since ℓ(y1, y2) = 0 if and only if y1 = y2, the distribution D is also realizable with respect to the 0-1 loss. Since F is learnable with respect to the 0-1 loss, there exists a learning algorithm A with the following property: for any ϵ, δ > 0 and a sufficiently large sample S ∼ D^n, the algorithm outputs a predictor h = A(S) such that, with probability at least 1 − δ over S ∼ D^n, we have E_D[1{h(x) ≠ y}] ≤ a · ϵ. Using the inequality stated above pointwise, the predictor h also satisfies E_D[ℓ(h(x), y)] ≤ ϵ. Therefore, A is also a realizable learner with respect to ℓ. Since ℓ satisfies the identity of indiscernibles, Lemma 4 guarantees the existence of an agnostic PAC learner B for F with respect to ℓ.
Proof (of necessity) Suppose F is learnable with respect to ℓ.
Since the target space is finite, there must exist a constant b > 0 such that 1{h(x) ≠ y} ≤ b · ℓ(h(x), y) for any (x, y) ∈ X × Y and any function h ∈ Y^X. Let D be a realizable distribution with respect to ℓ. Since ℓ(y1, y2) = 0 if and only if y1 = y2, the distribution D is also realizable with respect to the 0-1 loss. Since F is learnable with respect to the loss ℓ, there exists a learning algorithm A with the following property: for any ϵ, δ > 0 and a sufficiently large sample S ∼ D^n, the algorithm outputs a predictor h = A(S) such that, with probability at least 1 − δ over S ∼ D^n, we have E_D[ℓ(h(x), y)] ≤ ϵ/b. Using the inequality stated above pointwise, we obtain E_D[1{h(x) ≠ y}] ≤ ϵ. Therefore, F is learnable with respect to the 0-1 loss in the realizable setting. As the finiteness of the Natarajan dimension is necessary for the learnability of F under the 0-1 loss (Natarajan, 1989), we must have Ndim(F) < ∞.
Appendix C. Proofs for Batch Multioutput Regression
C.1 Proof of Sufficiency in Theorem 9
Proof We first prove that the agnostic learnability of each Fk is sufficient for the agnostic learnability of F. As in the classification setting, the proof here is based on a reduction: given oracle access to agnostic learners Ak for each Fk with respect to the ψk ◦ d1 loss, we construct an agnostic learner A for F with respect to the loss ℓ. Denote by Dk the marginal distribution of D restricted to X × Yk, and let mk(ϵ, δ) denote the sample complexity of Ak. Then, for each k ∈ [K], the marginal sample Sk = {(xi, y_i^k)}_{i=1}^n with scalar-valued targets consists of i.i.d. samples from Dk. For each k ∈ [K], define gk = Ak(Sk) to be the predictor returned by algorithm Ak when trained on Sk.
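This coordinate-wise construction is purely mechanical: train one scalar learner per output coordinate on the corresponding marginal sample and concatenate the results. A minimal sketch, assuming each base learner maps a scalar-labeled sample to a predictor (the mean-predictor learner in the example is a hypothetical stand-in for Ak):

```python
def concatenate_learners(base_learners, sample):
    """Train base_learners[k] on the marginal sample S_k = {(x, y[k])}
    and return the vector-valued predictor x -> (g_1(x), ..., g_K(x))."""
    predictors = [
        A_k([(x, y[k]) for (x, y) in sample])   # g_k = A_k(S_k)
        for k, A_k in enumerate(base_learners)
    ]
    return lambda x: [g(x) for g in predictors]

def mean_learner(scalar_sample):
    # Hypothetical scalar learner: ERM over constant predictors under
    # squared loss, i.e. predict the mean of the observed labels.
    m = sum(y for _, y in scalar_sample) / len(scalar_sample)
    return lambda x: m

# Two-output toy sample; the concatenated predictor returns the
# per-coordinate label means.
sample = [(0.0, [0.0, 2.0]), (1.0, [2.0, 4.0])]
g = concatenate_learners([mean_learner, mean_learner], sample)
```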
Since Ak is an agnostic learner for Fk, we have that for sample size n ≥ max_k mk(ϵ/K, δ/K), with probability at least 1 − δ/K over samples Sk ∼ Dk^n,
E_{Dk}[ψk ◦ d1(gk(x), y^k)] ≤ inf_{fk∈Fk} E_{Dk}[ψk ◦ d1(fk(x), y^k)] + ϵ/K.
Summing these risk bounds over all K coordinates and union bounding over the failure probabilities, we get that with probability at least 1 − δ over samples S ∼ D^n,
Σ_{k=1}^K E_{Dk}[ψk ◦ d1(gk(x), y^k)] ≤ Σ_{k=1}^K inf_{fk∈Fk} E_{Dk}[ψk ◦ d1(fk(x), y^k)] + ϵ.
Using the fact that the sum of infima over individual coordinates is at most the infimum of the sum, followed by linearity of expectation, we can rewrite the expression above as
E_D[ Σ_{k=1}^K ψk ◦ d1(gk(x), y^k) ] ≤ inf_{f∈F} E_D[ Σ_{k=1}^K ψk ◦ d1(fk(x), y^k) ] + ϵ.
This shows that the learning rule that runs Ak on the marginal sample Sk and concatenates the resulting scalar-valued predictors into a vector-valued predictor is an agnostic learner for F with respect to the loss ℓ, with sample complexity at most max_k mk(ϵ/K, δ/K). This completes our proof of sufficiency.
C.2 Equivalence of d1 and ψ ◦ d1 Learnability in Batch Regression
In this section, we provide the proof of Lemma 11, which establishes the equivalence of d1 and ψ ◦ d1 learnability in scalar-valued batch regression.
Proof (of Lemma 11) To prove sufficiency first, let G be agnostically learnable with respect to d1. This implies that the fat-shattering dimension of G is finite at every scale (Bartlett et al., 1996) and uniform convergence holds over the loss class d1 ◦ G. Since ψ is a Lipschitz function, a simple application of Talagrand's contraction lemma on Rademacher complexity (Bartlett and Mendelson, 2003) implies that uniform convergence holds over the loss class ψ ◦ d1 ◦ G as well. Thus, G is learnable with respect to ψ ◦ d1 via ERM. Next, we show that if G ⊆ [0, 1]^X is learnable with respect to ψ ◦ d1, then G is learnable with respect to d1.
Since the fat-shattering dimension of G characterizes the d1 learnability of G, it suffices to show that G being learnable with respect to ψ ◦ d1 implies fatγ(G) < ∞ for every γ ∈ (0, 1). Suppose, for the sake of contradiction, that G is learnable with respect to ψ ◦ d1 but there exists a scale γ ∈ (0, 1) such that fatγ(G) = ∞. Then, for every d ∈ N, there exist X̄ = {x1, . . . , xd} ⊆ X and a witness function r : X̄ → [0, 1] such that for every σ ∈ {−1, 1}^d, there exists a gσ ∈ G with σi(gσ(xi) − r(xi)) ≥ γ for all i ∈ [d]. Define GX̄ = {gσ ∈ G | σ ∈ {−1, 1}^d} to be a set of functions that shatters X̄, and define H = {−1, 1}^X̄ to be the set of all functions from X̄ to {−1, 1}. By definition of H, we must have VC(H) = d. We use an agnostic learner for G with respect to ψ ◦ d1 to construct an agnostic learner for H whose sample complexity, for large enough d, is smaller than the known lower bound for VC classes. Since fatγ(G) = ∞, d can be made arbitrarily large, and thus we derive a contradiction.
Let A be the promised agnostic learner for G with respect to ψ ◦ d1 with sample complexity m(ϵ, δ). For every f ∈ [0, 1]^X, define a threshold function hf : X̄ → {−1, 1} as hf(x) = 2 · 1{f(x) ≥ r(x)} − 1. Let D be an arbitrary distribution on X̄ × {−1, 1} and DX̄ its marginal on X̄.
Algorithm 7 Agnostic PAC learner for H
Input: Agnostic learner A for G, an unlabeled sample SU ∼ DX̄, and an independent labeled sample SL ∼ D.
1 Define Saug = {(SU, gα(SU)) | g ∈ GX̄}, the set of all possible augmentations of SU by α-discretizations of functions in GX̄, for α ≤ γ/2.
2 Run A over all possible augmentations to get C(SU) := {A(S) | S ∈ Saug}.
3 Define C±1(SU) = {hf | f ∈ C(SU)}, the thresholded class of C(SU).
4 Return the predictor in C±1(SU) with the lowest empirical 0-1 risk over SL.
We now show that Algorithm 7 is an agnostic learner for H. Consider d ≫ |SU| + |SL|. Then, |C±1(SU)| ≤ |Saug| ≤ (2/α)^{|SU|} can be much smaller than 2^d. Let h⋆ := argmin_{h∈H} E_D[1{h(x) ≠ y}] be the optimal hypothesis for D.
Note that, by the definition of shattering, for every h ∈ H, there exists a g ∈ GX̄ such that h(x) = hg(x) for all x ∈ X̄. In particular, there must exist g⋆ ∈ GX̄ such that h⋆(x) = hg⋆(x) := 2 · 1{g⋆(x) ≥ r(x)} − 1 for all x ∈ X̄. Let g⋆,α denote the α-discretization of g⋆ as defined in Equation (1) for some α ≤ γ/2. Now, consider the sample (SU, g⋆,α(SU)) ∈ Saug. Let ĝ = A((SU, g⋆,α(SU))) be the predictor returned by the algorithm when run on the sample labeled by g⋆,α, and define hĝ(x) = 2 · 1{ĝ(x) ≥ r(x)} − 1 to be its thresholded function. Then, using the triangle inequality on the indicator function, we have
E_D[1{hĝ(x) ≠ y}] ≤ E_D[1{h⋆(x) ≠ y}] + E_{DX̄}[1{hĝ(x) ≠ h⋆(x)}]. (2)
Note that 1{hĝ(x) ≠ h⋆(x)} = 1{hĝ(x) ≠ hg⋆(x)} ≤ 1{|ĝ(x) − g⋆(x)| ≥ γ}. To see why the last inequality is true, we only have to consider the case where the indicator on the left is 1; otherwise, the inequality is trivial. Recall that 1{hĝ(x) ≠ hg⋆(x)} = 1 exactly when ĝ(x) and g⋆(x) lie on opposite sides of the witness r(x). Since g⋆ has to be at least γ away from the witness r(x), we obtain 1{|ĝ(x) − g⋆(x)| ≥ γ} = 1. Next, using the fact that α ≤ γ/2, we have 1{|ĝ(x) − g⋆(x)| ≥ γ} ≤ 1{|ĝ(x) − g⋆,α(x)| ≥ γ/2}, because discretization can decrease the distance between these functions by at most γ/2. Furthermore, using the monotonicity of ψ, we get 1{|ĝ(x) − g⋆,α(x)| ≥ γ/2} ≤ 1{ψ(|ĝ(x) − g⋆,α(x)|) ≥ ψ(γ/2)}. Combining everything, we get the pointwise inequality
1{hĝ(x) ≠ h⋆(x)} ≤ 1{ψ(|ĝ(x) − g⋆,α(x)|) ≥ ψ(γ/2)} ≤ (1/ψ(γ/2)) ψ(|ĝ(x) − g⋆,α(x)|).
Using this inequality gives an upper bound on the risk of hĝ, namely
E_{DX̄}[1{hĝ(x) ≠ h⋆(x)}] ≤ (1/ψ(γ/2)) E_{DX̄}[ψ(|ĝ(x) − g⋆,α(x)|)]. (3)
Since ĝ = A((SU, g⋆,α(SU))), we can use the algorithm's guarantee to further upper bound the expectation above.
In particular, if |SU| ≥ m(ϵψ(γ/2)/4, δ/2), then with probability at least 1 − δ/2 over the sampling of SU ∼ DX̄, we have
E_{DX̄}[ψ(|ĝ(x) − g⋆,α(x)|)] ≤ inf_{g∈G} E_{DX̄}[ψ(|g(x) − g⋆,α(x)|)] + ϵψ(γ/2)/4.
Note that inf_{g∈G} E_{DX̄}[ψ(|g(x) − g⋆,α(x)|)] ≤ E_{DX̄}[ψ(|g⋆(x) − g⋆,α(x)|)] ≤ ψ(α), where the last step uses the fact that |g⋆(x) − g⋆,α(x)| ≤ α and ψ is monotonic. Using the L-Lipschitzness of ψ and the fact that ψ(0) = 0, we get ψ(α) ≤ Lα. Picking α = min(γ/2, ϵψ(γ/2)/(4L)), we get inf_{g∈G} E_{DX̄}[ψ(|g(x) − g⋆,α(x)|)] ≤ ϵψ(γ/2)/4. Plugging this back into the inequality in the display above, we get E_{DX̄}[ψ(|ĝ(x) − g⋆,α(x)|)] ≤ ϵψ(γ/2)/2. Using this guarantee on (3), we obtain
E_{DX̄}[1{hĝ(x) ≠ h⋆(x)}] ≤ ϵ/2.
This bound applied to (2) yields
E_D[1{hĝ(x) ≠ y}] ≤ inf_{h∈H} E_D[1{h(x) ≠ y}] + ϵ/2.
Thus, we have shown the existence of a predictor hĝ ∈ C±1(SU) that achieves the agnostic PAC bound for H. Let ĥ be the predictor returned by step 4 of the algorithm. Next, we show that for sufficiently large SL, the predictor ĥ also attains the agnostic PAC bound. Recall that by Hoeffding's inequality and a union bound, with probability at least 1 − δ/2, the empirical risk on a sample of size (8/ϵ²) log(4|C±1(SU)|/δ) of every hypothesis in C±1(SU) is at most ϵ/4 away from its true error. So, if |SL| ≥ (8/ϵ²) log(4|C±1(SU)|/δ), then with probability at least 1 − δ/2, the empirical risk of the predictor hĝ satisfies
(1/|SL|) Σ_{(x,y)∈SL} 1{hĝ(x) ≠ y} ≤ E_D[1{hĝ(x) ≠ y}] + ϵ/4 ≤ inf_{h∈H} E_D[1{h(x) ≠ y}] + 3ϵ/4,
where the last inequality follows from the risk guarantee of hĝ established above. Since ĥ is the empirical risk minimizer over SL, we must have
(1/|SL|) Σ_{(x,y)∈SL} 1{ĥ(x) ≠ y} ≤ (1/|SL|) Σ_{(x,y)∈SL} 1{hĝ(x) ≠ y} ≤ inf_{h∈H} E_D[1{h(x) ≠ y}] + 3ϵ/4.
Finally, as the population risk of ĥ is at most ϵ/4 away from its empirical risk, we have
E_D[1{ĥ(x) ≠ y}] ≤ inf_{h∈H} E_D[1{h(x) ≠ y}] + ϵ,
which is the agnostic PAC guarantee for H. Applying a union bound, the entire process (running algorithm A on the dataset augmented by g⋆,α, and the ERM in step 4) succeeds with probability at least 1 − δ.
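The two elementary operations Algorithm 7 relies on, discretizing a real-valued function and thresholding it against the witness r, can be sketched as follows (our own illustration; the rounding rule is a stand-in for the α-discretization of Equation (1), which is defined earlier in the paper and not restated here):

```python
def discretize(g, alpha):
    # Stand-in for the alpha-discretization g_alpha of Equation (1):
    # round g(x) down to the nearest multiple of alpha, so that
    # |g(x) - g_alpha(x)| <= alpha pointwise.
    return lambda x: alpha * int(g(x) / alpha)

def threshold(f, r):
    # h_f(x) = 2 * 1{f(x) >= r(x)} - 1, as in step 3 of Algorithm 7.
    return lambda x: 1 if f(x) >= r(x) else -1

# A constant toy function: its 0.25-discretization maps 0.73 to 0.5,
# and thresholding against a witness flips sign at r(x).
g = lambda x: 0.73
g_alpha = discretize(g, 0.25)
```

Since the discretized functions take at most about 2/α values per point, there are at most (2/α)^{|SU|} distinct labelings of SU, which is the counting step used in bounding |C±1(SU)|.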
This establishes that Algorithm 7 is an agnostic PAC learner for H. The sample complexity of Algorithm 7 is the number of samples required for algorithm A to succeed and for the ERM in step 4 to succeed. Thus, the overall sample complexity of Algorithm 7, denoted mH(ϵ, δ), can be bounded as
mH(ϵ, δ) ≤ m(ϵψ(γ/2)/4, δ/2) + (8/ϵ²) log(4|C±1(SU)|/δ) ≤ m(ϵψ(γ/2)/4, δ/2) + (8/ϵ²) [ m(ϵψ(γ/2)/4, δ/2) log(2/α) + log(4/δ) ],
where the second inequality follows because |C±1(SU)| ≤ |Saug| ≤ (2/α)^{|SU|}, we need |SU| to be of size m(ϵψ(γ/2)/4, δ/2), and α = min(γ/2, ϵψ(γ/2)/(4L)). However, it is well known (Shalev-Shwartz and Ben-David, 2014, Theorem 6.8) that the sample complexity of learning H in the agnostic setting is at least C(d + log(2/δ))/ϵ² for some universal constant C > 0. Thus, we must have mH(ϵ, δ) ≥ C(d + log(2/δ))/ϵ². This is a contradiction, because d can be arbitrarily large while mH(ϵ, δ) has a finite upper bound for every fixed ϵ and δ. Therefore, the function class G cannot be learnable with respect to ψ ◦ d1 whenever there exists a scale γ ∈ (0, 1) such that fatγ(G) = ∞.
Appendix D. Rademacher-Based Proof for Batch Regression
To show that the learnability of each Fk with respect to d1 is sufficient for the learnability of F with respect to ℓp norms for p ≥ 1, we use the fact that ℓp(f(x), y) is K-Lipschitz in its first argument with respect to the ∞-norm, that is,
|ℓp(f(x), y) − ℓp(g(x), y)| ≤ K ‖f(x) − g(x)‖∞,
and use the following bound on the Rademacher complexity of the loss class ℓ ◦ F = {(x, y) ↦ ℓ(f(x), y) | f ∈ F}.
Lemma 33 (Foster and Rakhlin (2019)). Let F ⊆ Y^X be a multioutput function class. For any δ ∈ (0, 1), there exist constants 0 < c < 1 and C > 0 such that
Rn(ℓp ◦ F) ≤ K inf_{α>0} [ 4α + (12C/√n) Σ_{k=1}^K ∫_α^1 √( fat_{cϵ}(Fk) log^{1+δ}(en/ϵ) ) dϵ ].
The result presented here is in fact an intermediate result in Foster and Rakhlin (2019), and we provide a sketch of how their argument can be adapted to our setting.
Proof Note that for f, g ∈ F, we have |ℓp(f(x), y) − ℓp(g(x), y)| ≤ ℓp(f(x), g(x)) ≤ ℓ1(f(x), g(x)) ≤ K ‖f(x) − g(x)‖∞.
Furthermore, we have that |fk(x) − y^k| ≤ 1, so we obtain |ℓp(f(x), y)| ≤ K. Define the normalized ℓp loss as ℓ̄p(f(x), y) := ℓp(f(x), y)/K. By a standard chaining argument, we know that
Rn(ℓp ◦ F) = K Rn(ℓ̄p ◦ F) ≤ K inf_{α>0} [ 4α + (12/√n) ∫_α^1 √( log N2(ℓ̄p ◦ F, ϵ, n) ) dϵ ].
Since a cover with respect to the ‖·‖∞ norm is also a cover with respect to the ‖·‖2 norm, we have log N2(ℓ̄p ◦ F, ϵ, n) ≤ log N∞(ℓ̄p ◦ F, ϵ, n). Since ℓ̄p(f(x), y) is 1-Lipschitz with respect to the ‖·‖∞ norm, following Lemma 1 of Foster and Rakhlin (2019), we obtain log N∞(ℓ̄p ◦ F, ϵ, n) ≤ Σ_{k=1}^K log N∞(Fk, ϵ, n). A result due to Rudelson and Vershynin (2006) states that for any δ ∈ (0, 1), there exist constants 0 < ck < 1 and Ck > 0 such that log N∞(Fk, ϵ, n) ≤ Ck fat_{ckϵ}(Fk) log^{1+δ}(en/ϵ). Picking C = max_k Ck and c = min_k ck, we obtain the contraction inequality
Rn(ℓp ◦ F) ≤ K inf_{α>0} [ 4α + (12C/√n) ∫_α^1 √( Σ_{k=1}^K fat_{cϵ}(Fk) log^{1+δ}(en/ϵ) ) dϵ ].
Using √( Σ_{k=1}^K fat_{cϵ}(Fk) log^{1+δ}(en/ϵ) ) ≤ Σ_{k=1}^K √( fat_{cϵ}(Fk) log^{1+δ}(en/ϵ) ) yields the desired contraction inequality.
With Lemma 33 in our repertoire, the sufficiency proof is a routine uniform convergence argument.
Proof (of sufficiency in Theorem 12) Suppose each restriction Fk is learnable with respect to d1. Then, we know that for all k ∈ [K] and for all γ ∈ (0, 1), we have fatγ(Fk) < ∞ (Bartlett et al., 1996; Anthony and Bartlett, 1999, Chapter 19). Using Lemma 33 with δ = 1/2, we can find constants c, C such that
Rn(ℓp ◦ F) ≤ K inf_{α>0} [ 4α + (12C/√n) Σ_{k=1}^K ∫_α^1 √( fat_{cϵ}(Fk) log^{3/2}(en/ϵ) ) dϵ ].
Fix α > 0. The second term inside the infimum vanishes as n → ∞, yielding lim_{n→∞} Rn(ℓp ◦ F) ≤ 4αK. As α > 0 is arbitrary, the Rademacher complexity Rn(ℓp ◦ F) goes to 0 as n → ∞. This argument can readily be turned into non-asymptotic bounds on Rn(ℓp ◦ F) if the precise form of fatγ(Fk) as a function of γ is known. Since the empirical Rademacher complexity vanishes, uniform convergence holds over the loss class ℓp ◦ F, and thus F is learnable with respect to ℓp via empirical risk minimization.
Appendix E. Online Multilabel Learnability with respect to Hamming Loss
In this section, we provide the proof of Theorem 15.
Proof We first prove that the online learnability of each restriction is sufficient for the online learnability of F with respect to ℓH.
Part 1: Sufficiency. Our proof is based on a reduction: given oracle access to online learners {Ak}_{k=1}^K for {Fk}_{k=1}^K with respect to ℓ0-1, we construct an online learner A for F with respect to ℓH. In fact, similar to the batch setting, the online multilabel learning algorithm A is simple: in each round t ∈ [T], receive xt, query the predictions A1(xt), . . . , AK(xt), and finally predict the concatenation ŷt = (A1(xt), . . . , AK(xt)). Once the true label yt = (y_t^1, . . . , y_t^K) is revealed, update each online learner Ak by passing it (xt, y_t^k). It suffices to show that the expected regret of A is sublinear in T with respect to ℓH. By Definition 5, we have that for all k ∈ [K],
E[ Σ_{t=1}^T 1{Ak(xt) ≠ y_t^k} ] − inf_{fk∈Fk} Σ_{t=1}^T 1{fk(xt) ≠ y_t^k} ≤ Rk(T),
where Rk(T) is some sublinear function of T. Summing these regret bounds across all k ∈ [K], splitting up the expectations, and using linearity of expectation, we get
E[ Σ_{k=1}^K Σ_{t=1}^T 1{Ak(xt) ≠ y_t^k} ] − Σ_{k=1}^K inf_{fk∈Fk} Σ_{t=1}^T 1{fk(xt) ≠ y_t^k} ≤ Σ_{k=1}^K Rk(T).
Noting that Σ_{k=1}^K inf_{fk∈Fk} Σ_{t=1}^T 1{fk(xt) ≠ y_t^k} ≤ inf_{f∈F} Σ_{k=1}^K Σ_{t=1}^T 1{fk(xt) ≠ y_t^k}, swapping the order of summation, and using the definition of ℓH, we have that
E[ Σ_{t=1}^T ℓH(A(xt), yt) ] − inf_{f∈F} Σ_{t=1}^T ℓH(f(xt), yt) ≤ Σ_{k=1}^K Rk(T),
where A(xt) = (A1(xt), . . . , AK(xt)). This concludes the proof of this direction, since Σ_{k=1}^K Rk(T) is still a sublinear function of T.
Part 2: Necessity. Next, we prove that if F is online learnable with respect to ℓH, then each Fk is online learnable with respect to ℓ0-1. Namely, given oracle access to an online learner A for F with respect to ℓH, we construct an online learner B for F1 with respect to ℓ0-1. A similar reduction can be used to construct online learners for each restriction Fk.
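The Part 1 reduction can be written in a few lines. A minimal sketch, assuming each base online learner exposes predict(x) and update(x, y_k) methods (an interface we introduce for illustration; the paper does not fix one), together with a toy follow-the-last-label base learner:

```python
class ConcatenatedOnlineLearner:
    """Predict the concatenation of K per-coordinate online learners;
    on feedback, route coordinate k of the revealed label to learner k."""
    def __init__(self, learners):
        self.learners = learners
    def predict(self, x):
        return [A.predict(x) for A in self.learners]
    def update(self, x, y):
        for A, y_k in zip(self.learners, y):
            A.update(x, y_k)

def hamming_loss(y_hat, y):
    # Number of coordinates on which the prediction disagrees.
    return sum(a != b for a, b in zip(y_hat, y))

class FollowLastLabel:
    # Hypothetical base learner: predict the most recently observed
    # label for this coordinate, defaulting to -1.
    def __init__(self):
        self.last = -1
    def predict(self, x):
        return self.last
    def update(self, x, y_k):
        self.last = y_k
```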
Similar to the batch setting, the online learning algorithm B is simple: in each round t ∈ [T], receive xt, query ŷt = A(xt), and predict ŷ_t^1 = A1(xt). Once the true label y_t^1 is revealed, update A by passing it (xt, yt), where yt = (y_t^1, σ_t^2, . . . , σ_t^K) and {σ_t^i}_{i=2}^K is an i.i.d. sequence of Rademacher random variables. It suffices to show that the expected regret of B is sublinear in T with respect to ℓ0-1. As previously mentioned, we assume that the sequence (x1, y_1^1), . . . , (xT, y_T^1) is chosen by an oblivious adversary, and thus is not random. Let yt = (y_t^1, σ_t^2, . . . , σ_t^K). By Definition 5, we have that
E[ Σ_{t=1}^T ℓH(A(xt), yt) ] − E[ inf_{f∈F} Σ_{t=1}^T ℓH(f(xt), yt) ] ≤ R(T, 2^K),
where the expectation is over both the randomness of A(xt) and of (σ_t^2, . . . , σ_t^K), and R(T, 2^K) is a sublinear function of T. Splitting up the expectation, using the definition of the Hamming loss, and by linearity of expectation, we have that
Σ_{t=1}^T Σ_{k=1}^K E[1{Ak(xt) ≠ y_t^k}] − inf_{f∈F} Σ_{t=1}^T Σ_{k=1}^K E[1{fk(xt) ≠ y_t^k}] ≤ R(T, 2^K).
Next, observe that for every t ∈ [T] and every k ∈ {2, . . . , K}, the randomness of y_t^k = σ_t^k implies E[1{Ak(xt) ≠ y_t^k}] = E[1{fk(xt) ≠ y_t^k}] = 1/2. Therefore,
Σ_{t=1}^T E[1{A1(xt) ≠ y_t^1}] + (K − 1)T/2 ≤ inf_{f1∈F1} Σ_{t=1}^T 1{f1(xt) ≠ y_t^1} + (K − 1)T/2 + R(T, 2^K).
Canceling the constant terms gives E[ Σ_{t=1}^T 1{A1(xt) ≠ y_t^1} ] − inf_{f1∈F1} Σ_{t=1}^T 1{f1(xt) ≠ y_t^1} ≤ R(T, 2^K), showing that B is an online agnostic learner for F1 with respect to ℓ0-1.
Appendix F. MCLdim Characterizes Online Multilabel Learnability
In this section, we show that the MCLdim characterizes the online learnability of a multilabel function class F ⊆ Y^X with respect to any loss ℓ that satisfies the identity of indiscernibles. Theorem 34 makes this more precise.
Theorem 34. Let ℓ be any loss function satisfying the identity of indiscernibles. A function class F ⊆ Y^X is online learnable with respect to ℓ if and only if MCLdim(F) < ∞.
Proof (of sufficiency) We first show that finiteness of MCLdim is sufficient for online learnability. The proof follows exactly like the proof of Lemma 16.
We include it here again for completeness' sake. Let ℓ be any loss function satisfying the identity of indiscernibles and F ⊆ Y^X a multilabel function class such that MCLdim(F) = d < ∞. Since F has finite MCLdim, the deterministic Multiclass Standard Optimal Algorithm for F, hereinafter denoted MCSOA(F), achieves mistake bound d in the realizable setting (Daniely et al., 2011). Therefore, following the same procedure as in Daniely et al. (2011), we can construct a finite set of experts E of size |E| = Σ_{j=0}^d (T choose j) |im(F)|^j ≤ (2^K T)^d such that for any (oblivious) sequence of instances x1, . . . , xT and for any function f ∈ F, there exists an expert Ef ∈ E such that f(xt) = Ef(xt) for all t ∈ [T]. Finally, running the celebrated Randomized Exponential Weights Algorithm (REWA) using E as the set of experts and the scaled loss function ℓ/B ∈ [0, 1] guarantees that for any labelled sequence (x1, y1), . . . , (xT, yT),
Σ_{t=1}^T E[ℓ(ŷt, yt)] − inf_{E∈E} Σ_{t=1}^T ℓ(E(xt), yt) ≤ B √(T ln |E|),
and therefore
Σ_{t=1}^T E[ℓ(ŷt, yt)] − inf_{f∈F} Σ_{t=1}^T ℓ(f(xt), yt) ≤ O( B √(dT ln(2^K T)) ),
where ŷt is the prediction of REWA in the t-th round. Thus, running REWA over the set of experts E using ℓ/B gives an online learner for F with respect to ℓ.
Proof (of necessity) To prove necessity, we need to show that if F is online learnable with respect to ℓ, then MCLdim(F) < ∞. To do so, we show that if F is online learnable with respect to ℓ, then F is online learnable with respect to ℓ0-1 in the realizable setting. Let A be an online learner for F with respect to ℓ. Then, by definition,
E[ Σ_{t=1}^T ℓ(A(xt), yt) ] − inf_{f∈F} Σ_{t=1}^T ℓ(f(xt), yt) ≤ R(T, 2^K),
where R(T, 2^K) is a sublinear function of T. Since ℓ satisfies the identity of indiscernibles, in the realizable setting, inf_{f∈F} Σ_{t=1}^T ℓ(f(xt), yt) = 0. Therefore, under realizability,
E[ Σ_{t=1}^T ℓ(A(xt), yt) ] ≤ R(T, 2^K).
Because there are only a finite number of inputs to ℓ, there must exist a universal constant a > 0 such that a · ℓ0-1 ≤ ℓ.
Substituting this in gives
E[ Σ_{t=1}^T ℓ0-1(A(xt), yt) ] ≤ R(T, 2^K)/a.
Since a is a universal constant that does not depend on T, R(T, 2^K)/a is still a sublinear function of T, implying that A is also a realizable online learner for F with respect to ℓ0-1. This completes the proof, as MCLdim characterizes realizable online learnability, and so we must have MCLdim(F) < ∞.
Appendix G. Proofs for Bandit Online Multilabel Classification
In this section, we provide proofs for the characterization of online multilabel classification under bandit feedback.
Proof (of Theorem 19) The proof of Theorem 19 is nearly identical to the proof of Theorem 13. The only difference is that in Algorithm 4, we now need to use the bandit experts algorithm EXP4 of Auer et al. (2002) instead of REWA. Similar to REWA, based on Theorem 2.3 in Daniely and Helbertal (2013) and the fact that A, B, and P are independent, EXP4 guarantees that
E[ Σ_{t=1}^T ℓ(P(xt), yt) ] − inf_{E∈EB} Σ_{t=1}^T ℓ(E(xt), yt) ≤ O( √(2T|Y| ln |EB|) ),
where P(xt) denotes EXP4's prediction in round t. The remaining proof for deriving the upper bound
inf_{E∈EB} Σ_{t=1}^T ℓ(E(xt), yt) ≤ Σ_{t=1}^T ℓ(f(xt), yt) + c (T/Tβ) R(Tβ, |Y|)
is identical to that in Theorem 13, so we omit it here. Putting these pieces together gives the stated guarantee.
Proof (of Theorem 18) Let c = max_{r≠t} ℓ(r, t) / min_{r≠t} ℓ(r, t). We first show necessity: if F is bandit online learnable with respect to ℓ, then each restriction Fk is online learnable with respect to ℓ0-1. This follows trivially from the fact that if A is a bandit online learner for F, then A is also an online learner for F under full feedback. Thus, by Theorem 17, online learnability of F with respect to ℓ implies online learnability of each restriction Fk with respect to the 0-1 loss. We now focus on showing sufficiency: if for all k ∈ [K], Fk is online learnable with respect to the 0-1 loss, then F is bandit online learnable with respect to the loss ℓ.
Since |Y| = 2^K < ∞ and ℓ is c-subadditive, by Theorem 19, it suffices to show that there exists a realizable online learner for F with respect to ℓ. However, using Theorem 17, online learnability of each restriction Fk with respect to the 0-1 loss implies (agnostic) online learnability of F with respect to ℓ. Since an agnostic online learner is trivially a realizable online learner, the proof is complete.
Appendix H. Equivalence of d1 and ψ ◦ d1 Online Learnability
Proof (of Lemma 22) Since ψ is a Lipschitz function, the proof of sufficiency follows immediately from Corollary 5 in Rakhlin et al. (2015a), a contraction lemma for the sequential Rademacher complexity. Thus, we focus on proving necessity: if G is online learnable with respect to ψ ◦ d1, then G is online learnable with respect to d1. Since the sequential fat-shattering dimension of G characterizes its d1 online learnability (Rakhlin et al., 2015a), it suffices to show that G being online learnable with respect to ψ ◦ d1 implies fatseq_γ(G) < ∞ for every γ ∈ (0, 1). Like in the batch setting, we prove this via contradiction.
Suppose, for the sake of contradiction, that G is online learnable with respect to ψ ◦ d1 but there exists a scale γ ∈ (0, 1) such that fatseq_γ(G) = ∞. Then, for every T ∈ N, there exist an X-valued binary tree T and a [0, 1]-valued binary witness tree R, both of depth T, such that for all σ ∈ {−1, 1}^T, there exists gσ ∈ G such that for all t ∈ [T], σt(gσ(Tt(σ)) − Rt(σ)) ≥ γ. Following the same reduction as in the batch setting, one shows that whenever there exists a scale γ > 0 such that fatseq_γ(G) = ∞, there cannot exist an online learner for G with respect to ψ ◦ d1.
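As a final concrete aside, the tree-path notation Tt(σ) from Appendix A.2, which the shattering condition above relies on, can be made explicit. A sketch (our own illustration; the dict-of-prefixes encoding of the labelling functions is an implementation choice, not the paper's):

```python
def path_value(tree, sigma, t):
    """Return T_t(sigma): the label of the depth-t node reached by
    following the first t-1 signs of sigma from the root. The tree is
    encoded as a list (T_1, ..., T_d) of dicts mapping sign prefixes
    (tuples in {-1, 1}^{t-1}) to node labels."""
    return tree[t - 1][tuple(sigma[:t - 1])]

# Depth-2 tree: root labeled "a"; right child (sigma_1 = +1) labeled
# "b", left child (sigma_1 = -1) labeled "c".
tree = [{(): "a"}, {(1,): "b", (-1,): "c"}]
```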