# a_characterization_of_multioutput_learnability__9e6a8a6e.pdf

Journal of Machine Learning Research 25 (2024) 1-54 Submitted 6/23 ; Revised 6/24; Published 10/24

A Characterization of Multioutput Learnability

Vinod Raman vkraman@umich.edu Department of Statistics University of Michigan Ann-Arbor, MI 48109, USA

Unique Subedi subedi@umich.edu Department of Statistics University of Michigan Ann-Arbor, MI 48109, USA

Ambuj Tewari tewaria@umich.edu Department of Statistics Department of Electrical Engineering and Computer Science University of Michigan Ann-Arbor, MI 48109, USA

Editor: Gergely Neu

We consider the problem of learning multioutput function classes in the batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classiﬁcation and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a similar characterization as in the full-feedback setting. Keywords: Learnability, Online Learning, Multilabel Classiﬁcation, Multioutput Regression

1. Introduction

Multioutput learning is a problem where an instance is labeled by a vector-valued target. This is a generalization of scalar-valued-target learning settings such as multiclass classiﬁcation and regression. Multioutput learning has enjoyed a wide range of practical applications like image tagging, document categorization, recommender systems, an weather forecasting to name a few. This widespread applicability has motivated the development of several practical methods (Kapoor et al., 2012; Borchani et al., 2015; Yang et al., 2020; Xu et al., 2013; Nam et al., 2017), as well as theoretical analysis (Koyejo et al., 2015b; Liu and Tsang, 2015). However, the most fundamental question of learnability in a multioutput setting remains unanswered. Characterizing learnability is the foundational step toward understanding any statistical learning problem. The fundamental theorem of statistical learning characterizes the learn-

. Equal contribution

c 2024 Vinod Raman, Unique Subedi, and Ambuj Tewari.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0787.html.

Raman, Subedi, and Tewari

ability of binary function class in terms of the ﬁniteness of a combinatorial quantity called the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1971, 1974). Extending VC theory, Natarajan (1989) proposed and studied the Natarajan dimension, which was later shown by Ben-David et al. (1995) to characterize learnability in multiclass settings with a ﬁnite number of labels. In fact, Ben-David et al. (1995) observed that the learnability of multiclass function class F {e1, . . . , e K}X can also be characterized in terms of the learnability of each component-wise binary function class Fk = {x 7 f(x), ek : f F}, where {e1, . . . , e K} is the standard basis of RK and , is the Euclidean inner-product. Furthermore, they deﬁne an equivalence class of loss functions for the 0-1 loss and characterize the learnability of multiclass problems with respect to all losses in the equivalence class. We take a similar approach in this paper. Closing the question of learnability for multiclass problems, Brukhim et al. (2022) shows that the Daniely-Shwartz (DS) dimension, originally proposed by Daniely and Shalev-Shwartz (2014), characterizes multiclass learnability in the inﬁnite labels setting. Similarly, in the online setting, the Littlestone dimension (Littlestone, 1987) characterizes the online learnability of a binary function class and a generalization of the Littlestone dimension (Daniely et al., 2011) characterizes online learnability in the multiclass setting with ﬁnite labels. As for scalar-valued regression, the fat shattering (Bartlett et al., 1996) and the sequential fat shattering (Rakhlin et al., 2015a) dimensions characterize batch and online learnability respectively. Surprisingly, to our best knowledge, no such characterization of the learnability of multioutput function classes exists in the literature. In this paper, we close this gap by characterizing the learnability of function classes F YX , where Y RK is vector-valued target space for some K N. Let us deﬁne scalarvalued function classes Fk = {x 7 f(x), ek : f F} for each k [K], where {e1, . . . , e K} is the standard basis of RK. Similarly, deﬁne Yk := { y, ek : y Y}. Our main result, informally stated below, asserts that F is learnable if and only if each coordinate restriction Fk is learnable.

Theorem. ( Informal) A multioutput function class F YX is learnable if and only if each restriction Fk YX k is learnable.

We prove a version of this result in four canonical settings: batch classiﬁcation, online classiﬁcation, batch regression, and online regression. For the batch settings, we consider the PAC framework and for the online settings, we consider the fully adversarial model. In addition, our result holds for a wide family of loss functions. A unifying theme throughout all four learning settings is our ability to constructively convert a learning algorithm A for F into learning algorithm Ak for Fk for each k {1, ..., K} and vice versa. We show that even for multioutput losses that tightly couple the K coordinates of a function class, their learnability still depends on the learnability of each coordinate. For the batch setting, our algorithmic techniques use the realizable-to-agnostic conversion introduced by Hopkins et al. (2022). In the online setting, we provide a new realizable-to-agnostic conversion similar in the spirit of Hopkins et al. (2022). In principle, both ours and Hopkins et al. (2022) s realizable-to-agnostic conversion is based on the idea of using algorithms to construct a cover of function classes, originally introduced in the seminal work of Ben-David et al. (2009).

Multioutput Learnability

Our proof techniques, however, do not extend naturally to the case when K is inﬁnite. So, characterizing learnability for an inﬁnite-dimensional target space is an interesting open problem. Moreover, our reductions are computationally ineﬃcient and lead to sub-optimal sample complexities and regret bounds. As such, we leave the construction of eﬃcient and optimal multioutput learning algorithms as an interesting direction for the future work.

1.1 Related works

Multilabel classiﬁcation has been extensively studied in the batch setting. We review a few works here and also refer the reader to the references therein. Dembczy nski et al. (2010) quantify the 0-1 risk of multilabel classiﬁers trained by minimizing the Hamming loss and vice versa. Dembczy nski et al. (2012) and Chekina et al. (2013) study how exploiting dependencies between labels can improve the predictive performance of multilabel classiﬁers and how such exploitation interacts with loss minimization. Jain et al. (2016) consider the case where the label set is extremely large and design new loss functions to handle these settings. Busa-Fekete et al. (2022) derive upper and lower bounds on the excess risk for non-parametric and parametric function classes for various loss functions assuming label sparsity. Gao and Zhou (2011) and Koyejo et al. (2015a) study the consistency of surrogate loss functions for multilabel classiﬁcation. Finally, Gentile and Orabona (2012) consider online multilabel classiﬁcation under partial feedback and present a novel algorithm based on second-order descent methods.

There is also a long history of studying least squares estimators for multioutput linear models in the statistical literature, see (Rao, 1965; Brown and Zidek, 1980) and references therein. The topic received widespread attention in learning theory following the seminal work of Micchelli and Pontil (2005) in RKHS methods for vector-valued regression. We refer the reader to a comprehensive review of kernel methods for vector-valued regression by Alvarez et al. (2012). An early work of Gnecco and Sanguineti (2008) provides estimation and approximation error of vector-valued functions using Rademacher complexity. Following the inﬂuential work of Maurer (2016) on Rademacher contraction inequalities for vector-valued functions, there have been works on the Rademacher analysis of vector-valued functions (see Cortes et al. (2016); Reeve and Kaban (2020); Youseﬁet al. (2018); Foster and Rakhlin (2019)). Finally, we point out a recent work by Park and Muandet (2023) towards developing empirical process theory for vector-valued functions.

2. Preliminaries

Let X denote the instance space and Y RK be the target space for some K N. For a space Z, we let Z be the set of all ﬁnite sequences of elements from Z. Consider a vector-valued function class F YX , where YX denotes set of all functions from X to Y. For an unlabeled sample SU X , let F|SU denote the projection of F onto SU. Let , denote the Euclidean inner-product on RK. Deﬁne scalar-valued function classes Fk = {x 7 f(x), ek : f F} for each k [K], where {e1, . . . , e K} is the standard basis of RK. Here, each Fk YX k , where Yk = { y, ek : y Y} denotes the restriction of the target space to its kth component.

Raman, Subedi, and Tewari

For a function f F, we use fk(x) := f(x), ek to denote the kth coordinate output of f(x). On the other hand, we use yk := y, ek to denote kth coordinate of y Y. Additionally, it is useful to distinguish between the range space Y and the image of functions f F. We deﬁne the image of function class F as im(F) := f F im(f), where im(f) = {f(x) : x X}. Finally, we take [N] := {1, 2, . . . , N}. In this work, we only consider bounded, non-negative loss functions ℓ: Y Y R 0 that satisfy the identity of the indiscernibles. For the remainder of the paper, we drop the adjectives bounded and non-negative when referring to loss functions.

Deﬁnition 1 (Identity of Indiscernibles). A loss function ℓ: Y Y R 0 satisﬁes identity of indiscernibles whenever ℓ(y1, y2) = 0 if and only if y1 = y2.

Note that if ℓ1 and ℓ2 are two losses deﬁned on Y Y that satisfy the identity of indiscernibles, then ℓ1(y1, y2) = 0 if and only if ℓ2(y1, y2) = 0. We also deﬁne a notion of approximate subadditivity, although not all the loss functions we consider have this property.

Deﬁnition 2 (c-subadditive). A loss function ℓis c-subadditive if there exists a constant c > 0 that only depends on the loss function ℓsuch that ℓ(y1, y2) c ℓ(y1, y) + ℓ(y, y2) for all y, y1, y2 Y.

If |Y| < , ℓbeing c-subadditive is an immediate consequence of ℓsatisfying the identity of indiscernibles. In fact, the value of c in this case is maxr =t ℓ(r,t)

minr =t ℓ(r,t) . To see why this is true, it suﬃces to only consider the case when ℓ(y1, y2) > ℓ(y, y2) because the inequality is trivially true otherwise. Since the loss values are distinct, we must have y = y1. Using the identity of indiscernible, we obtain ℓ(y1, y) minr =t ℓ(r, t), thus implying that c ℓ(y1, y) maxr =t ℓ(r, t) ℓ(y1, y2). The case when |Y| = is a bit delicate because minr =t ℓ(r, t) may not exist. So, one needs extra structure in the loss function to infer c-subadditivity. For instance, if ℓis a distance metric, then it is trivially 1-subadditive due to the triangle inequality.

2.1 Batch Setting

In the batch setting, we are interested in characterizing the learnability of F under the classical PAC models: both in the original realizable formulation (Valiant, 1984) and in the agnostic extension (Kearns et al., 1994).

Deﬁnition 3 (Agnostic Multioutput Learnability). A function class F is agnostic learnable with respect to loss ℓ: Y Y R 0, if there exists a function m : (0, 1)2 N and a learning algorithm A : (X Y) YX with the following property: for every ϵ, δ (0, 1) and for every distribution D on X Y, running algorithm A on n m(ϵ, δ) iid samples from D outputs a predictor g = A(S) such that with probability at least 1 δ over S Dn,

ED[ℓ(g(x), y)] inf f F ED[ℓ(f(x), y)] + ϵ.

Note that we do not require the output predictor A(S) to be in F, but only require A(S) to compete with the best predictor in F. If we restrict distribution D to a class such that inff F ED[ℓ(f(x), y)] = 0, then we get realizable learnability.

Multioutput Learnability

The learnability of a function class is generally characterized in terms of the complexity measure of the function class. As stated in the introduction, the VC dimension characterizes the learnability of binary function classes (Vapnik and Chervonenkis, 1974), and so does the Natarajan dimension for multiclass classiﬁcation (Ben-David et al., 1995). In a scalarvalued regression problem, the fat-shattering dimension of the function class characterizes learnability with respect to the absolute-value and squared loss (Bartlett et al., 1996; Alon et al., 1997). In particular, a real-valued function class G [0, B]X is learnable if and only if its fat-shattering dimension, denoted as fatγ(G), is ﬁnite for every scale γ > 0. We extend this characterization to a wide range of loss functions in Lemma 11. Finally, for a realvalued class G, we also use a more general notion of complexity measure called Rademacher complexity, denoted as Rn(G), that provides a suﬃcient condition for learnability (Bartlett and Mendelson, 2003). Precise deﬁnitions of all these complexity measures are provided in Appendix A.1. One recurring theme in this work is to ﬁrst construct a realizable multioutput learner and then convert it into an agnostic multioutput learner. It is well known that realizable learnability and agnostic learnability are equivalent for multiclass classiﬁcation problems with respect to 0-1 loss, (see Ben-David et al. (1995), (Shalev-Shwartz and Ben-David, 2014, Theorem 6.7)). Lemma 4, which is an immediate consequence of (Hopkins et al., 2022, Theorem 18), extends this equivalence between realizable and agnostic learning to general loss functions and target spaces.

Lemma 4 (Hopkins et al. (2022)). Consider a function class F YX such that |im(F)| < and a general loss function ℓ: Y Y R 0 that is c-subadditive. Then, F is realizable PAC learnable with respect to ℓif and only if F is agnostic PAC learnable with respect to ℓ.

The result of (Hopkins et al., 2022, Theorem 18) is stated for the case when |Y| < . However, in the regression setting, we need a slightly general version of their result to handle α-discretized function classes Fα YX (see Proof of Theorem 12) where |im(Fα)| < but |Y| = | [0, 1]K| = . Nevertheless, the proof of Lemma 4 requires only a minor modiﬁcation to that of (Hopkins et al., 2022, Theorem 18). Given the central role of this result in our characterization, we provide full proof of Lemma 4 in Section 3.1.2. Finally, we note that agnostic-to-realizable conversions in the batch setting are also possible via boosting and compression-based arguments (Montasser et al., 2019; Attias and Hanneke, 2023). We use the conversion of Hopkins et al. (2022) due to its generality and simplicity.

2.2 Online Setting

In the online setting, an adversary plays a sequential game with the learner over T rounds. In each round t [T], an adversary selects a labeled instance (xt, yt) X Y and reveals xt to the learner. The learner makes a (potentially randomized) prediction ˆyt Y. Finally, the adversary reveals the true label yt, and the learner suﬀers the loss ℓ(yt, ˆyt), where ℓis some pre-speciﬁed loss function. Given a function class F YX , the goal of the learner is to output predictions ˆyt such that its cumulative loss is close to the best possible cumulative loss over functions in F. A function class is online learnable if there exists an algorithm such that for any sequence of labeled examples (x1, y1), ..., (x T , y T ), the diﬀerence in cumulative loss between its predictions and the predictions of the best possible function in F is small.

Raman, Subedi, and Tewari

Deﬁnition 5 (Online Multioutput Learnability). A multioutput function class F is online learnable with respect to loss ℓ, if there exists an (potentially randomized) algorithm A such that for any adaptively chosen sequence of labelled examples (xt, yt) X Y, the algorithm outputs A(xt) Y at every iteration t [T] such that

t=1 ℓ(A(xt), yt) inf f F

t=1 ℓ(f(xt), yt)

where the expectation is taken with respect to the randomness of A and that of the possibly adaptive adversary, and R(T) : N R+ is the additive regret: a non-decreasing, sub-linear function of T.

If it is guaranteed that the learner always observes a sequence of examples labeled by some function f F, then we say we are in the realizable setting. On the other hand, if the true label yt is not revealed to the learner in each round t [T] and the adversary only reveals the learner s loss ℓ(A(xt), yt) then we say we are in the bandit setting. The online learnability of scalar-valued function classes H YX has been characterized. For example, when Y is ﬁnite (i.e. Y = [K] for some K N), the Multiclass Littlestone Dimension (MCLdim) of H YX characterizes online learnability with respect to the 01 loss. A function class H is online learnable with respect to the 0-1 loss if and only if MCLdim(H) is ﬁnite (Daniely et al., 2011). Moreover, MCLdim(H) tightly captures the best achievable regret in both the realizable and agnostic settings (Daniely et al., 2011). When the label space Y is a bounded subset of R, the sequential fat-shattering dimension of a real-valued function class H RX , denoted fatseq γ (H), characterizes the online learnability of H with respect to the absolute value loss (Rakhlin et al., 2015a). Unlike the Littlestone dimension, note that the sequential fat-shattering dimension is a scale-sensitive dimension. That is, fatseq γ (H) is deﬁned at every scale γ > 0. Accordingly, a real-valued function class H RX is learnable with respect to the absolute value loss if and only if its sequential fat-shattering dimension is ﬁnite at every scale γ > 0 (Rakhlin et al., 2015a). We extend this characterization to a wide range of loss functions in Lemma 22. Beyond scalar-valued learnability, for any label space Y, function class H YX , and loss function ℓ: Y Y R 0, the sequential Rademacher complexity of the loss class ℓ H, denoted Rseq T (ℓ H), is a useful tool for providing suﬃcient conditions for online learnability (Rakhlin et al., 2015a). See Appendix A.2 for complete deﬁnitions. Like the batch setting, a key technique we use to prove online learnability is to ﬁrst construct a realizable online learner and then convert it into an agnostic online learner. However, unlike the batch setting, there is no known generic algorithm that converts a (potentially randomized) realizable online learner into an agnostic online learner. Thus, one of the contributions of this work is Theorem 13, informally stated below, which provides an online analog of the realizable-to-agnostic conversion from Hopkins et al. (2022).

Theorem. (Informal) Let F YX be a multioutput function class such that |im(F)| < and ℓ: Y Y R 0 be any bounded, c-subadditive loss function. If F is online learnable with respect to ℓin the realizable setting, then F is online learnable with respect to ℓin the agnostic setting.

Multioutput Learnability

3. Batch Multioutput Learnability

3.1 Batch Multilabel Classiﬁcation

In this section, we study the learnability of batch multilabel classiﬁcation. Accordingly, let Y = { 1, 1}K. First, we consider the learnability of a natural decomposable loss. Then, we extend the result to more general non-decomposable losses that satisfy the identity of indiscernible. We want to point out that a multilabel classiﬁcation with K labels can be viewed as a multiclass classiﬁcation with 2K labels. With this viewpoint, the Natarajan dimension of F continues to characterize the batch multilabel learnability for any loss satisfying the identity of indiscernibles (see (Ben-David et al., 1995, Section 4)). For the sake of completeness, we also provide proof of this characterization in Appendix B. However, it is more natural to view a multilabel classiﬁcation as K diﬀerent binary classiﬁcation problems as opposed to a multi-class classiﬁcation problem with 2K labels. Exploiting this natural decomposability of a multilabel function class, we relate the learnability of F to the learnability of each component Fk.

3.1.1 Characterizing Batch Learnability for the Hamming Loss

A canonical and natural loss function for multilabel classiﬁcation is the Hamming loss, deﬁned as ℓH(f(x), y) := PK i=1 1 fi(x) = yi , where f(x) = (f1(x), . . . , f K(x)) and y = (y1, . . . , y K). The following result establishes an equivalence between the learnability of F with respect to Hamming loss and the learnability of each Fk with respect to 0-1 loss.

Theorem 6. A function class F YX is agnostic PAC learnable with respect to ℓH if and only if each of Fk YX k is agnostic PAC learnable with respect to the 0-1 loss.

Proof We ﬁrst prove that learnability of each component is suﬃcient followed by the proof of necessity. Part 1: Suﬃciency. Here our goal is to prove that the agnostic PAC learnability of each Fk is suﬃcient for agnostic PAC learnability of F. Our proof is constructive: given oracle access to agnostic PAC learners Ak for each Fk with respect to 0-1 loss, we construct an agnostic PAC learner A for F with respect to ℓH. Let D be arbitrary distribution on X Y and S = {(xi, yi)}n i=1 Dn be iid samples from distribution D. Denote Dk to be the marginal distribution of D restricted to X Yk. Then, for all k [K], the marginal samples Sk = {(xi, yk i )}n i=1 with scalar-valued targets are iid samples from Dk. For each k [K], deﬁne hk = Ak(Sk) to be the hypothesis returned by algorithm Ak when trained on Sk. Let mk(ϵ, δ) denote the sample complexity of Ak. Since Ak is an agnostic PAC learner for Fk, we have that for n maxk mk( ϵ

K ), with probability at least 1 δ/K over samples Sk Dn k,

EDk h 1 n hk(x) = ykoi inf fk Fk EDk h 1 n fk(x) = ykoi + ϵ

Summing these risk bounds over all coordinates k and using union bounds over probabilities, we get that with probability at least 1 δ over samples S Dn, we obtain PK k=1 EDk 1 hk(x) = yk PK k=1 inffk Fk EDk 1 fk(x) = yk + ϵ. Now using the fact

Raman, Subedi, and Tewari

that F F1 . . . FK followed by the linearity of expectation gives

k=1 1 n hk(x) = yko#

k=1 1 n fk(x) = yko#

This completes our proof as it shows that the learning rule that concatenates the predictors returned by each Ak on the marginalized samples Sk is an agnostic PAC learner for F with respect to ℓH with sample complexity at most maxk mk(ϵ/K, δ/K). Part 2: Necessity. Next, we show that if F is learnable with respect to ℓH, then each Fk is PAC learnable with respect to the 0-1 loss. Our proof is again based on reduction: given oracle access to agnostic PAC learner A for F, we construct an agnostic PAC learner A1 for F1. A similar construction can be used for all other Fk s. Let D1 be arbitrary distribution on X Y1 and S = {(xi, y1 i )}n i=1 be iid samples from D1. In order to use the algorithm A, we ﬁrst augment the samples S to create samples with K-variate target. Deﬁne an augmented sample e S = {(xi, (y1 i , . . . , y K i ))}n i=1 such that yik { 1, 1} each with probability 1/2 for all i [n] and k {2, . . . , K} . Next, we run A on e S and obtain the hypothesis h = (h1, . . . , h K) = A(e S). We now show that h1 obtains agnostic PAC bounds. Consider a distribution e D on X Y such that a sample (x, (y1, . . . , y K)) from e D is obtained by ﬁrst sampling (x, y1) D1 and appending yk s sampled independently from uniform distribution on { 1, 1} for each k {2, . . . , K}. Let m(ϵ, δ, K) denote the sample complexity of A. Since A is an agnostic PAC learner for F, for n m(ϵ, δ, K), with probability at least 1 δ, we have

k=1 1 n hk(x) = yko#

inf f F E e D

k=1 1 n fk(x) = yko#

For k 2, since the target is chosen uniformly at random from { 1, 1}, the 0-1 risk of any predictor is 1/2. Therefore, the expression above can be written as

ED1 1 h1(x) = y1 +

k=2 1/2 inf f F

ED1 1 f1(x) = y1 +

which reduces to ED1[1 h1(x) = y1 ] inff1 F1 ED1[1 f1(x) = y1 ] + ϵ. Therefore, F1 is agnostic PAC learnable with respect to 0-1 loss with sample complexity at most m(ϵ, δ, K).

3.1.2 Characterizing Batch Learnability for General Losses

In this section, we characterize the learnability for general multilabel losses. Our main technical tool in characterizing the learnability for general loss functions is the equivalence between realizable and agnostic learning guaranteed by Lemma 4. Thus, we ﬁrst provide the proof of that lemma before we proceed further. Proof (of Lemma 4) Note that agnostic learnability implies realizable learnability by definition. So, it suﬃces to show that realizable learnability of F with respect to ℓimplies

Multioutput Learnability

agnostic learnability. Our proof here is constructive. That is, given a realizable algorithm A for F, we provide an algorithm, stated as Algorithm 1, that constructs an agnostic learner for F.

Algorithm 1 Agnostic learner for F with respect to ℓ Input: Realizable learner A for F with respect to ℓ, unlabeled samples SU DX , and diﬀerent labeled samples SL D independent from SU 1 Run A over all possible labelings of SU by F to construct a concept class

C(SU) := A SU, f(SU) | f F|SU .

2 Return ˆg C(SU) with the lowest empirical error over SL with respect to ℓ.

Let D be an arbitrary distribution over X Y. Deﬁne

f := inf f F ED[ℓ(f(x), y)]

to be the optimal predictor in F. Now consider a predictor g = A(SU, f (SU)) C(SU) returned by A when trained on samples SU labeled by f . Note that g exists in C(SU) because we consider all possible labelings of SU by F and there must be a sample labeled by f as well. Let m A(ϵ, δ, K) be the sample complexity of the algorithm A. Since A is a realizable learner for F, for |SU| m A ϵ

2, K with probability 1 δ/2 over sample SU DX , we have EDX [ℓ(g(x), f (x))] ϵ

where DX is the marginal distribution of D restricted to X and c is the subaddtivity constant of ℓ. Since the loss function is c-subadditive, for any (x, y) X Y, we have ℓ(g(x), y) ℓ(f (x), y) + c ℓ(g(x), f (x)) pointwise. Taking expectation with respect to (x, y) D, we obtain

ED[ℓ(g(x), y)] c EDX [ℓ(g(x), f (x))] + ED[ℓ(f (x), y)] ED[ℓ(f (x), y)] + ϵ

where the inequality holds with probability 1 δ/2. Thus, we have shown that there exists a predictor g C(SU) that achieves agnostic PAC bounds for F with respect to ℓ. Let ℓ( , ) b be the upper bound on ℓ. Recall that by Hoeﬀding s inequality and union bound, with probability at least 1 δ/2 over sample SL D, the empirical risk of every hypothesis in C(SU) on a sample of size 8b2

ϵ2 log 4|C(SU)|

δ is at most ϵ/4 away from its

population risk. So, if |SL| 8b2

ϵ2 log 4|C(SU)|

δ , then with probability at least 1 δ/2 over sample SL D, we have

(x,y) SL ℓ(g(x), y) ED [ℓ(g(x), y)] + ϵ

4 ED[ℓ(f (x), y)] + 3ϵ

Next, consider the predictor ˆg returned by Algorithm 1. Since it is an empirical risk minimizer, its empirical risk can be at most the empirical risk of g. Given that the population risk of ˆg can be at most ϵ/4 away from its empirical risk, we have that

ED [ℓ(ˆg(x), y)] ED[ℓ(f (x), y)] + 3ϵ

4 inf f F ED[ℓ(f(x), y)] + ϵ,

Raman, Subedi, and Tewari

where the second inequality above uses the deﬁnition of f . Note that this inequality holds with probability at least 1 δ, where the probability is taken over both samples SU and SL. Thus, we have shown that Algorithm 1 is an agnostic PAC learner for F with respect to ℓ. We now upper bound the sample complexity of Algorithm 1, denoted m(ϵ, δ, K) hereinafter. Note that m A(ϵ, δ, K) is at most the number of unlabeled samples required for the realizable algorithm A to succeed plus the number of labeled samples for the ERM step to succeed. Thus,

m(ϵ, δ, K) m A

ϵ2 log 4|C(SU)|

2, K log (|im(F)|) + log 4

where the second inequality follows due to |C(SU)| |im(F)||SU| and we need |SU| to be of size m A ϵ

With Lemma 4 in hand, we can now relate the learnability of H with respect to any ℓ satisfying identity of indiscernibles to the learnability of H with respect to ℓH. To that end, we prove a result establishing the equivalence of learnability between any two loss functions satisfying the identity of indiscernibles.

Lemma 7. Let ℓand ℓ be any two loss functions satisfying the identity of indiscernibles. Then, a function class F YX is agnostic PAC learnable with respect to ℓif and only if F YX is agnostic PAC learnable with respect to ℓ .

The key idea behind the proof of Lemma 7 is to use a realizable learner for ℓto construct a realizable learner for ℓ . This is possible because ℓ (y1, y2) = 0 if and only if ℓ(y1, y2) = 0 for any y1, y2 Y. Given such realizable learner for ℓ , Lemma 4 guarantees the existence of an agnostic learner for ℓ . Proof Since ℓand ℓ are arbitrary, it suﬃces to prove only one direction. So, let us assume that F is learnable with respect to ℓ. We will now show that F is learnable with respect to ℓ as well. First, we show this for any realizable distribution D with respect to ℓ . Since, for any y1, y2 Y, we have ℓ (y1, y2) = 0 if and only if ℓ(y1, y2) = 0, the distribution D is also realizable with respect to ℓ. Furthermore, as there are at most 22K distinct possible inputs to ℓ ( , ), the loss function can only take a ﬁnite number of values. So, we can always ﬁnd universal constants a > 0 and b > 0 (that only depends on ℓand ℓ ) such that aℓ ℓ bℓ. Given that F is learnable with respect to ℓ, there exists a learning algorithm A with the following property: for any ϵ, δ > 0, for S Dn, such that n = m A( ϵ

b, δ, K), the algorithm outputs a predictor h = A(S) such that, with probability 1 δ over S Dn, we have ED[ℓ(h(x), y)] ϵ

b. This inequality upon using the fact that ℓ (h(x), y) bℓ(h(x), y) pointwise reduces to ED[ℓ (h(x), y)] ϵ. Therefore, any realizable learner A for ℓis also a realizable learner for ℓ . Finally, as ℓ satisﬁes the identity of indiscernibles and thus c-subadditivity, Lemma 4 guarantees the existence of agnostic PAC learner B for F with respect to ℓ . In particular, one such agnostic PAC learner B is Algorithm 1 that has sample

Multioutput Learnability

m B(ϵ, δ, K) m A

2, K K + log 1

where c > 1 is the subadditivity constant of ℓ and m A( , , K) is the sample complexity of any realizable algorithm A.

An immediate consequence of Lemma 7 is that learnability with respect to any loss ℓsatisfying the identity of indiscernibles is equivalent to learnability with respect to the Hamming loss ℓH. Thus, given Theorem 6, we can deduce the following result.

Theorem 8. Let ℓbe any multilabel loss function satisfying the identity of indiscernibles. A function class F YX is agnostic PAC learnable with respect to ℓif and only if each restriction Fk YX k is agnostic PAC learnable with respect to the 0-1 loss.

Remark. Since the learnability of a binary function class with respect to the 0-1 loss is characterized by its VC dimension (Shalev-Shwartz and Ben-David, 2014, Theorem 6.7), Theorem 8 implies that F is learnable with respect to ℓsatisfying the identity of indiscernibles if and only if VC(Fk) < for each k [K].

3.2 Batch Multioutput Regression

In this section, we consider the case when Y = [0, 1]K RK for K N. For bounded targets (with a known bound), this target space is without loss of generality because one can always normalize each Yk to [0,1] by subtracting the lower bound and dividing by the upper bound of Yk. As usual, we consider an arbitrary multioutput function class F YX . Following our outline in classiﬁcation, we ﬁrst study the learnability of F under decomposable losses and then non-decomposable losses.

3.2.1 Characterizing Learnability for Decomposable Losses

A canonical loss for the scalar-valued regression is the absolute value metric, d1(fk(x), yk) := |fk(x) yk|. Analogously, we deﬁne dp(fk(x), yk) := |fk(x) yk|p for p > 1 are other natural scalar-valued losses. For multioutput regression, we consider decomposable losses that are natural multivariate extensions of the d1 metric. In particular, we consider loss functions with the following properties.

Assumption 1. The loss can be written as ℓ(f(x), y) = PK k=1 ψk d1(fk(x), yk) where for each k [K], the function ψk : R 0 R 0 is L-Lipschitz and satisﬁes ψk(0) = 0.

Here, ψk d1 is a composition function deﬁned as ψk d1(fk(x), yk) := ψk d1(fk(x), yk) . Note that ψk d1 is a large family of loss functions that eﬀectively contains all natural decomposable multioutput regression losses. For instance, taking ψk(z) = |z|p for p 1 gives ℓp norms raised to their p-th power. Considering ψk(z) = |z|2/2 1[|z| δ] + δ(|z| δ/2)1[|z| > δ] for some 1 > δ > 0 yields multivariate extension of Huber loss used for robust regression. One may also construct a multioutput loss by considering diﬀerent scalar-valued

Raman, Subedi, and Tewari

losses for each coordinate output. Next, we establish an equivalence between the learnability of F YX with respect to ℓand the learnability of each Fk with respect to the loss ψk d1.

Theorem 9. Let ℓbe any loss function that satisﬁes Assumption 1. Then, a function class F YX is agnostic learnable with respect to ℓif and only if each of Fk YX k is agnostic learnable with respect to ψk d1.

Proof The proof of the suﬃciency direction is similar to that of Theorem 6, so we defer it to Appendix C.1. We now focus on the necessity direction. To that end, we show that if F is agnostic learnable with respect to ℓ, then each Fk is agnostic learnable with respect to ψk d1. In particular, given oracle access to agnostic learner A for F, we construct agnostic learner A1 for F1. By symmetry, a similar reduction can then be used to construct an agnostic learner for each component Fk. Since we are given a sample with a single, univariate target, the main problem is to ﬁnd the right way to augment samples to a K-variate target. In the proof of Theorem 6, we showed that randomly choosing yik Uniform({ 1, 1}) for k 2 results in all predictors having a constant 1/2 risk leaving only the risk of the ﬁrst component on both sides. Unfortunately, in regression under general losses, no single augmentation works for every distribution on X. Thus, we augment the samples by considering all possible behaviors of (F2, . . . , FK) on the sample. Since the function class maps to a potentially uncountably inﬁnite space, we ﬁrst discretize each component of the function class and consider all possible labelings over the discretized space. Fix 1 > α > 0. For k 2, deﬁne the discretization

fα k (x) = f(x)

for every fk Fk and the discretized component class Fα k = {fα k |fk Fk}. Note that a function in Fk maps to {0, α, 2α, . . . , 1/α α} and the size of the range of the discretized function class Fα k is 1 + 1/α (α + 1)/α 2/α. For the convenience of exposition, let us deﬁne Fα 2:K to be Fα without the ﬁrst component, and we denote fα 2:K to be an element of Fα 2:K.

Algorithm 2 Agnostic learner for F1 with respect to ψ1 d1 Input: Agnostic learner A for F, samples S = (x1:n, y1 1:n) Dn 1 , and another independent samples e S from D1 1 Deﬁne Saug = {(x1:n, y1 1:n, fα 2:K(x1:n) | fα 2:K Fα 2:K}, all possible augmentations of S by Fα 2:K.

2 Run A over all possible augmentations to get C(S) := A Sa | Sa Saug .

3 Deﬁne C1(S) = {g1 | (g1, . . . , gk) C(S)}, a restriction of C(S) to its ﬁrst coordinate output.

4 Return ˆg1, the predictor in C1(S) with the lowest empirical error over e S with respect to ψ1 d1.

We now show that Algorithm 2 is an agnostic learner for F1. First, let us deﬁne

f 1 := arg min f1 F1 ED1[ψ1 d1(f1(x), y1)],

Multioutput Learnability

to be optimal predictor in F1 with respect to D1. By deﬁnition of F1, there must exist f 2:K F2:K such that (f 1 , f 2:K) F. We note that f k need not be optimal predictors in Fk for k 2, but we use the notation just to associate these component functions with the ﬁrst component function f 1 . Deﬁne f ,α 2:K Fα 2:K to be the corresponding discretization of f 2:K. At a high level, the key idea of this proof is to show that the algorithm A when run on the sample (x1:n, y1 1:n, f ,α 2:K(x1:n) produces a predictor g = A(x1:n, y1 1:n, f ,α 2:K(x1:n)) such that its restriction g1 is a valid agnostic learner for F1. Although one such augmentation is enough to produce an agnostic learner for F1, all possible augmentations are required in step 2 of Algorithm 2 because f 1 is not known to the learner apriori. We now make this argument precise. Suppose g = A((x1:n, y1 1:n, f ,α 2:K(x1:n)) is the predictor obtained by running A on the sample augmented by f ,α 2:K. Note that g C(S) by deﬁnition. Let m A(ϵ, δ, K) be the sample complexity of A. Since A is an agnostic learner for F with respect to ℓ, we have that for n m A(ϵ/4, δ/2, K), with probability at least 1 δ/2,

ED1 h ψ1 d1(g1(x), y1) i +

k=2 EDX ψk d1(gk(x), f ,α k (x))

ED1 ψ1 d1(f1(x), y1) +

k=2 EDX ψk d1(fk(x), f ,α k (x)) !

Note that the quantity on the left is trivially lower bounded by the risk of the ﬁrst component. To handle the right-hand side, we ﬁrst note that the optimal risk is trivially upper bounded by the risk of (f 1 , f 2:K), yielding

ED1 ψ1 d1(g1(x), y1) ED1 ψ1 d1(f 1 (x), y1) +

k=2 EDX ψk d1(f k(x), f ,α k (x)) + ϵ

Next, using the L-Lipschitzness of ψk and the fact that ψk(0) = 0 implies ψk d1(f k(x), f ,α k (x)) L d1(f k(x), f ,α k (x)) Lα for all k 2. Thus, picking α = ϵ 4LK and using the deﬁnition of f 1 , we obtain

ED1 ψ1 d1(g1(x), y1) inf f1 F1 ED1[ψ1 d1(f1(x), y1)] + ϵ

Therefore, we have shown the existence of one predictor g C(S) such that its restriction to the ﬁrst component, g1, obtains the agnostic bound. Note that since ψ1 is L-Lipschitz and satisﬁes ψ1(0) = 0, we obtain that ψ1 d1( , ) L. The upper bound also uses the fact that |f1(x) y1| 1. Now recall that by Hoeﬀding s Inequality and union bound, with probability at least 1 δ/2, the empirical risk of every hypothesis in C1(S) on a sample of size 8L2

ϵ2 log 4|C1(S)|

δ is at most ϵ/4 away from its true error. So, if |e S| 8L2

ϵ2 log 4|C1(S)|

δ , then with probability at least 1 δ/2, the empirical risk of the predictor g1 is

(x,y1) e S ψ1 d1(g1(x), y1) ED1 ψ1 d1(g1(x), y1) + ϵ

4 inf f1 F1 ED1[ψ1 d1(f1(x), y1)]+3ϵ

Raman, Subedi, and Tewari

Since ˆg1, the output of Algorithm 2 is the ERM on e S over C1(S), its empirical risk can be at most the empirical risk of g1, which is at most inff1 F1 ED1[ψ1 d1(f1(x), y1)] + 3ϵ

4 . Given that the population risk of ˆg1 is at most ϵ/4 away from its empirical risk, we can conclude that the population risk of ˆg1 is

ED1[ψ1 d1(ˆg1(x), y1)] inf f1 F1 ED1[ψ1 d1(f1(x), y1)] + ϵ.

Applying union bounds, the entire process, algorithm A on the dataset augmented by f ,α 2:K and ERM in step 4, succeeds with probability 1 δ. The sample complexity of Algorithm 2 is the sample complexity of Algorithm A and the sample complexity of ERM in step 4, which is

m A(ϵ/4, δ/2, K) + 8L2

ϵ2 log 4|C1(S)|

m A(ϵ/4, δ/2, K) + 8KL2

m A(ϵ/4, δ/2, K) log 4K L

where the second inequality follows due to |C1(S)| (2/α)m A(ϵ/4,δ/2,K) K is the required size of C1(S) and our choice of α = ϵ/(4KL). This completes the proof as we have shown that Algorithm 2 is an agnostic learner for F1 with respect to ψ1 d1.

3.2.2 A More General Characterization of Learnability for Decomposable Losses

Unlike Theorems 6 and 8 in classiﬁcation setting where we connected the learnability of F to the learnability of Fk s with respect to 0-1 loss, Theorem 9 relates the learnability of F to the learnability of Fk with respect to ψk d1 instead of the more canonical loss d1. In this section, we complete that ﬁnal step to characterizing learnability in terms of d1 with an additional assumption on ψk.

Assumption 2. For all k [K], the function ψk : R 0 R 0 is monotonic.

Under these assumptions, Theorem 10 provides a more general characterization than Theorem 9.

Theorem 10. Let ℓbe any loss function that satisﬁes Assumptions 1 and 2. Then, a multioutput function class F ([0, 1]K)X is agnostic learnable with respect to ℓif and only if each Fk [0, 1]X is agnostic learnable with respect to d1.

Since the fat-shattering dimension of a real-valued function class characterizes its learnability with respect to d1 loss, Theorem 10 implies that a multioutput function class F is learnable with respect to ℓsatisfying Assumptions 1 and 2 if and only if fatγ(F) < for every ﬁxed scale γ > 0. Theorem 10 is an immediate consequence of Theorem 9 and the following lemma, the proof of which is deferred to Appendix C.2.

Lemma 11. Let Y = [0, 1] be the label space and ψ : R 0 R 0 be any Lipschitz and monotonic function that satisﬁes ψ(0) = 0. A scalar-valued function class G [0, 1]X is agnostic learnable with respect to ψ d1 if and only if G is agnostic learnable with respect to d1.

Multioutput Learnability

The part of the lemma showing that d1 learnability implies ψ d1 is trivial using the Rademacher-based argument and Talagrand s contraction lemma. However, proving ψ d1 learnability implies d1 learnability is non-trivial. The case ψ(z) = |z|2 is considered in (Anthony and Bartlett, 1999, Theorem 19.5), but their proof requires a mismatch between the label space and the prediction space, namely Y = [ 1, 2] but G [0, 1]X . In this work, we improve their result by showing the equivalence between ψ d1 learnability and d1 learnability without requiring extended label space. An application of Lemma 11 is that the learnability of a real-valued function class G with respect to losses d1 and dp are equivalent for any p > 1.

3.2.3 Characterizing Learnability for Non-Decomposable Losses

Next, we study the learnability of function class F with respect to non-decomposable losses. In the regression setting, the natural non-decomposable loss to consider is ℓp norm, which

is deﬁned as ℓp(f(x), y) := P k=1 |fk(x) yk|p 1/p for 1 p < . For p = , the p-norm is deﬁned as ℓ (f(x), y) := maxk |fk(x) yk|. One might be interested in ℓp norms instead of their decomposable counterparts ℓp p losses discussed in the previous section mainly for robustness to outliers. The following result characterizes the agnostic learnability of F with respect to ℓp norms.

Theorem 12. Fix p 1. The function class F YX is agnostic learnable with respect to ℓp norm if and only if each of Fk YX k is agnostic learnable with respect to the absolute value loss, d1.

Using ψk(z) = |z| in Theorem 9 implies that F is learnable with respect to ℓ1 if and only if each Fk is learnable with respect to d1. Thus, to prove Theorem 12, it suﬃces to show that for all p > 1, F is learnable with respect to ℓp if and only if F is learnable with respect to ℓ1 norm. Proof We begin by proving the suﬃciency direction. As discussed above, the learnability of each Fk with respect to d1 implies that F is learnable with respect to ℓ1. Then, the highlevel idea of the proof is to use an agnostic learner for F with respect to ℓ1 to construct a realizable learner for Fα with respect to ℓ1. Using the fact ℓ1(y1, y2) = 0 if and only if ℓp(y1, y2) = 0 for any y1, y2 Y, the realizable learner for ℓ1 is also a realizable learner for ℓp. Finally, as |im(Fα)| < for every α > 0, Lemma 4 guarantees the existence of an agnostic learner for Fα with respect to ℓp. Then, a simple application of the triangle inequality shows that an agnostic learner for Fα is also an agnostic learner for F with respect to ℓp. We now proceed with the formal proof. Part 1: Suﬃciency. Fix p > 1. Recall that the learnability of each Fk with respect to d1 implies that F is learnable with respect to ℓ1. Let D be arbitrary distribution on X Y and A be an agnostic PAC learner for F with respect to ℓ1 with sample complexity m(ϵ, δ) . For any ϵ, δ > 0, with n suﬃciently larger than m(ϵ/2, δ, K), with probability at least 1 δ over S Dn, we have

ED[ℓ1(g(x), y)] inf f F ED[ℓ1(f(x), y)] + ϵ

where g = A(S). Deﬁne Fα to be a discretized function class obtained by discretizing F component-wise using the scheme (1). Consider D to be a realizable distribution

Raman, Subedi, and Tewari

with respect to Fα. Note that the triangle inequality implies ℓ1(f(x), y) ℓ1(fα(x), y) + ℓ1(f(x), fα(x)) ℓ1(fα(x), y)+Kα. Taking α = ϵ 2K and using the fact that inff F E[ℓ1(fα(x), y)] = 0, we obtain ED[ℓ1(g(x), y)] ϵ. Next, using the inequality ℓp(g(x), y) ℓ1(g(x), y) pointwise yields ED[ℓp(g(x), y)] ϵ.

Therefore, A is a realizable learner for Fα with respect to ℓp with sample complexity m(ϵ/2, δ, K). Since |im(Fα)| < and ℓp( , ) are 1-subadditive, Lemma 4 implies that Fα is agnostic learnable with respect to ℓp via Algorithm 1, referred to as algorithm B henceforth. Thus, for any ϵ, δ > 0, there exists a n m B(ϵ/2, δ/2), for any distribution e D on X Y, running B on S Dn outputs a predictor g YX such that with probability at least 1 δ over S e Dn, we have

E e D[ℓp( g(x), y)] inf fα Fα E e D[ℓp(fα(x), y)] + ϵ

2 = inf f F E e D[ℓp(fα(x), y)] + ϵ

Using triangle inequality, we have ℓp(fα(x), y) ℓp(f(x), y)+ℓp(fα(x), f(x)) ℓp(f(x), y)+ αK pointwise. Again taking α = ϵ 2K , we obtain

E e D[ℓp( g(x), y)] inf f F E e D[ℓp(f(x), y)] + ϵ.

Therefore, we have shown that F is agnostic PAC learnable with respect to ℓp. The sample complexity of agnostic learner B can be made precise using the sample complexity of Algorithm 1. In particular, the sample complexity of B is the sample complexity of the realizable learner A and the sample complexity of the ERM in step 2 of Algorithm 1. Proof of Lemma 4 shows that the sample complexity of B must be

m B(ϵ, δ, K) m A

2, K log |(im(Fα)|) + log 4

where b is the upperbound on the loss and c is the subadditivity constant. Since b K, and c = 1 for all ℓp norms with p 1, we obtain

m B(ϵ, δ, K) m A(ϵ/2, δ/2, K) + 8K3

m A(ϵ/2, δ/2, K) log 4K

where we also use the fact that |im(Fα)| (2/α)K for our choice of α = ϵ 2K . Part 2: Necessity. Fix p > 1. We now prove that F being learnable with respect to ℓp implies F is learnable with respect to ℓ1. The proof is identical to the proof of suﬃciency, so we only provide a sketch of the argument here. Our proof strategy follows a similar route through realizable learnability of the discretized class Fα and then the use of Lemma 4. Recall that any agnostic learner A for F with respect to ℓp is a realizable learner for Fα

with respect to ℓp. Using the inequality ℓ1( , ) Kℓp( , ) pointwise, we can deduce that A is also a realizable learner for Fα with respect to ℓ1. Since |im(Fα)| (2/α)K < , Lemma 4 guarantees existence of an agnostic learner for Fα with respect to ℓ1. Using triangle inequality, we obtain ℓ1(fα(x), y) ℓ1(f(x), y) + ℓ1(fα(x), f(x)) ℓ1(f(x), y) + αK pointwise, and choosing appropriate discretization scale allows us to turn agnostic bound for Fα into an agnostic bound for F.

Multioutput Learnability

Remark. We note that Theorem 12 holds for any norm on RK, but we only focus on ℓp norms here due to their practical signiﬁcance. As the fat-shattering dimension of a realvalued function class characterizes its learnability with respect to d1 loss (Bartlett et al., 1996), Theorem 12 implies that a multioutput function class F is learnable with respect to ℓp for 1 p if and only if fatγ(Fk) < for all k [K] at every ﬁxed scale γ > 0. Since we are only concerned with the question of learnability in this work, our focus is not on optimal sample complexity rates. However, we point out that for any p 1, if each Fk is learnable with respect to d1, then F is learnable with respect to ℓp via ERM with a better sample complexity than Algorithm 1. The proof of this claim is based on Rademacher complexity and is provided in Appendix D.

4. Online Multioutput Learnability

Here, we study the online learnability of multioutput function classes. Throughout this section, we give regret bounds assuming an oblivious adversary. A standard reduction (Chapter 4 in Cesa-Bianchi and Lugosi (2006)) allows us to convert oblivious regret bounds to adaptive regret bounds in the full-information setting. A key requirement allowing an oblivious regret bound to generalize to an adaptive regret bound is that the learner s predictions on round t should not depend on any of its past predictions from previous rounds. This is true for all of the online learning algorithms in this section.

4.1 Online Agnostic-to-Realizable Reduction

Our strategy for constructively characterizing the learnability of general losses in both the batch classiﬁcation and regression setting required the ability to convert a realizable learner to an agnostic learner in a black-box fashion. In this section, we provide an analog of this conversion for the online setting. More speciﬁcally, we focus on the setting where F YX

is a multioutput function class but |im(F)| < is ﬁnite. Then, for any c-subadditive loss function ℓ, we constructively convert a (potentially randomized) realizable online learner for F with respect to ℓinto an agnostic online learner for F with respect to ℓ. Theorem 13 formalizes the main result of this subsection.

Theorem 13. Let F YX be a multioutput function class such that |im(F)| < and ℓ: Y Y R 0 be any c-subadditive loss function such that ℓ( , ) M. If A is a realizable online learner for F with respect to ℓwith sub-linear expected regret R(T, |im(F)|), then for every β (0, 1), there exists an agnostic online learner for F with respect to ℓwith expected regret c T T β R(T β, |im(F)|) + M q

2T 1+β ln(|im(F)|),

where R(T, |im(F)|) is any concave, sublinear upperbound on R(T, |im(F)|).

Note that if R(T β, |im(F)|) is sublinear in its ﬁrst argument, then c T

T β R(T β, |im(F)|) + M p

2T 1+β ln(|im(F)|) is sublinear in T for any β (0, 1). By Lemma 14, we are guaranteed the existence of R(T, |im(F)|).

Raman, Subedi, and Tewari

Lemma 14. (Ceccherini-Silberstein et al., 2017, Lemma 5.17) Let g be a positive sublinear function. Then, g is bounded from above by a concave sublinear function.

Therefore, Theorem 13 and Lemma 14 show that for any function class F with ﬁnite image space and any c-subadditive loss function, realizable and agnostic online learnability are equivalent. We now begin the proof of Theorem 13. Proof Let A be a (potentially randomized) online realizable learner for F with respect to ℓ. By deﬁnition, this means that for any (realizable) sequence (x1, f(x1)), ..., (x T , f(x T )) labeled by a function f F, we have

t=1 ℓ(A(xt), f(xt))

R(T, |im(F)|),

where R(T, |im(F)|) is a sub-linear function of T. We now use A to construct an agnostic online learner Q for F with respect to ℓ. Since we are assuming an oblivious adversary, let (x1, y1), ..., (x T , y T ) (X Y)T denote the stream of points to be observed by the online learner and f = arg minf F PT t=1 ℓ(f(xt), yt) to be the optimal function in hindsight. Our high-level strategy is to construct a large set of Experts that approximately cover all possible labelings of the instances x1, ..., x T by functions in F. In particular, each Expert uses an independent copy of A to make predictions, but update A using diﬀerent sequences of labeled instances. Together, our set of Experts update A using all possible sequences of labeled instances. In order to ensure that the number of Experts is not too large, we construct such a set of Experts over a suﬃciently small sub-sample of the stream. Finally, we run the celebrated Randomized Exponential Weights Algorithm (REWA) (Cesa-Bianchi and Lugosi, 2006) using our set of experts and the scaled loss function ℓ M over the original stream of points (x1, y1), ..., (x T , y T ). We now formalize this idea below. For any bitstring b {0, 1}T , let φ : {t : bt = 1} im(F) denote a function mapping time points where bt = 1 to elements in the image space im(F). Let Φb (im(F)){t:bt=1}

denote all such functions φ. For every f F, let φf b Φb be the mapping such that for all t {t : bt = 1}, φf b (t) = f(xt). Let |b| = |{t : bt = 1}|. For every b {0, 1}T

and φ Φb, deﬁne an Expert Eb,φ. Expert Eb,φ, formally presented in Algorithm 3, uses A to make predictions in each round. However, Eb,φ only updates A on those rounds where bt = 1, using φ to compute a labeled instance (xt, φ(t)). For every b {0, 1}T , let Eb = S φ Φb{Eb,φ} denote the set of all Experts parameterized by functions φ Φb. If b is the all zeros bitstring, then Eb is empty. Therefore, we actually deﬁne Eb = {E0} S φ Φb{Eb,φ}, where E0 is the expert that never updates A and plays ˆyt = A(xt) for all t [T]. Note that 1 |Eb| (|im(F)|)|b|. With this notation in hand, we are now ready to present Algorithm 4, our main agnostic online learner Q for F with respect to ℓ. Our goal is to show that Q enjoys sublinear expected regret. There are three main sources of randomness: the randomness involved in sampling B, the internal randomness of A, and the internal randomness of REWA. Let B, A and P denote the random variables associated with each source of randomness respectively. By construction, B, A, and P are independent.

Multioutput Learnability

Algorithm 3 Expert(b, φ) Input: Independent copy of realizable online learner A for F with respect to ℓ

1 for t = 1, ..., T do

2 Receive example xt 3 Predict ˆyt = A(xt)

4 if bt = 1 then

5 Update A by passing (xt, φ(t))

Algorithm 4 Agnostic online learner Q for F with respect to ℓ Input: Parameter 0 < β < 1

1 Let B {0, 1}T such that Bt iid Bernoulli(T β

2 Construct the set of experts EB = {E0} S φ ΦB{EB,φ} according to Algorithm 3

3 Run REWA P using EB and the loss function ℓ M over the stream (x1, y1), ..., (x T , y T )

Using Theorem 21.11 in Shalev-Shwartz and Ben-David (2014) and the fact that B, A and P are independent, REWA guarantees almost surely that

t=1 E [ℓ(P(xt), yt)|B, A] inf E EB

t=1 ℓ(E(xt), yt) + M p

2T ln(|EB|).

Taking an outer expectation gives

t=1 ℓ(P(xt), yt)

t=1 ℓ(E(xt), yt)

2T ln(|EB|) i .

t=1 ℓ(Q(xt), yt)

t=1 ℓ(P(xt), yt)

t=1 ℓ(E(xt), yt)

2T ln(|EB|) i

t=1 ℓ(EB,φf B (xt), yt)

2T ln(|EB|) i .

In the last step, we used the fact that for all b {0, 1}T and f F, we have Eb,φf b Eb.

It now suﬃces to upperbound E h PT t=1 ℓ(EB,φf B (xt), yt) i . To do so, we need some additional notation. Given the realizable online learner A, an instance x X, and an ordered ﬁnite sequence of labeled examples L (X Y) , let A(x|L) be the random variable denoting the prediction of A on the instance x after running and updating on L. For any b {0, 1}T , f F, and t [T], let Lf b<t = {(xi, f(xi)) : i < t and bi = 1} denote

Raman, Subedi, and Tewari

the subsequence of the sequence of labeled instances {(xi, f(xi))}t 1 i=1 where bi = 1. Using this notation, we can write

t=1 ℓ(EB,φf B (xt), yt)

t=1 ℓ(A(xt|Lf B<t), yt)

t=1 ℓ(A(xt|Lf B<t), yt)P [Bt = 1]

t=1 E h ℓ(A(xt|Lf B<t), yt)P [Bt = 1] i

t=1 E h ℓ(A(xt|Lf B<t), yt)1{Bt = 1} i .

To see the last equality, note that the prediction A(xt|Lf B<t) only depends on bitstring (B1, . . . , Bt 1) and the internal randomness of A, both of which are independent of Bt. Thus, we have

E h ℓ(A(xt|Lf B<t), yt) 1{Bt = 1} i = E h ℓ(A(xt|Lf B<t), yt) i E [1{Bt = 1}]

= E h ℓ(A(xt|Lf B<t), yt) i P[Bt = 1]

as needed. Continuing onwards,

t=1 ℓ(EB,φf B (xt), yt)

t=1 ℓ(A(xt|Lf B<t), yt)1{Bt = 1}

t:Bt=1 ℓ(A(xt|Lf B<t), yt)

t:Bt=1 ℓ(A(xt|Lf B<t), f (xt))

t:Bt=1 ℓ(f (xt), yt)

t:Bt=1 ℓ(A(xt|Lf B<t), f (xt))

t=1 ℓ(f (xt), yt)

t:Bt=1 ℓ(A(xt|Lf B<t), f (xt))

t=1 ℓ(f(xt), yt)

The inequality follows from the fact that ℓis a c-subadditive and the last equality follows from the deﬁnition of f . We now need to bound c T T β E h P t:Bt=1 ℓ(A(xt|Lf B<t), f (xt)) i . Using the fact that A is a realizable online learner and gets updated on a stream of

Multioutput Learnability

instances labeled by f only on rounds where Bt = 1, we get

t:Bt=1 ℓ(A(xt|Lf B<t), f (xt))

t:Bt=1 ℓ(A(xt|Lf B<t), f (xt)) B

T β E [R(|B|, |im(F)|)] .

Putting things together, we ﬁnd that,

t=1 ℓ(Q(xt), yt)

t=1 ℓ(EB,φf B (xt), yt)

2T ln(|EB|) i

t=1 ℓ(f(xt), yt) + c T

T β E [R(|B|, |im(F)|)] + ME hp

2T ln(|EB|) i

t=1 ℓ(f(xt), yt) + c T

T β E [R(|B|, |im(F)|)] + ME hp

2T|B| ln(|im(F)|) i ,

where the last inequality follows from the fact that that |EB| (|im(F)|)|B|. By Jensen s inequality, we further get that, E hp

2T|B| ln(|im(F)|) i p

2T β+1 ln(|im(F)|), which implies that

t=1 ℓ(Q(xt), yt)

t=1 ℓ(f(xt), yt)+ c T

T β E [R(|B|, |im(F)|)]+M q

2T β+1 ln(|im(F)|).

Next, by Lemma 14, there exists a concave sublinear function R(|B|, |im(F)|) that upperbounds R(|B|, |im(F)|). By Jensen s inequality, we obtain E[R(|B|, |im(F)|)] R(T β, |im(F)|), which yields

t=1 ℓ(Q(xt), yt)

t=1 ℓ(f(xt), yt) + c T

T β R(T β, |im(F)|) + M q

2T β+1 ln(|im(F)|).

This completes the proof as we have shown that Q is an agnostic online learner for F with respect to ℓwith the stated regret bound.

4.2 Online Multilabel Classiﬁcation

Let Y = { 1, 1}K. We provide analogs of Theorem 6 and 8 in the online setting. We begin by characterizing the learnability of the Hamming loss and then move to give a characterization of learnability for all losses satisfying the identity of indiscernibles. Similar to the batch setting, we can show that the MCLdim of F characterizes online multilabel learnability (see Appendix F), but here, we give a characterization that better exploits the multilabel structure of the problem.

Raman, Subedi, and Tewari

4.2.1 Characterizing Online Learnability for the Hamming Loss

Theorem 15 characterizes the online learnability of a multilabel function class F with respect to ℓH.

Theorem 15. A function class F YX is online learnable with respect to the Hamming loss if and only if each restriction Fk YX k is online learnable with respect to the 0-1 loss.

The proof of Theorem 15 is similar to that of Theorem 6, so we defer the full proof to Appendix E and only provide a sketch here. The proof of suﬃciency direction is based on a reduction: given oracle access to online learners {Ak}K k=1 for {Fk}K k=1 with respect to ℓ0-1, we construct an online learner A for F with respect to ℓH. In fact, similar to the batch setting, the online multilabel learning algorithm A is simple: in each round t [T], receive xt, query the predictions A1(xt), ..., AK(xt), and ﬁnally predict the concatenation ˆyt = (A1(xt), ..., AK(xt)). Once the true label yt = (y1 t , ..., y K t ) is revealed, update each online learner Ak by passing (xt, yk t ) for k [K]. Using some algebra, one can show that this prediction rule achieves sublinear regret for F. For the necessity direction, given oracle access to an online learner A for F with respect to ℓH, we construct an online learner B for F1 with respect to ℓ0-1. A similar reduction can be used to construct online learners for each restriction Fk. Similar to the batch setting, the online learning algorithm B is simple: in each round t [T], receive xt, query ˆyt = A(xt) and predict ˆy1 t = A1(xt). Once the true label y1 t is revealed, update A by passing (xt, yt) where yt = (y1 t , σ2 t , ..., σK t ) and {σi t}K i=2 is an i.i.d sequence of Rademacher random variables. A straightforward analysis shows that such a prediction rule achieves sublinear regret for F1.

4.2.2 Characterizing Online Learnability for General Losses

Using Theorem 13 and Theorem 15, we now characterize the learnability of arbitrary multilabel loss functions ℓas long as they satisfy the identity of indiscernibles. The key idea is that since there are only ﬁnite number of possible inputs to ℓ, for any ℓsatisfying the identity of indiscernibles, there must exist universal constants a and b such that aℓH(y1, y2) ℓ(y1, y2) bℓH(y1, y2). Then, we can characterize the learnability of ℓby relating it to the learnability of ℓH. In fact, we prove a slightly more general result, showing an equivalence between the learnability of any two arbitrary losses satisfying the identity of indiscernibles.

Lemma 16. Let ℓand ℓ be any two loss functions satisfying the identity of indiscernibles. A function class F YX is online learnable with respect to ℓif and only if F is online learnable with respect to ℓ .

The proof of Lemma 16 is similar to that of Lemma 7 with the main diﬀerence being the use of Theorem 13 instead of Lemma 4. Since Lemma 16 is our ﬁrst application of Theorem 13, we provide the full proof here. Proof Since ℓand ℓ are arbitrary, it suﬃces to prove only one direction. To that end, suppose F is online learnable with respect to ℓ. We now show that F is online learnable with respect to ℓ as well.

Multioutput Learnability

Let a and b be the universal constants such that for all y1, y2 Y, aℓ(y1, y2) ℓ (y1, y2) bℓ(y1, y2). Let c = maxr =t ℓ (r,t)

minr =t ℓ (r,t) . Since |im(F)| = 2K < and ℓ is a c-subadditive, by Theorem 13, it suﬃces to give a realizable online learner for F with respect to ℓ . Since F is online learnable with respect to ℓ, there exists an algorithm A such that for any sequence (x1, y1), ..., (x T , y T ), we have

t=1 ℓ(A(xt), yt) inf f F

t=1 ℓ(f(xt), yt)

where R(T, 2K) is a sublinear function of T. In the realizable setting, we are guaranteed that for any sequence (x1, y1), ..., (x T , y T ) that the online learner may observe, there exists a f F s.t f(xt) = yt for all t [T]. Since ℓsatisﬁes the identity of indiscernibles, we have that for any realizable sequence (x1, y1), ..., (x T , y T ), inff F PT t=1 ℓ(f(xt), yt) = 0. Thus, we

have that E h PT t=1 ℓ(A(xt), yt) i R(T, 2K). Noting that ℓ(A(xt), yt) ℓ (A(xt),yt)

that E h PT t=1 ℓ (A(xt), yt) i b R(T, 2K), showing that A is also a realizable online learner

for F with respect to ℓ . For any β (0, 1), the construction in Theorem 13 can then be used to convert A into an agnostic online learner for F with respect to ℓ with expected regret bound cb T

T β R(T β, 2K) + M

where M is such that ℓ M and R(T β, 2K) is any concave sublinear upperbound of R(T β, 2K). This completes our proof.

As an immediate consequence of Lemma 16 and Theorem 15, we get the following theorem characterizing the online learnability of general multilabel losses.

Theorem 17. Let ℓbe any multilabel loss function that satisﬁes the identity of indiscernibles. A function class F YX is online learnable with respect to ℓif and only if each restriction Fk YX k is online learnable with respect to the 0-1 loss.

Remark. Since the Littlestone dimension characterizes online learnability for binary classiﬁcation under the 0-1 loss (Ben-David et al., 2009), Theorem 17 also implies that ﬁniteness of Ldim(Fk) for all k [K] is a necessary and suﬃcient condition for online multilabel learnability. Moreover, if Ldim(Fk) < for all k [K], then we have MCLdim(F) < . This follows from the fact that MCLdim(F) PK k=1 Ldim(Fk). To see this, note that MCLdim(F) is the lowerbound on the number of mistakes of any deterministic multiclass learner in the realizable setting (Daniely et al., 2011, Theorem 17). On the other hand, one can construct a deterministic realizable learner for F using K diﬀerent Standard Optimal Algorithms (SOA) for binary function classes Fk s. Namely, deﬁne an algorithm A such that A(x) := (SOA(F1)(x), . . . , SOA(FK)(x)) { 1, 1}K. Since each SOA(Fk) makes at most Ldim(Fk) number of mistakes, A makes no more than PK k=1 Ldim(Fk) mistakes. We can use this fact to give an improved version of Theorem 13 for classes F

Raman, Subedi, and Tewari

with MCLdim(F) < . In particular, when Y = { 1, 1}K, any F YX that is learnable in the realizable setting with respect to ℓis also learnable in the agnostic setting with regret

TMCLdim(F) ln(T) O B q

T PK k=1 Ldim(Fk) ln(T) . Here, B is the maximum value ℓcan take. The improved regret bound can be found in the suﬃciency proof of Theorem 34 in Appendix F.

4.3 Bandit Online Multilabel Classiﬁcation

We extend the results in the previous subsection to the online setting where the learner only observes bandit feedback in each round. Theorem 18 gives a characterization of bandit online learnability of a function class F in terms of the online learnability of each restriction.

Theorem 18. Let ℓbe any loss function that satisﬁes the identity of indiscernibles. A function class F YX is bandit online learnable with respect to ℓif and only if each restriction Fk YX k is online learnable with respect to the 0-1 loss.

Similar to the full-feedback setting, Theorem 18 also gives that the ﬁniteness of Ldim(Fk) for all k [K] is a necessary and suﬃciency condition for bandit online multilabel learnability. The proof of Theorem 18 uses the realizable-to-agnostic conversion for bandit feedback setting when the label space Y is ﬁnite. The following Theorem makes this argument precise.

Theorem 19. Let Y be a ﬁnite label space, F YX a multioutput function class, and ℓ: Y Y R 0 be any c-subadditive loss function such that ℓ( , ) M. If A is a realizable online learner for F with respect to ℓunder full-feedback with sub-linear expected regret R(T, |Y|), then for every β (0, 1), there exists an online learner for F with respect to ℓ with expected regret

c T T β R(T β, |Y|) + e M q

2T 1+β|Y| ln(|Y|),

under bandit feedback, where R(T, |Y|) is any concave, sublinear upperbound on R(T, |Y|).

The proofs for Theorem 18 and Theorem 19 are provided in Appendix G.

4.4 Online Multioutput Regression

In this section, we characterize the online learnability of multioutput function classes. Similar to the batch setting, we consider, without loss of generality, the case when Y = [0, 1]K RK for K N. In addition, we consider the same set of decomposable and non-decomposable loss functions as in the batch setting. Namely, our decomposable loss functions satisfy Assumptions 1 and 2, and our non-decomposable loss functions are ℓp norms. Informally, our main result asserts that a multioutput function class F YX is online learnable if and only if each restriction Fk is online learnable.

4.4.1 Characterizing Learnability for Decomposable Losses

In this subsection, we characterize the online learnability of multioutput function classes with respect to decomposable losses satisfying Assumption 1. Our main theorem is presented below.

Multioutput Learnability

Theorem 20. Let ℓbe a decomposable loss function satisfying Assumption 1. A multioutput function class F YX is online learnable with respect to ℓif and only if each Fk YX k is online learnable with respect to ψk d1.

Proof As usual, we prove Theorem 20 in two parts: ﬁrst suﬃciency and then necessity. The suﬃciency proof is similar to that of the proof for Hamming loss in Theorems 6 and 15. The necessity direction is more involved, but the main idea is to combine the augmentation technique used in the proof of Theorem 9 with the algorithmic conversion technique developed in the proof of Theorem 13. Part 1: Suﬃciency. We ﬁrst prove that online learnability of each restriction Fk with respect to ψk d1 is suﬃcient for online learnability of F with respect to ℓ. Since ℓ(f(x), y) = PK k=1 ψk d1(fk(x), yk) is decomposable, we can use the exact same strategy as in Section 4.2.1 to convert online learners A1, ..., AK for F1, ..., FK with respect to ψk d1 to an online learner A for F with respect to ℓ. More speciﬁcally, in each round t [T], receive xt, query the predictions A1(xt), ..., AK(xt), and ﬁnally predict the concatenation ˆyt = (A1(xt), ..., AK(xt)). Once the true label yt = (y1 t , ..., y K t ) is revealed, update each online learner Ak by passing (xt, yk t ) for k [K]. Using the exact same proof as in Section 4.2.1, it follows that the expected regret of A is PK k=1 Rk(T) where Rk(T) is the regret of online algorithm Ak. Since K is ﬁnite, the regret of A is sublinear in T when evaluated using ℓ. Part 2: Necessity. Similar to the batch setting, we prove the necessity direction of Theorem 20 constructively. That is, given oracle access to an online learner A for F with respect to ℓ, we construct an online learner Q for F1 with respect to ψ1 d1. By symmetry, a similar reduction can be used to construct online learners for each restriction Fk. As mentioned before, we assume an oblivious adversary, and therefore the stream of points to be observed by the online learner, denoted (x1, y1), ..., (x T , y T ) (X [0, 1])T , is ﬁxed beforehand. Let f 1 = arg minf1 F1 PT t=1 ψk d1(f1(xt), yt) denote the optimal function in hindsight and f F its completion. Since we are trying to construct an online learner for F1, the targets y1, ..., y T are scalarvalued. However, A is an online learner for F and therefore can only processes vector-valued targets. Thus, we need to ﬁgure out how to augment the scalar-valued targets y1, ..., y T in a way that allows us to use A to construct an online learner Q for F1. Following a similar strategy as in the proof of Theorem 13, we can construct a set of Experts that simulate online games with A by augmenting, in all possible ways, the scalar-valued targets of a sub-sample of the stream into vector-valued targets using vectors in Yα 2:k, the discretized label space for components 2 through K. In particular, our high-level strategy is to:

1. Randomly sub-sample points from the stream

2. Construct a set of Experts, each of which:

(a) Uses an independent copy of A to make predictions ˆyt = A1(xt)

(b) Augments the scalar-valued targets of each labeled instance in the sub-sampled stream to vector-valued targets using vectors in the discretized image space im(Fα 2:K)

(c) Simulates an online game with its independent copy of A over only the augmented sub-sampled stream with vector-valued targets

Raman, Subedi, and Tewari

3. Run REWA using the set of experts in Step 2 and the ψ1 d1 loss function over the original stream of points.

We now formalize this idea. For any bitstring b {0, 1}T , let φ : {t : bt = 1} im(Fα 2:K) denote a function mapping time points where bt = 1 to vectors in the discretized image space im(Fα 2:K). Let Φb (im(Fα 2:K)){t:bt=1} denote all such functions φ. For every f F, let φf b Φb be the mapping such that for all t {t : bt = 1}, φf b (t) = fα 2:K(xt). Let |b| = |{t : bt = 1}|. For every b {0, 1}T and φ Φb, deﬁne an Expert Eb,φ. Expert Eb,φ, formally presented in Algorithm 5, uses A to make predictions in each round. However, Eb,φ only updates A on those rounds where bt = 1, using φ to augment the scalar-valued labeled instance (xt, yt) to the vector-valued labeled instance (xt, (yt, φ(t))). For every b {0, 1}T , let Eb = S φ Φb{Eb,φ} denote the set of all Experts parameterized by functions φ Φb. If b is the all zeros bitstring, then Eb is empty. Therefore, we actually deﬁne Eb = {E0} S φ Φb{Eb,φ}, where E0 is the expert that never updates A and plays A1(xt) for all t [T]. Note that 1 |Eb| ( 2

Algorithm 5 Expert(b, φ) Input: Independent copy of Online Learner A for ℓ

1 for t = 1, ..., T do

2 Receive example xt 3 Predict yt = A1(xt)

4 Receive yt 5 if bt = 1 then

6 Update A by passing (xt, (yt, φ(t)))

With this notation in hand, we are now ready to present Algorithm 6, our main online learner Q for F1.

Algorithm 6 Online learner Q for F1 with respect to ψ1 d1 Input: Parameters 0 < β < 1 and 0 < α < 1

1 Let B {0, 1}T such that Bt iid Bernoulli(T β

2 Construct the set of experts EB = {E0} S φ ΦB{EB,φ} according to Algorithm 5

3 Run REWA P using EB and the loss function ψ1 d1 over the stream (x1, y1), ..., (x T , y T )

Our goal now is to show that Q enjoys sublinear expected regret. There are three main sources of randomness: the randomness involved in sampling B, the internal randomness of each independent copy of the online learner A, and the internal randomness of REWA. Let B, A and P denote the random variable associated with these sources of randomness respectively. By construction, B, A, and P are independent. Using Theorem 21.11 in Shalev-Shwartz and Ben-David (2014) and the fact that A, P, and B are independent, REWA guarantees

t=1 ψ1 d1(P(xt), yt)

t=1 ψ1 d1(E(xt), yt)

2T ln(|EB|) i .

Multioutput Learnability

t=1 ψ1 d1(Q(xt), yt)

t=1 ψ1 d1(P(xt), yt)

t=1 ψ1 d1(E(xt), yt)

2T ln(|EB|) i

t=1 ψ1 d1(EB,φf B (xt), yt)

2T ln(|EB|) i .

In the last step, we used the fact that for all b {0, 1}T and f F, Eb,φf b Eb.

It now suﬃces to upperbound E h PT t=1 ψ d1(EB,φf B (xt), yt) i . We use the same notation used to prove Theorem 13, but for the sake of completeness, we restate it here. Given an online learner A for ℓ, an instance x X, and an ordered sequence of labeled examples L (X [0, 1]K) , let A(x|L) be the random variable denoting the prediction of A on the instance x after running and updating on L. For any b {0, 1}T , fα 2:K Fα 2:K, and t [T], let Lf b<t = {(xi, (yi, fα 2:K(xi))) : i < t and bi = i} denote the subsequence of the sequence of labeled instances {(xi, (yi, fα 2:K(xi)))}t 1 i=1 where bi = 1. Using this notation, we can write

t=1 ψ1 d1(EB,φf B (xt), yt)

t=1 ψ1 d1(A1(xt|Lf B<t), yt)

t=1 ψ1 d1(A1(xt|Lf B<t), yt)P [Bt = 1]

t=1 E h ψ1 d1(A1(xt|Lf B<t), yt)P [Bt = 1] i

t=1 E h ψ1 d1(A1(xt|Lf B<t), yt)1{Bt = 1} i .

To see the last equality, note that the prediction A(xt|Lf B<t) (and therefore A1(xt|Lf B<t)) only depends on bitstring (B1, . . . , Bt 1) and the internal randomness of A, both of which are independent of Bt. Thus, we have

E h ψ1 d1(A1(xt|Lf B<t), yt)1{Bt = 1} i = E h ψ1 d1(A1(xt|Lf B<t), yt) i E [1{Bt = 1}]

= E h ψ1 d1(A1(xt|Lf B<t), yt) i P[Bt = 1]

as needed. Continuing onwards,

Raman, Subedi, and Tewari

t=1 ψ1 d1(EB,φf B (xt), yt)

t=1 ψ1 d1(A1(xt|Lf B<t), yt)1{Bt = 1}

t:Bt=1 ψ1 d1(A1(xt|Lf B<t), yt)

t:Bt=1 ℓ(A(xt|Lf B<t), (yt, f ,α 2:K(xt)))

t:Bt=1 ℓ(A(xt|Lf B<t), (yt, f ,α 2:K(xt))) B

t:Bt=1 ℓ(f (xt), (yt, f ,α 2:K(xt))) + RA(|B|)

t:Bt=1 ℓ(f (xt), (yt, f ,α 2:K(xt)))

T β E [RA(|B|)]

The ﬁrst inequality follows from the deﬁnition of ℓ. The second inequality follows from the fact that A is an online learner for ℓwith regret bound RA(T) and is updated on the stream labeled by f ,α 2:K only when Bt = 1. Now, we can upperbound the ﬁrst term as follows:

t:Bt=1 ℓ(f (xt), (yt, f ,α 2:K(xt)))

ψ1 d1(f 1 (xt), yt) +

k=2 ψk d1(f k(xt), f ,α k (xt))

t:Bt=1 ψ1 d1(f 1 (xt), yt) + X

t=1 ψ1 d1(f 1 (xt), yt)1{Bt = 1}

T β E [|B|KLα]

t=1 ψ1 d1(f 1 (xt), yt)T β

t=1 ψ1 d1(f 1 (xt), yt) + KTLα.

The ﬁrst inequality follows from the fact that ψk is L-Lipschitz and d1(f k(xt), f ,α k (xt)) α. Putting things together, we ﬁnd that,

Multioutput Learnability

t=1 ψ1 d1(Q(xt), yt)

t=1 ψ1 d1(EB,φf B (xt), yt)

2T ln(|EB|) i

t=1 ψ1 d1(f 1 (xt), yt) + KTLα + T

T β E [RA(|B|)] + E hp

2T ln(|EB|) i

t=1 ψ1 d1(f1(xt), yt) + KTLα + T

T β E [RA(|B|)] + E

2TK|B| ln( 2

where the last inequality follows from the fact that that |EB| ( 2

α)K|B| and the deﬁnition

of f . By Jensen s inequality, we further get that, E hq

2TK|B| ln( 2

2T β+1K ln( 2

α), which implies that

t=1 ψ1 d1(Q(xt), yt)

t=1 ψ1 d1(f1(xt), yt)+KTLα+ T

T β E [RA(|B|)]+

2T β+1K ln( 2

Next, by Lemma 14, there exists a concave sublinear function RA(|B|) of |B| that upperbounds RA(|B|). By Jensen s inequality, we obtain E[RA(|B|)] RA(T β), which yields

t=1 ψ1 d1(Q(xt), yt)

t=1 ψ1 d1(f1(xt), yt)+KTLα+ T

T β RA(T β)+

2T β+1K ln( 2

Picking α = 1 KTL and β (0, 1), gives that Q enjoys sublinear expected regret:

t=1 ψ1 d1(Q(xt), yt)

t=1 ψ1 d1(f1(xt), yt) 1+ T

T β RA(T β)+ q

4T β+1K ln(KTL).

This completes the proof as we have shown that Q is an online learner for F1 with respect to ψ1 d1.

4.4.2 A More General Characterization of Learnability for Decomposable Losses

Theorem 20 characterizes the learnability of multioutput function classes F with respect to decomposable loss functions ℓin terms of the learnability of Fk with respect to ψk d1.

Raman, Subedi, and Tewari

Similar to the batch setting, we can remove ψk, and characterize the learnability of F with respect to ℓin terms of the learnability of Fk s with respect to d1. However, to do so, we need to an place additional assumption on the decomposable loss function ℓ. Theorem 21 below summarizes the main result of this section.

Theorem 21. Let ℓbe any decomposable loss function satisfying Assumptions 1 and 2. A multioutput function class F YX is online learnable with respect to ℓif and only if each Fk YX k is online learnable with respect to d1.

The main tool needed to prove Theorem 21 is Lemma 22, which relates the online learnability of a scalar-output function class H [0, 1]X with respect to ψ d1 to its online learnability with respect to d1, where ψ is any monotonic, Lipschitz function such that ψ(0) = 0. The proof of Lemma 22 can be found in Appendix H.

Lemma 22. Let ψ : R 0 R 0 be any monotonic and Lipschitz function such that ψ(0) = 0. A scalar-valued function class G [0, 1]X is online learnable with respect to ψ d1, if and only if G is online learnable with respect to d1.

Note that for every 1 p < and x 0, ψ(x) = xp is a monotonic increasing, Lipschitz function. Therefore, Lemma 22 shows that online learnability with respect to dp is equivalent to online learnability with respect to d1. Combining Assumption 2, Theorem 20, and Lemma 22 immediately gives Theorem 21. Since the Sequential Fat Shattering dimension characterizes online learnability with respect to the absolute loss, Theorem 21 further implies that for any decomposable loss satisfying Assumptions 1 and 2, the ﬁniteness of fatseq γ (Fk) for all k [K] and γ > 0 is a suﬃcient and necessary condition for online multioutput learnability.

4.4.3 Characterizing Learnability of Non-Decomposable Losses

In this section, we characterize the online learnability of multioutput function classes F for a natural family of non-decomposable losses, ℓp norms for 1 p . We prove an analogous theorem to Theorem 12, by relating the online learnability of F with respect to ℓp to the online learnability of each Fk with respect to d1. We note that the proof of only uses the fact that ℓp norms are equivalent (up to a K dependent constant) to the ℓ1 norm. Since any two norms in a ﬁnite dimensional space are equivalent, Theorem 23 actually holds true for any norm in RK. But we only consider ℓp norms here due to their practical importance.

Theorem 23. Let 1 p . A function class F YX is online learnable with respect to ℓp if and only if each Fk YX k is online learnable with respect to d1.

By Theorem 20, F is online learnable with respect to ℓ1 if and only if each restriction Fk is online learnable with respect to d1. Thus, to prove Theorem 23 it suﬃces to show that F is online learnable with respect to ℓp if and only if F is online learnable with respect to ℓ1 for p > 1. At a high-level, the proof Theorem 23 follows a similar route as the proof of Theorem 12: convert an agnostic learner for F into a realizable learner for Fα and then use realizable-to-agnostic conversion for Fα. Proof Fix p > 1. By the argument above, it suﬃces to show that F is online learnable with respect to ℓp if and only if F is online learnable with respect to ℓ1. We begin by

Multioutput Learnability

proving suﬃciency - if F is online learnable with respect to ℓ1 then F is online learnable with respect to ℓp. Let A be an online learner for F with respect to ℓ1. Our goal is to construct an online learner Q for F with respect to ℓp. We assume an oblivious adversary, and therefore the stream of points to be observed by the online learner Q, denoted (x1, y1), ..., (x T , y T ) (X [0, 1]K)T , is ﬁxed beforehand. Let f = arg minf F PT t=1 ℓp(f(xt), yt) also denote the optimal function in hindsight with respect to the ℓp loss. Our strategy follows three steps. First, we show that A is a realizable online learner for Fα with respect to ℓp. Then, since |im(Fα)| ( 2

α)K < is ﬁnite and ℓp is a 1subadditive (by triangle inequality), Theorem 13 allows to convert the realizable online learner A for Fα with respect to ℓp into an agnostic online learner Q for Fα with respect to ℓp. Finally, for an appropriately selected discretization parameter α, we show that Q is also an agnostic online for F with respect to ℓp, which completes the proof. To that end, let (x1, f ,α(x1)), ..., (x T , f ,α(x T ) denote a (realizable) sequence of instances labeled by some function f ,α Fα. Since || ||q || ||p for q > p, we have that

t=1 ℓp(A(xt), f ,α(xt))

t=1 ℓ1(A(xt), f ,α(xt))

Since A is an online learner for F with respect to ℓ1, we get,

t=1 ℓ1(A(xt), f ,α(xt))

t=1 ℓ1(f(xt), f ,α(xt)) + RA(T) αKT + RA(T)

where RA(T) is the regret of online learner A. Combining things together, we have that

t=1 ℓp(A(xt), f ,α(xt))

αKT + RA(T),

showing that A is a realizable online learner for Fα with respect to ℓp for a small enough α. Now, since |im(Fα)| ( 2

α)K < is a ﬁnite, ℓp is a 1subadditive loss function, and A is a realizable online learner, for any β (0, 1), Theorem 13 gives an agnostic online learner Q for Fα with respect to ℓp with the following regret guarantee over the original stream (x1, y1), ..., (x T , y T ):

t=1 ℓp(Q(xt), yt)

t=1 ℓp(fα(xt), yt)+ T

T β (αKT β+RA(T β))+K

2T β+1K ln( 2

where αKT β + RA(T β) is any concave sublinear upperbound of αKT β + RA(T β). We also use the fact that the function T 7 αKT β is a concave sublinear function of T and the sum of two concave functions is itself a concave function. Noting that ℓp(fα(xt), yt) ℓp(f(xt), yt) + ℓp(fα(xt), f(xt)) ℓp(f(xt), yt) + αK, gives

t=1 ℓp(Q(xt), yt)

t=1 ℓp(f(xt), yt)+αKT+ T

T β (αKT β+RA(T β))+K

2T β+1K ln( 2

Raman, Subedi, and Tewari

Combining like terms together, we have

t=1 ℓp(Q(xt), yt)

t=1 ℓp(f(xt), yt) + 2αKT + T

T β RA(T β) + K

2T β+1K ln( 2

Finally, picking α = 1 2KT gives that

t=1 ℓp(Q(xt), yt)

t=1 ℓp(f(xt), yt) 1 + T

T β RA(T β) + K q

2T β+1K ln(4KT).

Since RA(T β) is sublinear in T β and β (0, 1), Q enjoys sublinear expected regret. Thus, we have shown that Q is also an agnostic online learner for F with respect to ℓp. The reverse direction follows identically and uses the fact that for any p > 1, || ||p || ||1 K|| ||p. In particular, using the exact same argument, we can show that if A is an online learner for F with respect to ℓp, then A is also a realizable online learner for Fα

with respect to ℓ1 with expected regret bound:

t=1 ℓ1(A(xt), f ,α(xt))

t=1 ℓp(A(xt), f ,α(xt))

αK2T + KRA(T).

Using A as the realizable learner in Theorem 13, for any β (0, 1), picking α = 1 (K+K2)T gives a regret bound:

t=1 ℓ1(Q(xt), yt)

t=1 ℓ1(f(xt), yt) 1 + KT

T β RA(T β) + K q

2T β+1K ln(4K2T),

where RA(T β) is any concave sublinear upperbound of RA(T β). Since β (0, 1), Q is an online learner for F with respect to ℓ1 as needed. This completes the proof of Theorem 23.

Remark. As with decomposable losses, Theorem 23 also implies that for any ℓp norm loss, the ﬁniteness of fatseq γ (Fk), for all k [K] and ﬁxed γ > 0, is a suﬃcient and necessary condition for online multioutput learnability.

5. Discussion

In this work, we give a characterization of multioutput learnability in four settings: batch classiﬁcation, online classiﬁcation, batch regression, and online regression. In all four settings, we show that a multioutput function class is learnable if and only if each restriction is learnable. All of our bounds in this paper scale with K, preventing our current techniques from extending to the case when K is inﬁnite. Accordingly, we pose it as an open question to characterize multioutput learnability when K is inﬁnite (e.g. function-space valued regression). Furthermore, we also leave it open to ﬁnd combinatorial dimensions that provide tight quantitative characterizations of batch and online multioutput learnability.

Multioutput Learnability

Acknowledgements

We acknowledge the support of NSF via grants IIS-2007055 (AT) and DMS-2413089 (AT, US). VR acknowledges the support of the NSF Graduate Research Fellowship.

Noga Alon, Shai Ben-David, Nicol o Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4), 1997. ISSN 00045411. doi: 10.1145/263867.263927.

Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends R in Machine Learning, 4(3):195 266, 2012.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. doi: 10.1017/CBO9780511624216.

Idan Attias and Steve Hanneke. Adversarially robust pac learnability of real-valued functions, 2023.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002.

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463 482, 2003.

Peter L. Bartlett, Philip M. Long, and Robert C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3): 434 452, 1996. ISSN 0022-0000. doi: https://doi.org/10.1006/jcss.1996.0033.

Shai Ben-David, Nicol o Cesa-Bianchi, and Philip M. Long. Characterizations of learnability for classes of {0, ..., n}-valued functions. Journal of Computer and System Sciences, 50 (1):74 86, 1995.

Shai Ben-David, D avid P al, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.

Hanen Borchani, Gherardo Varando, Concha Bielza, and Pedro Larranaga. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216 233, 2015.

Philip J Brown and James V Zidek. Adaptive multivariate ridge regression. The Annals of Statistics, 8(1):64 74, 1980.

Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoﬀ. A characterization of multiclass learnability, 2022. URL https://arxiv.org/abs/2203.01550.

R obert Busa-Fekete, Heejin Choi, Krzysztof Dembczynski, Claudio Gentile, Henry Reeve, and Balazs Szorenyi. Regret bounds for multilabel classiﬁcation in sparse label regimes. Advances in Neural Information Processing Systems, 35:5404 5416, 2022.

Raman, Subedi, and Tewari

T. Ceccherini-Silberstein, M. Salvatori, and E. Sava-Huss. Groups, Graphs and Random Walks. London Mathematical Society Lecture Note Series. Cambridge University Press, 2017. doi: 10.1017/9781316576571.

Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

Lena Chekina, Dan Gutfreund, Aryeh Kontorovich, Lior Rokach, and Bracha Shapira. Exploiting label dependencies for improved sample complexity. Machine learning, 91:1 42, 2013.

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Structured prediction theory based on factor graph complexity. Advances in Neural Information Processing Systems, 29, 2016.

Amit Daniely and Tom Helbertal. The price of bandit information in multiclass online classiﬁcation. In Conference on Learning Theory, pages 93 104. PMLR, 2013.

Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesv ari, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 287 316, Barcelona, Spain, 13 15 Jun 2014. PMLR.

Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the erm principle. In Sham M. Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 207 232, Budapest, Hungary, 09 11 Jun 2011. PMLR.

Krzysztof Dembczy nski, Willem Waegeman, Weiwei Cheng, and Eyke H ullermeier. Regret analysis for performance metrics in multi-label classiﬁcation: the case of hamming and subset zero-one loss. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part I 21, pages 280 295. Springer, 2010.

Krzysztof Dembczy nski, Willem Waegeman, Weiwei Cheng, and Eyke H ullermeier. On label dependence and loss minimization in multi-label classiﬁcation. Machine Learning, 88:5 45, 2012.

Dylan J. Foster and Alexander Rakhlin. ℓ vector contraction for rademacher complexity, 2019.

Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. In Proceedings of the 24th annual conference on learning theory, pages 341 358. JMLR Workshop and Conference Proceedings, 2011.

Claudio Gentile and Francesco Orabona. On multilabel classiﬁcation and ranking with partial feedback. Advances in Neural Information Processing Systems, 25, 2012.

Multioutput Learnability

Giorgio Gnecco and Marcello Sanguineti. Estimates of the approximation error using rademacher complexity: learning vector-valued functions. Journal of Inequalities and Applications, 2008:1 16, 2008.

Max Hopkins, Daniel M. Kane, Shachar Lovett, and Gaurav Mahajan. Realizable learning is all you need. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 3015 3069. PMLR, 02 05 Jul 2022.

Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 935 944, 2016.

Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classiﬁcation using bayesian compressed sensing. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/ e1d5be1c7f2f456670de3d53c7b54f4a-Paper.pdf.

Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward eﬃcient agnostic learning. Machine Learning, 17:115 141, 1994.

Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent multilabel classiﬁcation. Advances in Neural Information Processing Systems, 28, 2015a.

Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent multilabel classiﬁcation. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015b. URL https://proceedings.neurips.cc/paper/2015/ file/85f007f8c50dd25f5a45fca73cad64bd-Paper.pdf.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine Learning, 2:285 318, 1987.

Weiwei Liu and Ivor Tsang. On the optimality of classiﬁer chain for multi-label classiﬁcation. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/ 854d9fca60b4bd07f9bb215d59ef5561-Paper.pdf.

Andreas Maurer. A vector-contraction inequality for rademacher complexities. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19-21, 2016, Proceedings 27, pages 3 17. Springer, 2016.

Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural computation, 17(1):177 204, 2005.

Raman, Subedi, and Tewari

Omar Montasser, Steve Hanneke, and Nathan Srebro. Vc classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, pages 2512 2530. PMLR, 2019.

Jinseok Nam, Eneldo Loza Menc ıa, Hyunwoo J Kim, and Johannes F urnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classiﬁcation. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/ file/2eb5657d37f474e4c4cf01e4882b8962-Paper.pdf.

B. K. Natarajan. On learning sets and functions. Mach. Learn., 4(1):67 97, oct 1989. ISSN 0885-6125. doi: 10.1023/A:1022605311895. URL https://doi.org/10.1023/A: 1022605311895.

Junhyung Park and Krikamol Muandet. Towards empirical process theory for vector-valued functions: Metric entropy of smooth function classes. In Proceedings of The 34th International Conference on Algorithmic Learning Theory, 2023.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. J. Mach. Learn. Res., 16(1):155 186, 2015a.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability theory and related ﬁelds, 161: 111 153, 2015b.

C Radhakrishna Rao. The theory of least squares when the parameters are stochastic and its application to the analysis of growth curves. Biometrika, 52(3/4):447 458, 1965.

Henry Reeve and Ata Kaban. Optimistic bounds for multi-output learning. In International Conference on Machine Learning, pages 8030 8040. PMLR, 2020.

Mark Rudelson and Roman Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, pages 603 648, 2006.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.

Leslie G. Valiant. A theory of the learnable. In Symposium on the Theory of Computing, 1984.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. 1974.

Vladimir Naumovich Vapnik and Aleksei Yakovlevich Chervonenkis. On uniform convergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei i ee Primeneniya, 16(2):264 279, 1971.

Shuo Xu, Xin An, Xiaodong Qiao, Lijun Zhu, and Lin Li. Multi-output least-squares support vector regression machines. Pattern Recognition Letters, 34(9):1078 1084, 2013.

Multioutput Learnability

Liang Yang, Xi-Zhu Wu, Yuan Jiang, and Zhi-Hua Zhou. Multi-label learning with deep forest. In ECAI, volume 325 of Frontiers in Artiﬁcial Intelligence and Applications, pages 1634 1641. IOS Press, 2020.

Niloofar Youseﬁ, Yunwen Lei, Marius Kloft, Mansooreh Mollaghasemi, and Georgios C Anagnostopoulos. Local rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38):1 47, 2018.

Raman, Subedi, and Tewari

Appendix A. Complexity Measures

A.1 Complexity Measures for Batch Learning

In binary classiﬁcation, the Vapnik-Chervonenkis (VC) dimension of a function class characterizes its learnability.

Deﬁnition 24 (Vapnik-Chervonenkis Dimension). A set S = {x1, . . . , xd} is shattered by a binary function class H { 1, 1}d if for every σ { 1, 1}d, there exists a hypothesis hσ H such that for all i [d], we have hσ(xi) = σi. The VC dimension of H, denoted VC(H), is the size of the largest shattered set S X. If the size of the shattered set can be arbitrarily large, we say that VC(H) = .

The learnability of a multiclass function class is characterized by its Natarajan dimension.

Deﬁnition 25 (Natarajan Dimension). A set S = {x1, . . . , xd} is shattered by a multiclass function class H YX if there exist two witness functions f, g : S Y such that f(xi) = g(xi) for all i [d], and for every σ { 1, 1}d, there exists a function hσ H such that for all i [d], we have

( f(xi) if σi = 1 g(xi) if σi = 1 .

The Natarajan dimension of H, denoted Ndim(H), is the size of the largest shattered set S X. If the size of the shattered set can be arbitrarily large, we say that Ndim(H) = .

For real-valued regression problems, the learnability is characterized in terms of the fat-shattering dimension of a function class.

Deﬁnition 26 (Fat-Shattering Dimension). A real-valued function class G [0, 1]X shatters points S = {x1, x2, . . . , xd} at scale 1 > γ > 0, if there exists witness functions r : S [0, 1] such that, for every σ { 1}d, there exists gσ G such that i [d], σi(gσ(xi) r(xi)) γ. The fat-shattering dimension of G at scale γ, denoted fatγ(G), is the size of the largest set that can be γ-shattered by G. If the size of the shattered set can be arbitrarily large, then we say that fatγ(G) = .

We also deﬁne a general notion of complexity called Rademacher complexity that provides a suﬃcient and necessary condition of uniform convergence. Since uniform convergence implies learnability, we use Rademacher complexity to argue suﬃcient conditions for learnability.

Deﬁnition 27 (Empirical Rademacher Complexity). Let D be a distribution over X Y. For a bounded loss function ℓ, deﬁne the loss class to be ℓ F = {(x, y) 7 ℓ(f(x), y)}. If S = {(x1, y1), ..., (xn, yn)} be a set of i.i.d samples drawn from D, then the empirical Rademacher complexity of ℓ F is deﬁned as

Rn(ℓ F) = Eσ

i=1 σiℓ(f(xi), yi)

where σ { 1}n is a sequence of n i.i.d. Rademacher random variables.

Multioutput Learnability

A.2 Complexity Measures for Online Learning

For the complexity measures below, it is useful to deﬁne a Z-valued binary tree (Rakhlin et al., 2015b). A binary tree T of depth d is Z-valued if each of its internal nodes are labelled by elements of Z. Such a tree can be identiﬁed by a sequence (T1, ..., Td) labelling functions Ti : { 1}i 1 Z which provide labels for each internal node. A path of length d is given by a sequence σ = (σ1, ..., σd) { 1}d. Then, Ti(σ1, ..., σi 1) gives the label of node following the path (σ1, ..., σi 1) starting from the root, going right if σj = +1 and left if σj = 1. Note that, T1 Z is the label for the root node. For brevity, we slightly abuse notation by letting Ti(σ1, ..., σi 1) = Ti(σ<i), but it is understood that Ti only depends on the preﬁx (σ1, ..., σi 1). We are now ready to formally deﬁne complexity measures in the online setting. When Y = { 1, +1} is binary, the Littlestone Dimension (Littlestone (1987)) tightly characterizes the online learnability of a function class H YX with respect to the 0-1 loss.

Deﬁnition 28 (Littlestone Dimension). Let T denote a complete binary tree of depth d whose internal nodes are labeled by elements X and two edges from parent to child nodes are labeled by 1 and +1. The tree is shattered by a binary hypothesis class G { 1, +1}X

if for every σ { 1, +1}d, there exists a hypothesis gσ G such that the root to leaf path (x1, . . . , xd) obtained by taking left when σt = 1 and right when σt = +1 satisﬁes gσ(xt) = σt for all 1 t d. The Littlestone Dimension of G, denoted Ldim(G), is the maximal depth of the complete binary tree shattered by G. If G can shatter a tree of arbitrary depth, we say that Ldim(G) = .

For ﬁnite label spaces Y, the Multiclass Littlestone Dimension (Daniely et al., 2011) tightly characterizes the online learnability of a function class H YX with respect to the 0-1 loss.

Deﬁnition 29 (Multiclass Littlestone Dimension). Let T denote a X-valued binary tree of depth d whose edges are labelled by elements from Y, such that the edges from a single parent to its child-nodes are each labeled with a diﬀerent label. The tree T is shattered by a function class H YX if, for every path σ { 1}d, there is a function hσ H such that hσ(Ti(σ<i)) = y(σi), where y(σi) is the label of the edge between nodes (Ti(σ<i), Ti+1(σ<i+1)). The Multiclass Littlestone Dimension (MCLdim) of H, denoted MCLdim(H), is the maximal depth of a complete binary tree that is shattered by H. If MCLdim = , then there exists shattered trees of arbitrarily large depth.

When Y is a bounded subset of R, the sequential fat-shattering dimension (Rakhlin et al., 2015a) at scale γ characterizes the learnability of H [0, 1]X with respect to to the absolute loss d1.

Deﬁnition 30 (Sequential Fat-Shattering Dimension). Let T denote a X-valued binary tree of depth d. The tree T is γ-shattered by a function class H [0, 1]X if there exists an R-valued binary tree R of depth d such that for all σ { 1}d, there exists hσ H such that for all t [d],

σt(hσ(Tt(σ<t)) Rt(σ<t)) γ

Raman, Subedi, and Tewari

The tree R is called the witness to shattering. The sequential fat shattering dimension of H at scale γ, denoted fatseq γ (H), is the maximal depth of a complete binary tree that is γ-shattered by H. If there exists γ-shattered trees of arbitrarily large depth, then fatseq γ (H) = .

Beyond both ﬁnite and bounded label spaces, the sequential Rademacher complexity (Rakhlin et al., 2015a) provides a useful tool for giving suﬃcient conditions for learnability.

Deﬁnition 31 (Sequential Rademacher Complexity). Let σ = {σi}T i=1 be a sequence of independent Rademacher random variables. Let T be a Z-valued binary tree of depth d. The sequential Rademacher complexity of a function class H RZ on T is deﬁned as

Rseq T (H; T ) = Eσ { 1}n

t=1 σth(Tt(σ<t))

Then, the worst-case sequential Rademacher complexity is deﬁned as Rseq T (H) = sup T Rseq T (H; T ).

Appendix B. Natarajan Dimension Characterizes Batch Multilabel Learnability

A multilabel classiﬁcation problem where labels (i.e. bitstrings) in Y are of length K can also be viewed as multiclass classiﬁcation on the target space with 2K labels. Given this observation, the Natarajan dimension of the function class F continues to characterize the multilabel learnability with respect to any loss function ℓsatisfying the identity of indiscernibles.

Theorem 32 (Ben-David et al. (1995)). Let ℓbe any loss function satisfying the identity of indiscernibles. A function class F YX is agnostic learnable with respect to ℓin the batch setting if and only if Ndim(F) < .

The proof in Ben-David et al. (1995) involves arguments based on growth function. Here, we provide proof that uses realizable and agnostic learnability due to Hopkins et al. (2022).

Proof (of suﬃciency) We ﬁrst show that the ﬁniteness of Ndim(F) is suﬃcient for learnability. Suppose Ndim(F) < . Then, we know that F is agnostic learnable with respect to 0-1 loss (Ben-David et al., 1995). Since the target space Y as well as the range space of F is ﬁnite, for every loss ℓsatisfying the identity of indiscernibles, there exists an a > 0 such that a ℓ(h(x), y) 1{h(x) = y} for any (x, y) X Y and function h YX . Let D be a realizable distribution to F with respect to ℓ. Since ℓ(y1, y2) = 0 if and only if 1{y1 = y2} = 0, the distribution D is also realizable with respect to 0-1 loss. Since F is learnable with respect to 0-1 loss, there exists a learning algorithm A with the following property: for any ϵ, δ > 0, for a suﬃciently large S Dn, the algorithm outputs a predictor h = A(S) such that, with probability 1 δ over S Dn, we have ED[1{h(x) = y}] a ϵ. Using the inequality stated above pointwise, the predictor h also satisﬁes ED[ℓ(h(x), y)] ϵ. Therefore, A is also a realizable algorithm with respect to ℓ. Since ℓsatisﬁes the identity of indiscernible, Lemma 4 guarantees the existence of agnostic PAC learner B for F with

Multioutput Learnability

respect to ℓ.

Proof (of necessity) Suppose F is learnable with respect to ℓ. Since the target space is ﬁnite, there must exist a constant b > 0 such that 1{h(x) = y} b ℓ(h(x), y) for any (x, y) X Y and any function h YX . Let D be a realizable distribution with respect to ℓ. Due to the 0 alignment property, D is also realizable with respect to 0-1 loss. Since F is learnable with respect to ℓloss, there exists a learning algorithm A with the following property: for any ϵ, δ > 0, for a suﬃciently large S Dn, the algorithm outputs a predictor h = A(S) such that, with probability 1 δ over S Dn, we have ED[ℓ(h(x), y)] b ϵ. In particular, using the inequality stated above pointwise, we obtain ED[1{h(x) = y}] ϵ. Therefore, F is learnable with respect to 0-1 loss in the realizable setting. As the ﬁniteness of the Natarajan dimension is necessary for the learnability of F under the 0-1 loss (Natarajan, 1989), we must have Ndim(F) < .

Appendix C. Proofs for Batch Multioutput Regression

C.1 Proof of Suﬃciency in Theorem 9

Proof We ﬁrst prove that the agnostic learnability of each Fk is suﬃcient for the agnostic learnability of F. As in the classiﬁcation setting, the proof here is based on a reduction. That is, given oracle access to agnostic learners Ak for each Fk with respect to ψk d1 loss, we construct an agnostic learner A for F with respect to loss ℓ. Denote Dk to be the marginal distribution of D restricted to X Yk. Let us use mk(ϵ, δ) to denote the sample complexity of Ak. Then, for all k [K], the marginal samples Sk = {(xi, yk i )}n i=1 with scalar-valued targets are iid samples form Dk. For each k [K], deﬁne gk = Ak(Sk) to be the predictor returned by algorithm Ak when trained on Sk. Since Ak is an agnostic learner for Fk, we have that for sample size n maxk mk( ϵ

K ), with probability at least 1 δ/K over samples Sk Dn k,

EDk[ψk d1(gk(x), yk)] inf fk Fk EDk[ψk d1(fk(x), yk)] + ϵ

Summing these risk bounds over all k coordinates and union bounding over the success probabilities, we get that with probability at least 1 δ over samples S Dn,

k=1 EDk[ψk d1(gk(x), yk)]

k=1 inf fk Fk EDk[ψk d1(fk(x), yk)] + ϵ.

Using the fact that the sum of inﬁmums over individual coordinates is at most the overall inﬁmum of sums followed by the linearity of expectation, we can write the expression above as

k=1 ψk d1(gk(x), yk)

k=1 ψk d1(fk(x), yk)

This shows that the learning rule that runs Ak on marginal samples Sk and concatenates the resulting scalar-valued predictors to get a vector-valued predictor is an agnostic learner

Raman, Subedi, and Tewari

for F with respect to loss ℓwith sample complexity at most maxk mk(ϵ/K, δ/K). This completes our proof of suﬃciency.

C.2 Equivalence of d1 and ψ d1 Learnability in Batch Regression

In this section, we provide proof of Lemma 11, which establishes the equivalence of d1 and ψ d1 learnability in scalar-valued batch regression. Proof (of Lemma 11) To prove suﬃciency ﬁrst, let G be agnostically learnable with respect to d1. This implies that the fat-shattering dimension of G is ﬁnite at every scale (Bartlett et al., 1996) and uniform convergence holds over the loss class d1 G. Since ψ is a Lipschitz function, a simple application of Talagrand s contraction lemma on Rademacher complexity (Bartlett and Mendelson, 2003) implies that uniform convergence holds over the loss class ψ d1 G as well. Thus, G is learnable with respect to ψ d1 via ERM. Next, we show that if G [0, 1]X is learnable with respect to ψ d1, then G is learnable with respect to d1. Since the fat-shattering dimension of G characterizes d1 learnability of G, it suﬃces to show that G being learnable with respect to ψ d1 implies fatγ(G) < for every γ (0, 1). Suppose, for the sake of contradiction, G is learnable with respect to ψ d1 but there exists a scale γ (0, 1) such that fatγ(G) = . Then, for every d N, there exists X = {x1, . . . , xd} X and a witness function r : X [0, 1] such that for every σ { 1, 1}d, there exists a gσ G such that σi(gσ(xi) r(xi)) γ for all i [d]. Deﬁne GX = {gσ G | σ { 1, 1}d} be the set of functions that shatters X. Deﬁne H = { 1, 1}X

to be a set of all functions from X to { 1, 1}. By deﬁnition of H, we must have VC(H) = d. We use an agnostic learner for G with respect to ψ d1 to construct an agnostic learner for H whose sample complexity, for large enough d, is smaller than the known lower bound for VC classes. Since fatγ(G) = , d can be made arbitrarily large and thus we derive a contradiction. Let A be the promised agnostic learner for G with respect to ψ d1 with sample complexity m(ϵ, δ). For all f [0, 1]X, deﬁne a threshold function hf : X { 1, 1} as hf(x) = 2 1{f(x) r(x)} 1. Let D be an arbitrary distribution on X { 1, 1} and DX be its marginal on X.

Algorithm 7 Agnostic PAC learner for H Input: Agnostic learner A for G, unlabeled samples SU DX, and another independent labeled samples SL D

1 Deﬁne Saug = {(SU, gα(SU)) | g GX}, all possible augmentations of SU by α-discretization of functions in GX for α γ/2.

2 Run A over all possible augmentations to get C(SU) := A S | S Saug .

3 Deﬁne C 1(SU) = {hf | f C(SU)}, a thresholded class of C(SU).

4 Return the predictor in C 1(SU) with the lowest empirical 0-1 risk over SL.

We now show that Algorithm 7 is an agnostic learner for H. Consider d SU + SL. Then, |C 1(SU)| = |Saug| (2/α)|SU| can be much smaller than 2d. Let h := arg minh H ED[1{h(x) = y}] be the optimal hypothesis for D. Note that, by deﬁnition of

Multioutput Learnability

shattering, for every h H, there exists a g GX such that h(x) = hg(x) for all x X. In particular, there must exist g GX such that h (x) = hg (x) := 2 1{g (x) r(x)} 1 for all x X. Let gα denote the α-discretization of g as deﬁned in Equation (1) for some α γ/2. Now, consider a sample (SU, g ,α(SU)) Saug. Let ˆg = A((SU, g ,α(SU))) be the predictor returned by the algorithm when run on a sample labeled by g ,α. Deﬁne hˆg = 2 1{ˆg(x) r(x)} 1 to be its thresholded function. Then, using the triangle inequality on the indicator function, we have

ED[1{hˆg(x) = y}] ED[1{h (x) = y}] + EDX[1{hˆg(x) = h (x)}]. (2)

Note that 1{hˆg(x) = h (x)} = 1{hˆg(x) = hg (x)} 1{|ˆg(x) g (x)| γ}. To see why the last inequality is true, we only have to consider the case where the indicator on the left is 1, otherwise, the inequality is trivial. Recall that 1{hˆg(x) = hg (x)} = 1 whenever ˆg(x) and g (x) lie on the opposite side of witness r(x). Since g has to be at least γ away from the witness r(x), we obtain 1{|ˆg(x) g (x)| γ} = 1. Next, using the fact that α γ/2, we have 1{|ˆg(x) g (x)| γ} 1{|ˆg(x) g ,α(x)| γ/2} because discretization can decrease the distance between these functions by at most γ/2. Furthermore, using monotonicity of ψ, we get 1{|ˆg(x) g ,α(x)| γ/2} 1{ψ(|ˆg(x) g ,α(x)|) ψ(γ/2)} . Combining everything, we get a pointwise inequality

1{hˆg(x) = h (x)} 1{ψ(|ˆg(x) g ,α(x)|) ψ(γ/2)} 1 ψ(γ/2)ψ(|ˆg(x) g ,α(x)|).

Using this inequality gives an upperbound on the risk of hˆg, namely

EDX[1{hˆg(x) = h (x)}] 1 ψ(γ/2) ED[ψ(|ˆg(x) g ,α(x)|)]. (3)

Since ˆg = A((SU, g ,α(SU))), we can use the algorithm s guarantee to get a further upperbound on the expectation above. In particular, if |SU| m(ϵψ(γ/2)

4 , δ/2), then with probability at least 1 δ/2 over sampling SU DX, we have

EDX[ψ(|ˆg(x) g ,α(x)|)] inf g G EDX[ψ(|g(x) g ,α(x)|)] + ϵψ(γ/2)

Note that infg G EDX[ψ(|g(x) g ,α(x)|)] EDX[ψ(|g (x) g ,α(x)|)] ψ(α), where the last step uses the fact that |g (x) g ,α(x)| α and ψ is monotonic. Using L-Lipschitzness of ψ and the fact that ψ(0) = 0, we get ψ(α) Lα. Picking α = min(γ/2, ϵψ(γ/2)

4L ), we get infg G EDX[ψ(|g(x) g ,α(x)|)] ϵψ(γ/2)

4 . Plugging this back to the inequality in the display above, we get to EDX[ψ(|ˆg(x) g ,α(x)|)] ϵψ(γ/2)

2 . Using this guarantee on (3), we obtain

EDX[1{hˆg(x) = h (x)}] ϵ

This bound applied to (2) yields

ED[1{hˆg(x) = y}] inf h H ED[1{h(x) = y}] + ϵ

Raman, Subedi, and Tewari

Thus, we have shown the existence of a predictor hˆg C 1(SU) that achieves agnostic PAC bounds for H. Let ˆh be the predictor returned by step 4 of the algorithm. Next, we show that for suﬃciently large SL, the predictor ˆh also attains agnostic PAC bounds. Recall that by Hoeﬀding s Inequality and union bound, with probability at least 1 δ/2, the empirical risk of every hypothesis in C 1(SL) on a sample of size 8

ϵ2 log 4|C 1(SU)|

at most ϵ/4 away from its true error. So, if |SL| 8 ϵ2 log 4|C 1(SU)|

δ , then with probability at least 1 δ/2, the empirical risk of the predictor hˆg(x) is

(x,y) SL 1{hˆg(x) = y} ED[1{hˆg(x) = y}] + ϵ

4 inf h H ED[1{h(x) = y}] + 3ϵ

where the last inequality follows from the risk guarantee of hˆg established above. Since ˆh is the empirical risk minimizer over SL, we must have

(x,y) SL 1{ˆh(x) = y} 1 |SL|

(x,y) SL 1{hˆg(x) = y} inf h H ED[1{h(x) = y}] + 3ϵ

Finally, as the population risk of ˆh is at most ϵ/4 away from its empirical risk, we have

ED[1{ˆh(x) = y}] inf h H ED[1{h(x) = y}] + ϵ,

which is the agnostic PAC guarantee for H. Applying union bounds, the entire process, running algorithm A on the dataset augmented by g ,α and the ERM in step 4, succeeds with probability 1 δ. This establishes that the Algorithm 7 is an agnostic PAC learner for H. The sample complexity of Algorithm 7 is the number of samples required for Algorithm A to succeed and the ERM in step 4 to succeed. Thus, the overall sample complexity of Algorithm 7, denoted m H(ϵ, δ), can be bounded as

m H(ϵ, δ) m A

ϵ2 log 4|C 1(SU)|

ϵ2 log 2 min(γ/2, ϵψ(γ/2)/4)

where the second inequality follows because |C 1(SU)| = |Saug| (2/α)|SU| and we need

|SU| to be of size m A ϵ ψ(γ/2)

2 . We also use the fact that α = min(γ

2 ) 4 ). However, it is well known (Shalev-Shwartz and Ben-David, 2014, Theorem 6.8) that the sample complexity of learning H in agnostic setting is

C d + log(2/δ)

for some C > 0. Thus, we must have m H(ϵ, δ) C(d + log(2/δ))/ϵ2. However, this is a contradiction because d can be arbitrarily large but m H(ϵ, δ) must have a ﬁnite upper bound for every ﬁxed ϵ, δ. Therefore, the function class G cannot be learnable with respect to ψ d1 whenever there exists a scale γ (0, 1) such that fatγ(G) = .

Multioutput Learnability

Appendix D. Rademacher Based Proof for Batch Regression

To show that the learnability of each Fk with respect to d1 is suﬃcient for the learnability of F with respect to ℓp norms for p 1, we use the fact that ℓp(f(x), y) is a KLipschitz in its ﬁrst argument with respect to norm, that is |ℓp(f(x), y) ℓp(g(x), y)| K f(x) g(x) , and use the following bound on the Rademacher complexity of the loss class ℓ F = {(x, y) 7 ℓ(f(x), y)} | f F}.

Lemma 33 (Foster and Rakhlin (2019)). Let F YX be a multioutput function class. For any δ (0, 1), there exists a constants 0 < c < 1 and C > 0 such that

Rn(ℓp F) K inf α>0

fatcϵ(Fk) log1+δ e n

The result presented here is in fact the intermediate result in Foster and Rakhlin (2019), and we provide a sketch of how their argument can be adapted to our setting. Proof Note that for f, g F, we have |ℓp(f(x), y) ℓp(g(x), y)| |ℓp(f(x), g(x))| ℓ1(f(x), g(x)) K f(x) g(x) . Furthermore, we have that |fk(x) yk| 1, so we obtain |ℓp(f(x), y)| K. Deﬁne the normalized ℓp loss as ℓp(f(x), y) := ℓp(f(x), y)/K. By standard chaining argument, we know that

Rn(ℓ F) = K Rn( ℓ F) K inf α>0

log N2( ℓp F, ϵ, n)dϵ .

Since a cover with || || norm is also a cover with respect to || ||2 norm, we have that log N2( ℓp F, ϵ, n) log N ( ℓ F, ϵ, n). Since ℓp(f(x), y) is 1 Lipschitz with respect to || || norm, following Lemma 1 of Foster and Rakhlin (2019), we obtain log N ( ℓp F, ϵ, n) PK k=1 log N (Fk, ϵ, n). A result due to Rudelson and Vershynin (2006) states that for any δ (0, 1), there exists constants 0 < ck < 1 and Ck > 0 such that

log N (Fk, ϵ, n) Ck fatckϵ(Fk) log1+δ (en/ϵ).

Picking C = maxk Ck and c = mink ck, we obtain the contraction inequality

Rn(ℓ F) K inf α>0

4α + 12 C n

k=1 fatcϵ(Fk) log1+δ e n

Using q PK k=1 fatcϵ(Fk) log1+δ e n

fatcϵ(Fk) log1+δ e n

ϵ yields the desired contraction inequality.

With Lemma 33 in our repertoire, the suﬃciency proof is a routine uniform convergence argument. Proof (of suﬃciency in Theorem 12) Suppose each restriction Fk is learnable with respect to d1. Then, we know that for all k [K] and for all 1 > γ > 0, we have fatγ(Fk) <

Raman, Subedi, and Tewari

(Bartlett et al., 1996), (Anthony and Bartlett, 1999, Chatper 19)). Using Lemma 33, for δ = 1/2, we can ﬁnd constants c, C such that

Rn(ℓp F) K inf α>0

fatcϵ(Fk) log3/2 e n

Fix α > 0. The second term inside inﬁmum vanishes as n , yielding Rn(ℓ F) 4αK. As α > 0 is arbitrary, the Rademacher complexity Rn(ℓ F) goes to 0 as n . This argument can be readily turned into non-asymptotic bounds on Rn(ℓp F) if the precise form of fatγ(Fk) as a function of γ is known. Since the empirical Rademacher complexity vanishes, uniform convergence holds over the loss class ℓp F and thus F is learnable with respect to ℓp via empirical risk minimization.

Appendix E. Online Multilabel Learnability with respect to Hamming Loss

In this section, we provide the proof of Theorem 15. Proof We ﬁrst prove that the online learnability of each restriction is suﬃcient for the online learnability of ℓH. Part 1: Suﬃciency. Our proof is based on a reduction: given oracle access to online learners {Ak}K k=1 for {Fk}K k=1 with respect to ℓ0-1, we construct an online learner A for F with respect to ℓH. In fact, similar to the batch setting, the online multilabel learning algorithm A is simple: in each round t [T], receive xt, query the predictions A1(xt), ..., AK(xt), and ﬁnally predict the concatenation ˆyt = (A1(xt), ..., AK(xt)). Once the true label yt = (y1 t , ..., y K t ) is revealed, update each online learner Ak by passing (xt, yk t ) for k [K]. It suﬃces to show that the expected regret of A is sublinear in T with respect to ℓH. By Deﬁnition 5, we have that for all k [K],

t=1 1{Ak(xt) = yk t } inf fk Fk

t=1 1{fk(xt) = yk t }

where Rk(T) is some sublinear function in T. Summing the regret bounds across all k [K] splitting up the expectations, and using linearity of expectation, we get that E h PK k=1 PT t=1 1{Ak(xt) = yk t } i E h PK k=1 inffk Fk PT t=1 1{fk(xt) = yk t } i PK k=1 Rk(T).

Noting that PK k=1 inffk Fk PT t=1 1{fk(xt) = yk t } inff F PK k=1 PT t=1 1{fk(xt) = yk t }, swapping the order of summations, and using the deﬁnition of ℓH we have that,

t=1 ℓH(A(xt), yt)

t=1 ℓH(f(xt), yt)

where A(xt) = (A1(xt), ..., AK(xt)). This concludes the proof of this direction since PK k=1 Rk(T) is still a sublinear function in T. Part 2: Necessity. Next we prove that if F is online learnable with respect to to ℓH, then each Fk is online learnable with respect to ℓ0-1. Namely, given oracle access to an online

Multioutput Learnability

learner A for F with respect to ℓH, we construct an online learner B for F1 with respect to ℓ0-1. A similar reduction can be used to construct online learners for each restriction Fk. Similar to the batch setting, the online learning algorithm B is simple: in each round t [T], receive xt, query ˆyt = A(xt) and predict ˆy1 t = A1(xt). Once the true label y1 t is revealed, update A by passing (xt, yt) where yt = (y1 t , σ2 t , ..., σK t ) and {σi t}K i=2 is an i.i.d sequence of Rademacher random variables. It suﬃces to show that the expected regret of B is sublinear in T with respect to ℓ0-1. As previously mentioned, we assume that the sequence (x1, y1 1), ..., (x T , y1 T ) is chosen by an oblivious adversary, and thus is not random. Let yt = (y1 t , σ2 t , ..., σK t ). By Deﬁnition 5, we have that,

t=1 ℓH(A(xt), yt) inf f F

t=1 ℓH(f(xt), yt)

where the expectation is over both the randomness of A(xt) and (σ2 t , ..., σK t ) and R(T, 2K) is a sub-linear function of T. Splitting up the expectation, using the deﬁnition of the Hamming loss, and by the linearity of expectation, we have that

k=1 E h 1{Ak(xt) = yk t } i inf f F

k=1 E h 1{fk(xt) = yk t } i R(T, 2K).

Next, observe that for every t [T], for every k {2, ..., K}, the randomness of yk t = σk t implies E 1{Ak(xt) = yk t } = E 1{f(xt) = yk t } = 1

t=1 E 1{A1(xt) = y1 t } + K 1

t=1 E 1{f1(xt) = y1 t } + K 1

Canceling constant factors gives, E h PT t=1 1{A1(xt) = y1 t } i inff1 F1 PT t=1 1{f1(xt) = y1 t }

R(T, 2K), showing that B is an online agnostic learner for F1 with respect to ℓ0-1.

Appendix F. MCLdim Characterizes Online Multilabel Learnability

In this section, we show that the MCLdim characterizes the online learnability of a multilabel function class F YX with respect to to any loss ℓthat satisﬁes the identity of indiscernibles. Theorem 34 makes this more precise.

Theorem 34. Let ℓbe any loss function satisfying the identity of indiscernibles. A function class F YX is online learnable with respect to ℓif and only if MCLdim(F) < .

Proof (of suﬃciency) We ﬁrst show that ﬁniteness of MCLdim is suﬃcient for online learnability. The proof follows exactly like the proof of Lemma 16. We include it here again for completeness sake. Let ℓbe any loss function satisfying the identity of indiscernibles and F YX be a multilabel function class such that MCLdim(F) = d < . Since F has ﬁnite MCLdim, the deterministic Multiclass Standard Optimal Algorithm for F ,

Raman, Subedi, and Tewari

hereinafter denoted MCSOA(F), achieves mistake-bound d in the realizable setting (Daniely et al., 2011). Therefore, following the same procedure as in Daniely et al. (2011), we can construct a ﬁnite set of experts E of size |E| = Pd j=0 T j |im(F)|j (2KT)d such that for any (oblivious) sequence of instances x1, ..., x T , for any function f F, there exists an expert Ef E, such that f(xt) = E(xt) for all t [T]. Finally, running the celebrated Randomized Exponential Weights Algorithm (REWA) using E as the set of experts and the scaled loss function ℓ

B [0, 1] guarantees that for any labelled sequence (x1, y1), ..., (x T , y T ),

t=1 ℓ(ˆyt, yt) inf E E

t=1 ℓ(E(xt), yt)

t=1 ℓ(ˆyt, yt) inf f F

t=1 ℓ(f(xt), yt)

T ln(|E|) O B p

where ˆyt is the prediction of REWA in the t th round. Thus, running REWA over the set of experts E using ℓ B gives an online learner for F with respect to ℓ.

Proof (of necessity) To prove necessity, we need to show that if F is online learnable with respect to ℓ, then MCLdim(F) < . To do so, we show that if F is online learnable with respect to ℓ, then F is online learnable with respect to ℓ0-1 in the realizable setting. Let A be an online learner for F with respect to ℓ. Then, by deﬁnition,

t=1 ℓ(A(xt), yt) inf f F

t=1 ℓ(f(xt), yt)

where R(T, 2K) is a sublinear function in T. Since ℓsatisﬁes the identity of indiscernibles, in the realizable setting, inff F PT t=1 ℓ(f(xt), yt) = 0. Therefore, under realizability,

t=1 ℓ(A(xt), yt)

Because there are only a ﬁnite number of inputs to ℓ, there must exist a universal constant a such that aℓ0-1 ℓ. Substituting in gives that,

t=1 ℓ0-1(A(xt), yt)

Since a is a universal constant that does not depend on T, R(T,2K)

a is still a sublinear function in T, implying that A is also a realizable online learner for F with respect to ℓ0-1. This completes the proof as MCLdim characterizes realizable learnability and so we must have MCLdim(F) < .

Multioutput Learnability

Appendix G. Proofs for Bandit Online Multilabel Classiﬁcation

In this section, we provide proofs for the characterization of online multilabel classiﬁcation under bandit feedback. Proof (of Theorem 19) The proof of Theorem 19 is nearly identical to the proof of Theorem 13. The only diﬀerence is that in Algorithm 4, we now need use the bandit Expert s algorithm EXP4 from Auer et al. (2002) instead of REWA. Similar to REWA, based on Theorem 2.3 in Daniely and Helbertal (2013) and the fact that A, B and P are independent, EXP4 guarantees that

t=1 ℓ(P(xt), yt)

t=1 ℓ(E(xt), yt)

2T|Y| ln(|EB|) i ,

where P(xt) denotes EXP4 s prediction in round t. The remaining proof for deriving the upper bound

t=1 ℓ(E(xt), yt)

t=1 ℓ(f(xt), yt) + c T

T β R(T β, |Y|)

is identical to that in Theorem 13, so we omit it here. Putting these pieces together gives the stated guarantee.

Proof (of Theorem 18) Let c = maxr =t ℓ(r,t)

minr =t ℓ(r,t) . We ﬁrst show necessity: if F is bandit online learnable with respect to ℓ, then each restriction Fk is online learnable with respect to ℓ0-1. This follows trivially from the fact that if A is a bandit online learner for F, then A is also an online learner for F under full-feedback. Thus, by Theorem 17, online learnability of F with respect to ℓimplies online learnability of restriction Fk with respect to the 0-1 loss. We now focus on showing suﬃciency: if for all k [K], Fk is online learnable with respect to 0-1 loss, then F is bandit online learnable with respect to loss ℓ. Since |Y| = 2K < and ℓis a c-subadditive, by Theorem 19, it suﬃces to show that there exists a realizable online learner for F with respect to ℓ. However, using Theorem 17, online learnability of each restriction Fk with respect to 0-1 the loss implies (agnostic) online learnability of F with respect to ℓ. Since an agnostic online learner is trivially a realizable online learner, the proof is complete.

Appendix H. Equivalence of d1 and ψ d1 Online Learnability

Proof (of Lemma 22) Since ψ is a Lipschitz function, the proof of suﬃciency follows immediately from Corollary 5 in Rakhlin et al. (2015a), a contraction Lemma for the sequential Rademacher complexity. Thus, we focus on proving necessity - if G is online learnable with respect to ψ d1, then G is online learnable with respect to d1. Since the sequential fat shattering dimension of G characterizes d1 learnability (Rakhlin et al., 2015a), it suﬃces to show that G being online learnable with respect to ψ d1 implies fatseq γ (G) < 0 for every γ (0, 1). Like in the batch setting, we prove this via contradiction.

Raman, Subedi, and Tewari

Suppose, for the sake of contradiction, G is online learnable with respect to ψ d1 but there exists a scale γ (0, 1) such that fatseq γ (G) = . Then, for every T N, there exists a X-valued binary tree T and a [0, 1]-valued binary witness tree R both of depth T such that for all σ { 1, 1}T , there exists gσ G such that for all t [T], σt(gσ(Tt(σ<t)) Rt(σ<t)) γ. Without loss of generality, assume that for any path σ { 1, 1}T , the set of instances {Tt(σ<t)}T t=1 are distinct. This is true because we can construct a γ-shattered tree of much bigger depth and prune away repeated instances along a path to get a tree of depth T. Deﬁne GT = {gσ G|σ { 1, 1}T } to be the set of functions that shatter T with witness R. Let X X denote the set of examples that label the internal nodes of T . Consider the binary hypothesis class H = { 1, 1}X which contains all possible functions from X to { 1, 1}. By deﬁnition of H, we must have Ldim(H) T. Therefore, T , with left and right edges labeled by 1 and +1 respectively, is shattered by H. Let T denote such a tree. Note that for all t [T], we have Tt(σ<t) = T ,t(σ<t). Since Ldim(H) T, any realizable online learner for H must make at least T

2 mistakes in expectation for an adversary that plays according to a root-to-leaf path in T chosen uniformly at random. However, using an agnostic online learner for G with respect to ψ d1, we construct a realizable online learner for H that achieves a sublinear regret bound when an adversary plays according to a root-to-leaf path in T chosen uniformly at random. Since fatseq γ (G) = , T can be made arbitrarily large, eventually giving us a contradiction.

To that end, let A be an online learner for G with respect to ψ d1 with regret RA(T). Let σ { 1, 1}T denote a sequence of T i.i.d Rademacher random variables and {(T ,t(σ<t), σt)}T t=1 the associated sequence of labeled instances determined by traversing T using σ. Note that {(T ,t(σ<t), σt)}T t=1 is a sequence of labeled instances corresponding to a root-to-leaf path in T chosen uniformly at random. By construction of H, there exists a h σ H such that h σ(T ,t(σ<t)) = σt for all t [T] and therefore the stream is realizable by H. Let g σ G be the function at the end of the root-to-leaf path corresponding to σ in T , the original tree shattered by G.

We now use A to construct an agnostic online learner for H with sublinear regret on the stream {(T ,t(σ<t), σt)}T t=1. Our algorithm is very similar to realizable-to-agnostic conversion in Theorem 13. Namely, we construct a ﬁnite set of experts, each of which uses A to make predictions, but only updates A on certain rounds. Finally, we run REWA using this set of Experts over our stream. For completeness sake, we provide the full description below.

For any bitstring b {0, 1}T , let φ : {t : bt = 1} im(Gα) denote a function mapping time points where bt = 1 to elements in the discretized image space im(Gα). Let Φb : (im(Gα)){t:bt=1} denote all such functions φ. For every g G, let φg b Φb be the mapping such that for all t {t : bt = 1}, φg b(t) = gα(Tt(σ<t)). Let |b| = |{t : bt = 1}|. For every b {0, 1}T and φ Φb, deﬁne an Expert Eb,φ. Expert Eb,φ, formally presented in Algorithm 8, uses A to make predictions in each round. However, Eb,φ only updates A on those rounds where bt = 1, using φ to produce a labeled instance (Tt(σ<t), φ(t)). For every b {0, 1}T , let Eb = S φ Φb{Eb,φ} denote the set of all Experts parameterized by functions φ Φb. If b is the all zeros bitstring, then Eb is empty. Therefore, we actually deﬁne Eb = {E0} S φ Φb{Eb,φ}, where E0 is the expert that never updates A. Note that 1 |Eb| ( 2

Multioutput Learnability

Algorithm 8 Expert(b, φ) Input: Independent copy of Online Learner A for ψ d1 1 for t = 1, ..., T do

2 Receive example T ,t(σ<t)

3 Predict ˆyt = 2 1{A(T ,t(σ<t)) Rt(σ<t)} 1

4 Receive yt = σt 5 if bt = 1 then

6 Update A by passing (T ,t(σ<t), φ(t))

With this notation in hand, we are now ready to present Algorithm 9, our main online learner Q for H with respect to 0-1 loss. The analysis is similar to the one before, but we include it below for completeness sake.

Algorithm 9 Online learner Q for H with respect to 0-1 loss Input: Parameters 0 < β < 1 and 0 < α < γ

1 Let B {0, 1}T such that Bt iid Bernoulli(T β

2 Construct the set of experts EB = {E0} S φ ΦB{EB,φ} according to Algorithm 8.

3 Run REWA P using EB and the 0-1 loss over the stream (T ,1(σ<1), σ1), ..., (T ,T (σ<T ), σT )

Our goal now is to show that Q enjoys sublinear expected regret. There are three main sources of randomness: the randomness involved in sampling B, the internal randomness of each independent copy of the online learner A, and the internal randomness of REWA. Let B, A and P denote the random variable associated with these sources of randomness respectively. By construction, A, B, and P are independent. Using Theorem 21.11 in Shalev-Shwartz and Ben-David (2014) and the fact that A, B and P, are independent, REWA guarantees,

t=1 1{P(T ,t(σ<t)) = σt}

t=1 1{E(T ,t(σ<t)) = σt}

2T ln(|EB|) i .

t=1 1{Q(T ,t(σ<t)) = σt}

t=1 1{P(T ,t(σ<t)) = σt}

t=1 1{E(T ,t(σ<t)) = σt}

2T ln(|EB|) i

t=1 1{EB,φg σ B (T ,t(σ<t)) = σt}

2T ln(|EB|) i .

In the last step, we used the fact that for all b {0, 1}T and g G, Eb,φg b Eb.

Raman, Subedi, and Tewari

It now suﬃces to upperbound E PT t=1 1{EB,φg σ B (T ,t(σ<t)) = σt} . We use the same

notation used to prove Theorem 13, but for the sake of completeness, we restate it here. Given an online learner A for ψ d1, an instance x X, and an ordered sequence of labeled examples L (X [0, 1]) , let A(x|L) be the random variable denoting the prediction of A on the instance x after running and updating on L. For any b {0, 1}T , gα Gα, and t [T], let Lg b<t = {(T ,t(σ<i), gα(T ,t(σ<i))) : i < t and bi = 1} denote the subsequence of the sequence of labeled instances {(T ,t(σ<i), gα(T ,t(σ<i)))}t 1 i=1 where bi = 1. Using this notation, we can write

t=1 1{EB,φg σ B (T ,t(σ<t)) = σt}

t=1 1 n 2 1{A(T ,t(σ<t)|Lg σ B<t) Rt(σ<t)} 1 = σt o#

t=1 1 n 2 1{A(T ,t(σ<t)|Lg σ B<t) Rt(σ<t)} 1 = h σ(T ,t(σ<t)) o#

t=1 1 n |A(T ,t(σ<t)|Lg σ B<t) g σ(T ,t(σ<t))| γ o#

t=1 1 n |A(T ,t(σ<t)|Lg σ B<t) g ,α σ (T ,t(σ<t))| γ

t=1 1 n ψ d1(A(T ,t(σ<t)|Lg σ B<t), g ,α σ (T ,t(σ<t))) ψ(γ

t=1 ψ d1(A(T ,t(σ<t)|Lg σ B<t), g ,α σ (T ,t(σ<t)))

The ﬁrst inequality follows from γ-shattering. Indeed, if 21{A(T ,t(σ<t)|Lg σ B<t) Rt(σ<t)}

1 = h σ(T ,t(σ<t)), then A(T ,t(σ<t)|Lg σ B<t) and g σ must lie on opposite sides of the witness Rt(σ<t). The second inequality stems from the choice of α < γ 2. The third inequality follows from the monotonicity of ψ. The last inequality follows from Markov s. Now, we can continue like before.

t=1 ψ d1(A(T ,t(σ<t)|Lg B<t), g ,α σ (T ,t(σ<t)))

t=1 ψ d1(A(T ,t(σ<t)|Lg B<t), g ,α σ (T ,t(σ<t)))P [Bt = 1]

t=1 ψ d1(A(T ,t(σ<t)|Lg B<t), g ,α σ (T ,t(σ<t)))1{Bt = 1}

Multioutput Learnability

To see the last equality, note that the prediction A(T ,t(σ<t)|Lg B<t) only depends on bitstring (B1, . . . , Bt 1), the string (σ1, ..., σt 1), and the internal randomness of A, all of which are independent of Bt. Thus, we have

E h ψ d1(A(T ,t(σ<t)|Lg B<t), g ,α σ (T ,t(σ<t)))1{Bt = 1} i = E h ψ1 d1(A1(xt|Lg σ B<t), yt) i E [1{Bt = 1}]

= E h ψ1 d1(A1(xt|Lg σ B<t), yt) i P[Bt = 1]

as needed. Continuing onwards,

t=1 ψ d1(A(T ,t(σ<t)|Lg σ B<t), g ,α σ (T ,t(σ<t)))

t=1 ψ d1(A(T ,t(σ<t)|Lg σ B<t), g ,α σ (T ,t(σ<t)))1{Bt = 1}

t:Bt=1 ψ d1(A(T ,t(σ<t)|Lg σ B<t), g ,α σ (T ,t(σ<t)))

t:Bt=1 ψ d1(A(T ,t(σ<t)|Lg σ B<t), g ,α σ (T ,t(σ<t))) B

t:Bt=1 ψ d1(g σ(T ,t(σ<t)), g ,α σ (T ,t(σ<t))) + RA(|B|)

The last inequality follows from the fact that A is an online learner for ψ d1 with regret bound RA(T) and is updated using a stream labeled by g ,α only when Bt = 1. Now, we can upperbound:

t:Bt=1 ψ d1(g (T ,t(σ<t)), g ,α σ (T ,t(σ<t)))

T β E [RA(|B|)]

t:Bt=1 ψ(α)

T β E [RA(|B|)]

T β E [αL|B|] + T

T β E [RA(|B|)]

T β E [RA(|B|)]

Raman, Subedi, and Tewari

The ﬁrst two inequalities follow from the fact that ψ is monotonic, L-Lipschitz, ψ(0) = 0, and d1(g (T ,t(σ<t)), g ,α σ (T ,t(σ<t)) α. Putting things together, we ﬁnd that

t=1 1{Q(T ,t(σ<t)) = σt}

t=1 1{EB,φg σ B (T ,t(σ<t)) = σt}

2T ln(|EB|) i

T β E [RA(|B|)]

2T ln(|EB|) i

T β E [RA(|B|)]

2T|B| ln( 2

where the last inequality follows from the fact that that |EB| ( 2

α)|B|. By Jensen s inequal-

ity, we further get that, E hq

2T|B| ln( 2

2T β+1 ln( 2

α), which implies that

t=1 1{Q(T ,t(σ<t)) = σt}

T β E [RA(|B|)]

2T β+1 ln( 2

Next, by Lemma 14, there exists a concave sublinear function RA(|B|) that upperbounds RA(|B|). By Jensen s inequality, we obtain E[RA(|B|)] RA(T β), which yields

t=1 1{Q(T ,t(σ<t)) = σt}

T β RA(T β) ψ(γ

2T β+1 ln( 2

Picking α = 1 LT and β (0, 1), gives that Q enjoys sublinear expected regret

t=1 1{Q(T ,t(σ<t)) = σt}

2)T β RA(T β) + q

4T β+1 ln(LT).

Since RA(T β) is sublinear in T β, Q is a realizable online learner for H with sublinear regret. Thus, for a suﬃciently large T, E h PT t=1 1{Q(T ,t(σ<t)) = σt} i < T

2 . This is a contradiction

because {(T ,t(σ<t), σt)}T t=1 is a realizable sequence of instances corresponding to a rootto-leaf path in T chosen uniformly at random and thus any realizable online learner must suﬀer expected regret at least T

2 . Thus, if there exists a scale γ > 0 such that fatseq γ (G) = , there cannot exist an online learner for G with respect to ψ d1.