# Multi-task Learning with Labeled and Unlabeled Tasks

Anastasia Pentina, Christoph H. Lampert (IST Austria; correspondence to Anastasia Pentina)

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

**Abstract.** In multi-task learning, a learner is given a collection of prediction tasks and needs to solve all of them. In contrast to previous work, which required that annotated training data be available for all tasks, we consider a new setting in which, for some tasks, potentially most of them, only unlabeled training data is provided. Consequently, to solve all tasks, information must be transferred between tasks with labels and tasks without labels. Focusing on an instance-based transfer method, we analyze two variants of this setting: when the set of labeled tasks is fixed, and when it can be actively selected by the learner. We state and prove a generalization bound that covers both scenarios and derive from it an algorithm for making the choice of labeled tasks (in the active case) and for transferring information between the tasks in a principled way. We also illustrate the effectiveness of the algorithm by experiments on synthetic and real data.

1. Introduction

In the multi-task learning setting (Caruana, 1997) a learner is given a collection of prediction tasks that all need to be solved. The hope is that the overall prediction quality can be improved by processing the tasks jointly and sharing information between them. Indeed, theoretical as well as experimental studies have shown that information transfer can reduce the number of annotated examples per task needed to achieve good performance, under various assumptions on how the learning tasks are related. All existing multi-task learning approaches have in common, however, that they need at least some labeled training data for every task of interest. In this paper, we study a new and more challenging setting, in which for a subset of the tasks (typically the large majority) only unlabeled data is available.

In practice, it is highly desirable to be able to handle this situation for problems with a very large number of tasks, such as sentiment analysis for market studies: for different products different attributes matter and, thus, each product should have its own predictor and forms its own learning task. At the same time, annotating data for each such task is prohibitive, especially when new products are constantly added to the market. Another example are prediction problems for which the fixed cost of obtaining any labels for a task is high, even when the variable cost per label is reasonable. This is a well-known issue when using crowd sourcing for data annotation: recruiting and training annotators first imposes a large overhead; only afterwards can many labels be obtained within a short time and at a low cost.

A distinctive feature of the setting we study is that it requires two types of information transfer: between the labeled tasks, and from labeled to unlabeled tasks. While the first type is common in multi-task learning, none of the existing multi-task methods is able to handle the second type. In contrast, information transfer from labeled to unlabeled tasks is commonly studied in domain adaptation research, where, however, transfer of the first type is typically not considered.
Thus, the setting of multi-task learning with labeled and unlabeled tasks can be seen as a blend of traditional multi-task learning and domain adaptation.

In this work we focus on a transfer method that learns a predictor for every task of interest by minimizing a task-specific convex combination of training errors on the labeled tasks (Ben-David et al., 2007; Mansour et al., 2009). We choose this method because it allows us to capture both types of information transfer, between the labeled tasks and from labeled to unlabeled ones, in a unified fashion. Clearly, the success of this approach depends on the choice of the weights in the convex combinations. Moreover, one can expect it to depend on the subset of labeled tasks as well, because some subsets of tasks might be more informative and representative than others. This suggests that it will be beneficial if the labeled subset is not arbitrary but can be chosen in a data-dependent way. We refer to this learning scenario, where initially every task is represented only by a set of unlabeled examples and the learner can choose for which tasks to request labels, as active task selection.

Our main result is a generalization bound that quantifies both of the aforementioned effects: it relates the total multi-task error to quantities that depend on the subset of labeled tasks and on the task-specific weights used for information transfer. Using the computable quantities in the bound as an objective function and minimizing it numerically, we obtain a principled algorithm for selecting which tasks to have labeled (in the active task selection scenario) and for choosing task-specific weights and predictors for all tasks, labeled as well as unlabeled. We highlight the practical usefulness of the derived method by experiments on synthetic and real data.

The success of any information transfer approach, regardless of whether it is applied in the multi-task or the domain adaptation scenario, depends on the relatedness of the tasks of interest. Indeed, one cannot expect to benefit from information transfer between the labeled tasks, or to obtain solutions of reasonable quality for the unlabeled ones, if the given tasks are completely unrelated. An advantage of the method we propose is that from the associated generalization bound we can read off explicitly under which conditions the algorithm can be expected to succeed. In particular, it suggests that the proposed method is likely to succeed if the given set of tasks satisfies the following assumption of task smoothness: if two tasks are similar in their marginal distributions, then their optimal prediction functions are also likely to be similar. A more formal definition will be given in Section 3. The task smoothness assumption resembles the classical smoothness assumption of semi-supervised learning (Chapelle et al., 2006). It can be expected to hold in many real-world settings with a large number of tasks, for example in the aforementioned case of sentiment analysis: if two products are described using similar words, these words likely have similar connotations for both products. Note, also, that a similar assumption appears implicitly in (Blanchard et al., 2011).

1.1. Related Work

Most existing multi-task learning methods work in the fully supervised setting and aim at improving the overall prediction quality by sharing information between the tasks.
For this they employ different types of transfer: instance-transfer methods re-use training samples from different tasks (Crammer & Mansour, 2012); parameter-transfer methods assume that the predictors for all tasks are similar to each other in some norm and exploit this fact through specific regularizers (Evgeniou & Pontil, 2004); representation-transfer approaches assume that the predictors for all tasks share a common (low-dimensional) representation that can be learned from the data (Argyriou et al., 2007; 2008). Follow-up works extended and generalized these concepts, e.g. by learning the relatedness of tasks (Saha et al., 2011; Kang et al., 2011) or by sharing only between subgroups of tasks (Xue et al., 2007; Kumar & Daumé III, 2012; Barzilai & Crammer, 2015). However, all of the above methods require at least some labeled data for each task.

To our knowledge, the only existing multi-task method that can be applied in the considered setting, where for some tasks only unlabeled data is available, is (Khosla et al., 2012). Motivated by the problem of dataset bias, this method relies on the assumption that different tasks are minor modifications (i.e. biased versions) of the same true prediction problem. Similarly to (Evgeniou & Pontil, 2004), it uses specific regularizers and trains predictors for all tasks jointly as small perturbations of a common predictor, which corresponds to the hypothetical unbiased task and can potentially be applied to unseen problems. Applied in the considered setting, this method thus provides a single predictor for all unlabeled tasks and treats the labeled ones as slight variations of it.

Information transfer from labeled to unlabeled tasks is the question typically studied in domain adaptation research. In fact, if the set of labeled tasks is fixed, any domain adaptation technique might be used to obtain solutions for the unlabeled tasks, in particular those based on source reweighting (Shimodaira, 2000), representation learning (Pan et al., 2011; Glorot et al., 2011), or semi-supervised transfer (Xing et al., 2007). However, by design all domain adaptation methods aim at finding the best predictor on a single target task given a fixed set of source tasks. Therefore none of them can readily be applied in the active task selection setting, where the learner needs to select the labeled tasks that would lead to good performance across all tasks.

A second related setting is zero-shot learning (Larochelle et al., 2008; Lampert et al., 2013; Palatucci et al., 2009), where contextual, usually semantic, information is used to solve a learning task for which no training data is available. The situation we are interested in is more specific than this, though, as we assume that unlabeled data of the tasks is available, not context in an arbitrary form. As we will show, this allows us to derive formal performance guarantees that zero-shot learning methods typically lack.

The active task selection scenario is directly related to the question of identifying a representative set of source tasks in domain adaptation, a question that has previously been raised in the context of sentiment analysis (Blitzer et al., 2007). It also shares some features with active learning, where the learner is given a set of unlabeled samples and can choose a subset to obtain labels for.
A fundamental difference is, however, that in active learning the learner needs to find a single prediction function for all labeled and unlabeled data, while in the multi-task setting each task, including the unlabeled ones, potentially requires its own predictor. In the multi-task or zero-shot setting, active learning has so far not found widespread use. Exemplary works in this direction are (Reichart et al., 2008; Saha et al., 2010; Gavves et al., 2015), which, however, use active learning on the level of training examples, not tasks. The idea of choosing tasks was used in active curriculum selection (Ruvolo & Eaton, 2013; Pentina et al., 2015), where the learner can influence the order in which tasks are processed. These methods nevertheless require annotated examples for all tasks of interest.

2. MTL with Labeled and Unlabeled Tasks

In the multi-task setting the learner observes a collection of prediction tasks and its goal is to learn all of them. Formally, we assume that there is a set of $T$ tasks $\{\langle D_1, f_1\rangle, \dots, \langle D_T, f_T\rangle\}$, where each task $t$ is defined by a marginal distribution $D_t$ over the input space $X$ and a deterministic labeling function $f_t: X \to Y$. The goal of the learner is to find $T$ predictors $h_1, \dots, h_T$ in a hypothesis set $\mathcal{H} \subset \{h: X \to Y\}$ that minimize the average expected risk:

$$er(h_1, \dots, h_T) = \frac{1}{T}\sum_{t=1}^{T} er_t(h_t), \quad (1)$$

where $er_t(h_t) = \mathbb{E}_{x\sim D_t}\,\ell(h_t(x), f_t(x))$. In this work we concentrate on the case of binary classification tasks, $Y = \{-1, 1\}$, and the 0/1-loss, $\ell(y_1, y_2) = 0$ if $y_1 = y_2$, and $\ell(y_1, y_2) = 1$ otherwise.

In the fully supervised setting the learner is given a training set of annotated examples for every task of interest. In contrast, we consider the scenario where every task $t$ is represented by a set $S_t = \{x^t_1, \dots, x^t_n\}$ of $n$ unlabeled examples sampled i.i.d. according to the marginal distribution $D_t$. For a subset of $k$ tasks $\{i_1, \dots, i_k\}$, which are either predefined or, in the active scenario, can be selected based on the unlabeled data, the learner is given labels for a random subset $\bar S_{i_j} \subset S_{i_j}$ of $m$ points.

To obtain a predictor for any task, labeled or unlabeled, we consider a method that minimizes a convex combination of training errors of the labeled tasks. This choice allows us to capture, in a unified fashion, both types of information transfer: between the labeled tasks and from labeled to unlabeled ones. Formally, for a set of tasks $I = \{i_1, \dots, i_k\} \subset \{1, \dots, T\}$ we define:

$$\Lambda_I = \Big\{\alpha \in [0, 1]^T : \sum_{i=1}^{T}\alpha_i = 1,\ \operatorname{supp}\alpha \subset I\Big\} \quad (2)$$

for $\operatorname{supp}\alpha = \{i \in \{1,\dots,T\} : \alpha_i \neq 0\}$. Given a weight vector $\alpha \in \Lambda_I$, the $\alpha$-weighted empirical error of a hypothesis $h \in \mathcal{H}$ is defined as follows:

$$\widehat{er}_\alpha(h) = \sum_{i\in I}\alpha_i\,\widehat{er}_i(h), \quad (3)$$

where

$$\widehat{er}_i(h) = \frac{1}{m}\sum_{(x,y)\in \bar S_i}\ell(h(x), y). \quad (4)$$

In order to obtain a solution for any task $t$, the learner minimizes $\widehat{er}_{\alpha^t}(h)$ for some $\alpha^t \in \Lambda_I$, where $I$ is the set of labeled tasks, potentially in combination with some regularization. The success of this approach depends on the subset $I$ of tasks that are labeled and on the weights $\alpha^1, \dots, \alpha^T$. The theorem below quantifies both of these effects and will later be used to choose $\alpha^1, \dots, \alpha^T$, and potentially $I$, in a principled way.
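To make the transfer rule concrete, the following is a minimal sketch of this per-task weighted minimization. The function name is ours, and ridge-regularized least squares is used as a convex stand-in for the 0/1-loss of the analysis (the experiments in Section 4 also train individual predictors by least-squares ridge regression):

```python
import numpy as np

def weighted_erm(labeled_sets, alpha_t, reg=1e-3):
    """Minimize the alpha-weighted empirical error (3) for one target task t.

    labeled_sets: list of (X_i, y_i), one per labeled task i in I, with
                  X_i of shape (m_i, d) and labels y_i in {-1, +1}.
    alpha_t:      the weights alpha^t (non-negative, summing to one).
    Uses ridge-regularized least squares as a convex surrogate of the
    0/1-loss; returns (w, b) for the predictor x -> sign(w @ x + b).
    """
    d = labeled_sets[0][0].shape[1]
    Xs, ys, sample_weights = [], [], []
    for (X_i, y_i), a_i in zip(labeled_sets, alpha_t):
        Xs.append(X_i)
        ys.append(y_i)
        # a per-sample weight alpha^t_i / m_i makes the total weighted
        # squared loss mirror the structure of (3) and (4)
        sample_weights.append(np.full(len(y_i), a_i / len(y_i)))
    X = np.vstack(Xs)
    y = np.concatenate(ys)
    sw = np.concatenate(sample_weights)
    Xb = np.hstack([X, np.ones((len(y), 1))])  # append a bias feature
    # weighted ridge regression: solve the normal equations
    A = (Xb * sw[:, None]).T @ Xb + reg * np.eye(d + 1)
    b = (Xb * sw[:, None]).T @ y
    theta = np.linalg.solve(A, b)
    return theta[:d], theta[d]
```

With a one-hot $\alpha^t$ this reduces to training on a single labeled task; spreading the weights pools samples from several labeled tasks, which is exactly the transfer behavior the bound below quantifies.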
Theorem 1. Let $d$ be the VC dimension of the hypothesis set $\mathcal{H}$ and let $k$ be the number of labeled tasks. Let $S_1, \dots, S_T$ be $T$ sets of size $n$ each, where $S_i \sim (D_i)^n$ i.i.d., and let $\bar S_1, \dots, \bar S_T$ be their random subsets of size $m$ each, for which labels would be provided if the corresponding task is selected as labeled. Then for any $\delta > 0$, with probability at least $1-\delta$ over $S_1, \dots, S_T$ and $\bar S_1, \dots, \bar S_T$, uniformly for all choices of labeled tasks $I = \{i_1, \dots, i_k\}$ and weights $\alpha^1, \dots, \alpha^T \in \Lambda_I$, provided that they are fully determined by the unlabeled data only, and for all possible choices of $h_1, \dots, h_T \in \mathcal{H}$, the following inequality holds:

$$\frac{1}{T}\sum_{t=1}^{T} er_t(h_t) \le \frac{1}{T}\sum_{t=1}^{T}\widehat{er}_{\alpha^t}(h_t) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\operatorname{disc}(S_t, S_i) + \frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2} + C + D + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\lambda_{ti}, \quad (5)$$

where

$$\operatorname{disc}(S_t, S_i) = \max_{h,h'\in\mathcal{H}}\big|\,\widehat{er}_{S_t}(h, h') - \widehat{er}_{S_i}(h, h')\,\big| \quad\text{with}\quad \widehat{er}_{S_i}(h, h') = \frac{1}{n}\sum_{j=1}^{n}\ell(h(x^i_j), h'(x^i_j))$$

is the empirical discrepancy between the unlabeled samples $S_t$ and $S_i$, $\lambda_{ij} = \min_{h\in\mathcal{H}}\big(er_i(h) + er_j(h)\big)$, the mixed norms of the weights are

$$\|\alpha\|_{2,1} = \sum_{t=1}^{T}\sqrt{\sum_{i\in I}(\alpha^t_i)^2}, \qquad \|\alpha\|_{1,2} = \sqrt{\sum_{i\in I}\Big(\sum_{t=1}^{T}\alpha^t_i\Big)^2},$$

and the complexity terms are

$$A = \sqrt{\frac{2d\log(ekm/d)}{m}}, \qquad B = \sqrt{\frac{\log(4/\delta)}{2m}},$$
$$C = \sqrt{\frac{8(\log T + d\log(enT/d))}{n}}, \qquad D = 2\sqrt{\frac{2d\log(2n) + 2\log(T) + \log(4/\delta)}{n}}.$$

Proof sketch (the full proof can be found in the supplemental material). By Theorem 2 in (Ben-David et al., 2010), for any two tasks $t$ and $i$ the following inequality holds for every $h \in \mathcal{H}$:

$$er_t(h) \le er_i(h) + \operatorname{disc}(D_t, D_i) + \lambda_{ti}. \quad (6)$$

Thus, we obtain the following bound on the average expected error over all tasks in terms of the errors on the labeled tasks:

$$\frac{1}{T}\sum_{t=1}^{T} er_t(h_t) \le \frac{1}{T}\sum_{t=1}^{T} er_{\alpha^t}(h_t) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\operatorname{disc}(D_t, D_i) + \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\lambda_{ti}, \quad (7)$$

with

$$er_{\alpha^t}(h_t) = \sum_{i\in I}\alpha^t_i\,\mathbb{E}_{x\sim D_i}\ell(h_t(x), f_i(x)), \quad (8)$$

where

$$er_{D_i}(h, h') = \mathbb{E}_{x\sim D_i}\,\ell(h(x), h'(x)) \quad (9)$$

and

$$\operatorname{disc}(D_t, D_i) = \max_{h,h'\in\mathcal{H}}\big|\,er_{D_t}(h, h') - er_{D_i}(h, h')\,\big| \quad (10)$$

is the discrepancy between two distributions (Kifer et al., 2004; Mansour et al., 2009; Ben-David et al., 2010).

In order to prove the statement of the theorem we need to relate the $\alpha$-weighted expected errors and the discrepancies between the marginal distributions in (7) to their empirical estimates. The proof consists of three steps. First, we show that, conditioned on the unlabeled data, $\frac{1}{T}\sum_{t=1}^{T}\overline{er}_{\alpha^t}$ can be upper bounded in terms of $\frac{1}{T}\sum_{t=1}^{T}\widehat{er}_{\alpha^t}$, where

$$\overline{er}_\alpha(h) = \sum_{i\in I}\alpha_i\,\overline{er}_i(h) \quad\text{with}\quad \overline{er}_i(h) = \frac{1}{n}\sum_{j=1}^{n}\ell(h(x^i_j), f_i(x^i_j)). \quad (11)$$

This quantity can be interpreted as the training error the learner would incur if it received labels for all samples of the chosen tasks $I$. Note that in the case $m = n$ this step is not needed and we can avoid the corresponding complexity terms. In the second step we relate the average $\alpha$-weighted expected errors to $\frac{1}{T}\sum_{t=1}^{T}\overline{er}_{\alpha^t}$. In the third step we conclude the proof by bounding the pairwise discrepancies in terms of their empirical estimates.

Step 1. Fix the unlabeled sets $S_1, \dots, S_T$. They fully determine the choice of labeled tasks $I$ and the weights $\alpha^1, \dots, \alpha^T$. Therefore, conditioned on the unlabeled data, these quantities can be considered constant and the bound has to hold uniformly only with respect to $h_1, \dots, h_T$. To simplify the notation we assume that $I = \{1, \dots, k\}$ and define:

$$\Phi(\bar S_1, \dots, \bar S_k) = \sup_{h_1,\dots,h_T}\ \frac{1}{T}\sum_{t=1}^{T}\Big(\overline{er}_{\alpha^t}(h_t) - \widehat{er}_{\alpha^t}(h_t)\Big).$$

Note that one could analyze this quantity using standard techniques of Rademacher analysis if the labeled examples were sampled from the unlabeled sets i.i.d., i.e. with replacement. However, since we assume that every $\bar S_i$ is a subset of $S_i$, i.e. the labeled examples are sampled randomly without replacement, there are dependencies between the labeled examples. Therefore we utilize techniques from the literature on transductive learning (El-Yaniv & Pechyony, 2007) instead. We first apply Doob's construction to $\Phi$ in order to obtain a martingale sequence and then use McDiarmid's inequality for martingales (McDiarmid, 1989). As a result we obtain that with probability at least $1-\delta/4$ over the sampling of labeled examples:

$$\Phi \le \mathbb{E}_{\bar S_1,\dots,\bar S_k}\,\Phi + \frac{B}{T}\|\alpha\|_{1,2}. \quad (12)$$

Now we need to upper bound $\mathbb{E}\,\Phi$.
Using results from (Tolstikhin et al., 2014) and (Hoeffding, 1963) we observe that:

$$\mathbb{E}_{\bar S_1,\dots,\bar S_k}\,\Phi(\bar S_1, \dots, \bar S_k) \le \mathbb{E}_{\tilde S_1,\dots,\tilde S_k}\,\Phi(\tilde S_1, \dots, \tilde S_k), \quad (13)$$

where $\tilde S_i$ is a set of $m$ points sampled from $S_i$ i.i.d. with replacement (in contrast to the sampling without replacement that produces $\bar S_i$). This means that we can upper bound the expectation of $\Phi$ over samples with dependencies by the expectation over independent samples. By doing so, applying the symmetrization trick, and introducing Rademacher random variables, we obtain:

$$\mathbb{E}_{\bar S_1,\dots,\bar S_k}\,\Phi \le \frac{1}{T}\sum_{t=1}^{T}\sqrt{\sum_{i=1}^{k}(\alpha^t_i)^2}\,\sqrt{\frac{2d\log(ekm/d)}{m}} = \frac{A}{T}\|\alpha\|_{2,1}. \quad (14)$$

A combination of (12) and (14) shows that (conditioned on the unlabeled data) with probability at least $1-\delta/4$ over the sampling of labeled examples, uniformly for all choices of $h_1, \dots, h_T$, the following holds:

$$\frac{1}{T}\sum_{t=1}^{T}\overline{er}_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T}\widehat{er}_{\alpha^t}(h_t) + \frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2}. \quad (15)$$

Step 2. Now we relate $\frac{1}{T}\sum_{t=1}^{T} er_{\alpha^t}$ to $\frac{1}{T}\sum_{t=1}^{T}\overline{er}_{\alpha^t}$. The choice of the tasks to label, $I$, the corresponding weights, $\alpha$, and the predictors, $h$, all depend on the unlabeled data. Therefore, we aim for a bound that is uniform in all three parameters. We define:

$$\Psi(S_1, \dots, S_T) = \sup_{I}\ \sup_{\alpha^1,\dots,\alpha^T\in\Lambda_I}\ \sup_{h_1,\dots,h_T}\ \frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\big(er_i(h_t) - \overline{er}_i(h_t)\big).$$

The main instrument that we use here is a refined version of McDiarmid's inequality due to (Maurer, 2006). It allows us to use the standard Rademacher analysis while taking into account the internal structure of the weights $\alpha^1, \dots, \alpha^T$. As a result we obtain that with probability at least $1-\delta/4$, simultaneously for all choices of tasks to be labeled, $I$, weights $\alpha^1, \dots, \alpha^T \in \Lambda_I$ and hypotheses $h_1, \dots, h_T$:

$$\frac{1}{T}\sum_{t=1}^{T} er_{\alpha^t}(h_t) \le \frac{1}{T}\sum_{t=1}^{T}\overline{er}_{\alpha^t}(h_t) + C. \quad (16)$$

Step 3. To conclude the proof we bound the pairwise discrepancies in terms of their finite-sample estimates. According to Lemma 1 in (Ben-David et al., 2010), for any pair of tasks $i, j$ and any $\delta > 0$, with probability at least $1-\delta$:

$$\operatorname{disc}(D_i, D_j) \le \operatorname{disc}(S_i, S_j) + 2\sqrt{\frac{2d\log(2n) + \log(2/\delta)}{n}}.$$

We apply this result to every pair of tasks and combine the results using the union bound argument. This yields the remaining two terms on the right-hand side of (5): the weighted average of the sample-based discrepancies and the constant $D$. Combining the result with (15) and (16) gives the statement of the theorem.

3. Explanation and Interpretation

The left-hand side of inequality (5) is the average expected error over all $T$ tasks, the quantity of interest that the learner would like to minimize but cannot directly compute. It is upper-bounded by the sum of two complexity terms and five task-dependent terms: the weighted training errors on the labeled tasks, the weighted averages of the distances to the labeled tasks in terms of the empirical discrepancies, two mixed norms of the weights $\alpha$, and a weighted average of the $\lambda$-terms.

The complexity terms $C$ and $D$ behave as $O(\sqrt{d\log(nT)/n})$ and converge to zero when the number of unlabeled examples per task, $n$, tends to infinity. In contrast, $\frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2}$ in the worst case of $\|\alpha\|_{2,1} = \|\alpha\|_{1,2} = T$ behaves as $O(\sqrt{d\log(km)/m})$ and converges to zero when the number of labeled examples per labeled task, $m$, tends to infinity. In order for these terms to be balanced, i.e. for the uncertainty coming from the estimation of the discrepancies not to dominate the uncertainty from the estimation of the $\alpha$-weighted risks, the number of unlabeled examples per task, $n$, should be significantly larger than $m$ (for $k \ll T$). However, this is not a strong limitation under the common assumption that obtaining unlabeled examples is significantly cheaper than obtaining annotated ones.
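To make this balance condition explicit, here is a rough order-of-magnitude check (a sketch that ignores constants and the exact union-bound terms):

$$D \lesssim \frac{A}{T}\|\alpha\|_{2,1} \quad\Longleftrightarrow\quad \sqrt{\frac{d\log(nT)}{n}} \lesssim \sqrt{\frac{d\log(km)}{m}} \quad\Longleftrightarrow\quad n \gtrsim m\,\frac{\log(nT)}{\log(km)},$$

so for $k \ll T$ the unlabeled sample size $n$ must exceed the labeled sample size $m$ by a factor that grows logarithmically with $T$.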
The remaining terms on the right-hand side of (5) depend on the set of labeled tasks $I$, the task-specific weights $\alpha^t$ and the hypotheses $h_t$. Thus, by minimizing them with respect to these quantities one can expect to obtain values for them that are beneficial for solving all tasks of interest based on the given data. For the theorem to hold, the set of labeled tasks and the weights may not depend on the labels. The part of the bound that can be estimated based on the unlabeled data only, and that can therefore be used to select $I$ (in the active scenario) and $\alpha^1, \dots, \alpha^T$, is:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\operatorname{disc}(S_t, S_i) + \frac{A}{T}\|\alpha\|_{2,1} + \frac{B}{T}\|\alpha\|_{1,2}. \quad (17)$$

The first term in (17) is the average weighted distance from every task to the labeled ones, as measured by the discrepancy between the corresponding unlabeled training samples. This term suggests that for every task $t$ the largest weight, i.e. the highest impact in terms of information transfer, should be put on a labeled task $i$ that has a similar marginal distribution. Note that the employed "similarity", which is captured by the discrepancy, directly depends on the considered hypothesis class and loss function and is thus tailored to the particular setting of interest.

At the same time, the mixed-norm terms $\|\alpha\|_{1,2}$ and $\|\alpha\|_{2,1}$ prevent the learner from putting all weight on the single closest labeled task and can be seen as a form of regularization. In particular, they encourage information transfer also between the labeled tasks: minimizing just the first term in (17) for every labeled task $i \in I$ would result in all weight being put on task $i$ itself and none on the other tasks, because by definition $\operatorname{disc}(S_i, S_i) = 0$. The first mixed-norm term, $\|\alpha\|_{2,1}$, influences every $\alpha^t$ independently and encourages the learner to use data from multiple labeled tasks for adaptation. Thus, it captures the intuition that sharing from multiple labeled tasks should improve the performance. In contrast, $\|\alpha\|_{1,2}$ connects the weights for all tasks. This term suggests labeling tasks that would all be equally useful, thus preventing the learner from spending resources on tasks that would be informative for only a few of the remaining ones. It also prevents the learner from relying on super-influential labeled tasks that share with too many others. Such cases would be very unstable in the worst-case scenario: mistakes on such tasks would propagate and have a major effect on the overall performance.
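The two extreme weight configurations that matter for the convergence-rate discussion below can be checked numerically; here is a standalone sketch (the values of `T` and `k` are arbitrary illustrative choices):

```python
import numpy as np

T, k = 1000, 50   # illustrative numbers of tasks / labeled tasks

def mixed_norms(alpha):
    """alpha: (T, k) matrix whose row t is alpha^t restricted to I.
    Returns (||alpha||_{2,1}, ||alpha||_{1,2}) as defined in Theorem 1."""
    norm_21 = np.sqrt((alpha ** 2).sum(axis=1)).sum()   # sum_t sqrt(sum_i (a^t_i)^2)
    norm_12 = np.sqrt((alpha.sum(axis=0) ** 2).sum())   # sqrt(sum_i (sum_t a^t_i)^2)
    return norm_21, norm_12

# worst case: every task puts all weight on the same single labeled task
concentrated = np.zeros((T, k))
concentrated[:, 0] = 1.0
print(mixed_norms(concentrated))   # (1000.0, 1000.0): both equal T

# opposite extreme: every task weights all labeled tasks equally
uniform = np.full((T, k), 1.0 / k)
print(mixed_norms(uniform))        # both equal T / sqrt(k), here about 141.4
```

The concentrated configuration is exactly the unstable super-influential case penalized above, while the uniform one recovers the $T/\sqrt{k}$ value used in the rate analysis that follows.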
The effect of the mixed-norm terms can also be seen through the lens of convergence rates. Indeed, in the case of every $\alpha^t$ having only one non-zero component, $\|\alpha\|_{2,1}$ and $\|\alpha\|_{1,2}$ are equal to $T$ and thus the convergence rate is $\tilde O(\sqrt{1/m})$, where $\tilde O(\cdot)$ is an analog of $O(\cdot)$ that hides logarithmic factors. However, in the opposite extreme, if every $\alpha^t$ weights all the labeled tasks equally, i.e. $\alpha^t_i = 1/k$ for all $t \in \{1, \dots, T\}$ and $i \in I$, then $\|\alpha\|_{2,1} = \|\alpha\|_{1,2} = T/\sqrt{k}$ and the convergence rate improves to $\tilde O(\sqrt{1/(km)})$, which is the best one can expect from having a total of $km$ labeled examples.

The only term on the right-hand side of (5) that depends on the hypotheses $h_1, \dots, h_T$, and can therefore be used to make a favorable choice of them, is the weighted training error on the labeled tasks. Thus, the generalization bound of Theorem 1 suggests the following algorithm (Figure 1):

Algorithm 1.
1. Estimate the pairwise discrepancies between the tasks based on the unlabeled data.
2. Choose the tasks $I$ to be labeled (in the active case) and the weights $\alpha^1, \dots, \alpha^T$ by minimizing (17).
3. Receive labels for the labeled tasks $I$.
4. For every task $t$, train a classifier by minimizing (3) using the obtained weights $\alpha^t$.

Figure 1. Schematic illustration of active task selection. Left: eight unlabeled tasks need to be solved. Center: the subset of tasks to be labeled and the between-task weights are determined by minimizing (17). Right: annotated data for the labeled tasks is obtained, and prediction functions (black vs. white) for each task are learned using the learned weighted combinations. Sharing can occur between labeled tasks.

Note that this procedure is justified by Theorem 1: all choices are made in agreement with the conditions of the theorem and, because inequality (5) holds uniformly for all eligible choices of labeled tasks, weights and predictors, the guarantees also hold for the resulting solution.

Algorithm 1 is guaranteed to perform well if the solution it finds leads to a low value of the right-hand side of (5). By construction, it minimizes all data-dependent terms in (5), except for one quantity that cannot be estimated from the available data:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\lambda_{ti}. \quad (18)$$

While the discrepancy captures the similarity between marginal distributions, the $\lambda$-terms reflect the similarity between labeling functions: for every pair of a task $t$ and a labeled task $i \in I$, the corresponding value $\lambda_{ti}$ is small if there exists a hypothesis that performs well on both tasks. Thus, Algorithm 1 can be expected to perform well if for any two tasks $t$ and $i$ that are close to each other in terms of discrepancy (so that in the minimization of (17) the corresponding $\alpha^t_i$ is large), there exists a hypothesis that performs well on both of them (i.e. the corresponding $\lambda_{ti}$ is small). We refer to this property of a set of learning tasks as task smoothness.

Training predictors for every task of interest using data from all labeled tasks improves the statistical guarantees of the learner. However, it results in empirical risk minimization on up to $km$ samples for $T$ different weighted combinations. Since we are most interested in the situation when $T$ is large, one might look for ways to reduce the amount of necessary computation. One way to do so is to drop the mixed-norm terms from the objective function (17), in which case it reduces to:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{i\in I}\alpha^t_i\operatorname{disc}(S_t, S_i). \quad (19)$$

This expression is linear in $\alpha$, so minimizing it for a fixed set $I$ assigns each task entirely to the single labeled task that is closest to it in terms of empirical discrepancy; each labeled task is assigned to itself. Consequently, the learner must train only $k$ predictors, one per labeled task, each using only that task's $m$ labeled samples. The expression (19) can be seen as the k-medoids clustering objective, with tasks corresponding to points in a space with a (semi-)metric defined by the empirical discrepancy and labeled tasks corresponding to the cluster centers. Thus, this method reduces to k-medoids clustering, resembling the suggestion of Blitzer et al. (2007); a sketch is given below. Note that, nevertheless, the conditions of Theorem 1 are fulfilled, and thus its guarantees hold for the obtained solution. The guarantees will be more pessimistic, however, than those of Algorithm 1, as the minimization ignores parts of the bound (5) and does not use the potentially beneficial transfer between labeled tasks.
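A minimal sketch of this single-source variant follows. The function name is ours, the swap-based local search is a simple stand-in for the method of Park & Jun (2009) used in the experiments, and the pairwise empirical discrepancy matrix is assumed to be precomputed with a zero diagonal:

```python
import numpy as np

def kmedoids_task_selection(disc, k, n_iter=100, seed=0):
    """Swap-based local-search k-medoids on the (T, T) matrix `disc` of
    pairwise empirical discrepancies: choose k labeled tasks (medoids)
    minimizing objective (19), i.e. the sum over all tasks of the
    discrepancy to the closest labeled task. Returns (medoids, assignment),
    where assignment[t] is the labeled task that task t transfers from.
    """
    rng = np.random.default_rng(seed)
    T = disc.shape[0]
    medoids = list(rng.choice(T, size=k, replace=False))

    def cost(ms):
        # each task contributes its discrepancy to the nearest medoid
        return disc[:, ms].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(n_iter):
        improved = False
        for j in range(k):
            for cand in range(T):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = cand
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:   # local optimum reached
            break
    # with a zero diagonal, every medoid is assigned to itself
    assignment = np.asarray(medoids)[disc[:, medoids].argmin(axis=1)]
    return medoids, assignment
```

In the active scenario this plays the role of Active DA-SS from Section 4; with a fixed random `medoids` set, only the assignment step is needed (DA-SS).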
4. Experiments

To illustrate that the proposed algorithm can also be practically useful, we performed experiments on synthetic and real data. In both cases we choose $\mathcal{H}$ to be the set of all linear predictors with a bias term on $X = \mathbb{R}^d$.

Synthetic data. We generate $T = 1000$ binary classification tasks in $\mathbb{R}^2$. For each task $t$, its marginal distribution $D_t$ is a unit-variance Gaussian with mean $\mu_t$ chosen uniformly at random from the set $[-5, 5] \times [-5, 5]$. The label $+1$ is assigned to all points whose angle with $\mu_t$ (computed counter-clockwise) is between $0$ and $\pi$; the other points are labeled $-1$. We use $n = 1000$ unlabeled and $m = 100$ labeled examples per task.
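A sketch of this generating process is given below. It reflects our reading of the labeling rule: the counter-clockwise angle from $\mu_t$ to $x$ lies in $(0, \pi)$ exactly when the 2D cross product of $\mu_t$ and $x$ is positive; the function name and seed handling are illustrative:

```python
import numpy as np

def make_tasks(T=1000, n=1000, seed=0):
    """Generate T binary tasks in R^2: D_t is a unit-variance Gaussian
    with mean mu_t ~ Uniform([-5, 5]^2); a point x gets label +1 iff the
    counter-clockwise angle from mu_t to x is in (0, pi), i.e. iff the
    z-component of the cross product mu_t x x is positive."""
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(T):
        mu = rng.uniform(-5, 5, size=2)
        X = rng.normal(loc=mu, scale=1.0, size=(n, 2))
        cross = mu[0] * X[:, 1] - mu[1] * X[:, 0]   # z-component of mu x x
        y = np.where(cross > 0, 1, -1)
        tasks.append((X, y))
    return tasks
```

By construction the optimal decision boundary is a different line through the origin for every task, so no single predictor can perform well on a large fraction of the tasks, which is exactly the property exploited in the comparison below.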
Real data. We curate a Multitask dataset of product reviews² from the corpus of Amazon product data³ (McAuley et al., 2015a;b). We select the products for which there are at least 300 positive reviews (with scores 4 or 5) and at least 300 negative reviews (with scores 1 or 2). Each of the resulting 957 products we treat as a binary classification task of predicting whether a review is positive or negative. For every review we extract features by first pre-processing (removing all non-alphabetic characters, transforming the rest into lower case, and removing stop words) and then applying the sentence embedding procedure of (Arora et al., 2017) on top of 25-dimensional GloVe word embeddings (Pennington et al., 2014). We use $n = 500$ unlabeled samples per task and label a subset of $m = 400$ examples for each of the selected tasks. The remaining data is used for testing.

²http://cvml.ist.ac.at/productreviews/
³http://jmcauley.ucsd.edu/data/amazon/

Methods. We evaluate the proposed method in the case when the set of labeled tasks is predefined (referred to as DA), by setting $I$ to a random subset of tasks and minimizing (17) only with respect to the $\alpha$-s, and in the active task selection scenario, where (17) is minimized with respect to both $I$ and the $\alpha$-s (referred to as Active DA). We compare these methods to a multi-task method based on (Khosla et al., 2012), also with random labeled tasks (the same ones as for DA). Specifically, we solve:

$$\min_{w,\{v_t\},b,\{b_t\}}\ C\|w\|^2 + \frac{1}{k}\sum_{t\in I}\|v_t\|^2 + \frac{1-\gamma}{km}\sum_{t\in I}\sum_{(x,y)\in\bar S_t}\big(w^{\top}x + b - y\big)^2 + \frac{\gamma}{km}\sum_{t\in I}\sum_{(x,y)\in\bar S_t}\big((w + v_t)^{\top}x + (b + b_t) - y\big)^2 \quad (20)$$

for $\gamma \in \{0, 0.1, \dots, 1\}$ and use $(w, b)$ for making predictions on all unlabeled tasks and $(w + v_t, b + b_t)$ for each labeled task $t \in I$. For every number of labeled tasks we report the result for the $\gamma$ with the best test performance averaged over 10 repeats (denoted by Multi-task), as an upper bound on the performance achievable by model selection.

We also evaluate the discussed simplification of the proposed methods that consists of minimizing (19). We refer to it as DA-SS for randomly predefined labeled tasks and as Active DA-SS in the active task selection scenario. The "SS" stands for single source: in this setting, each task is solved based on information from only one labeled task.

To provide further context for the results, we also report the results of learning independent ridge regressions with access to labels for all tasks (denoted by Fully Labeled). However, this baseline has access to many more annotated examples in total than all other methods. In order to quantify this effect we also consider the setting in which the learner has access to labels for all tasks, but to fewer of them: namely, when the number of labeled tasks is $k$, the number of labels per task is $mk/T$, i.e. the total number of labeled examples is $mk$, the same as for all other methods. In this case we evaluate two methods. The first one learns a ridge regression for every task independently and thus serves as a reference point for the methods that do not involve information transfer between the labeled tasks, i.e. DA-SS and Active DA-SS (the Partially Labeled baseline in Figure 2). The second reference method is based on (Evgeniou & Pontil, 2004) and consists of minimizing (20) with $\gamma$ set to $1$ and all tasks processed as labeled. This approach transfers information between all tasks, and we therefore refer to it when evaluating the methods that involve information transfer between the labeled tasks, i.e. DA, Active DA and Multi-task (PL Multi-task in Figure 2).

Implementation. We estimate the empirical discrepancies between pairs of tasks by finding a hypothesis in $\mathcal{H}$ that minimizes the squared loss for the binary classification problem of separating the two sets of instances, as in (Ben-David et al., 2010). To minimize (17) for a given set of labeled tasks we use gradient descent. It is also used as a subroutine when minimizing (17) with respect to both $I$ and the $\alpha$-s, for which we employ the GraSP algorithm (Bahmani et al., 2013). Active DA-SS involves the minimization of the k-medoids objective (19), which we perform using a local search as in (Park & Jun, 2009). For both methods in the active task selection scenario we use the heuristic from k-means++ (Arthur & Vassilvitskii, 2007) for initialization. To obtain classifiers for the individual tasks in all scenarios we use least-squares ridge regression. Regularization constants for all methods were selected from the set $\{0\} \cup \{10^{-17}, 10^{-16}, \dots, 10^{8}\}$ by 5×5-fold cross-validation.
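As an illustration of the discrepancy-estimation step above, here is a sketch for linear hypotheses. The conversion of the separation error into a distance via $1 - 2\,\mathrm{err}$ follows the proxy-distance construction of Ben-David et al. (2010) up to scaling, which we adopt as an assumption; the function name is ours:

```python
import numpy as np

def estimate_discrepancy(S_t, S_i, reg=1e-3):
    """Estimate disc(S_t, S_i) between two unlabeled samples by training a
    linear least-squares classifier to separate them (S_t labeled +1,
    S_i labeled -1) and converting its 0/1 separation error into a proxy
    distance: samples that cannot be told apart (error near 1/2) yield a
    discrepancy near 0, perfectly separable samples yield a value near 1."""
    X = np.vstack([S_t, S_i])
    y = np.concatenate([np.ones(len(S_t)), -np.ones(len(S_i))])
    Xb = np.hstack([X, np.ones((len(y), 1))])           # add a bias feature
    # ridge-regularized least squares, as in the implementation above
    theta = np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ y)
    err = np.mean(np.sign(Xb @ theta) != y)             # 0/1 separation error
    return max(0.0, 1.0 - 2.0 * err)                    # proxy discrepancy
```

Running this for all task pairs yields the matrix consumed by the weight optimization for (17) and by the k-medoids selection sketched earlier.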
Figure 2. Experimental results on synthetic and real data: average test error and standard deviation over 10 repeats, as a function of the fraction of labeled tasks (in %). Panels: (a) synthetic data, complete transfer; (b) product reviews, complete transfer; (c) synthetic data, single-source transfer; (d) product reviews, single-source transfer.

Results. The results are shown in Figure 2. First, one can see that the proposed domain adaptation-inspired method DA outperforms the multi-task method (20). This may be due to the higher flexibility of DA compared to Multi-task, as the latter provides only one predictor for all unlabeled tasks. Indeed, the difference is most apparent in the experiment on synthetic data, where by design there is no single predictor that could perform well on a large fraction of the tasks. The results on the product reviews indicate that DA's flexibility of learning a specific predictor for every task can be advantageous in more realistic scenarios as well.

Second, on both datasets both methods for active task selection, i.e. Active DA and Active DA-SS, systematically outperform the corresponding passive methods, i.e. DA and DA-SS, across various fractions of labeled tasks. In particular, both active task selection methods require substantially fewer labeled tasks to achieve the same accuracy as their analogs with randomly chosen tasks. This confirms the intuition that selecting the tasks to label in a data-dependent way is beneficial, and it demonstrates that Theorem 1 is capable of capturing this effect.

Another interesting observation that can be made from Figure 2 is that both the active and the passive domain adaptation-inspired methods clearly outperform the corresponding partially labeled baselines, especially for small fractions of labeled tasks. This indicates that having more labels for fewer tasks, rather than only a few labels for all tasks, can be beneficial not only in terms of annotation costs but also in terms of prediction accuracy. As the number of labeled tasks grows, e.g. to half of all tasks, the performance of the active task selection learner becomes almost identical to that of the Fully Labeled method, even improving over it in the case of multi-source transfer on synthetic data. This confirms the intuition that, in the case of many related tasks, even a fraction of the tasks can contain enough information for solving all of them.

5. Conclusion

In this work we introduced and studied a variant of multi-task learning in which annotated data is available only for some of the tasks. This setting combines aspects of traditional multi-task learning, namely the transfer of information between labeled tasks, with aspects typical of domain adaptation problems, namely transferring information from labeled tasks in order to solve tasks for which only unlabeled data is available. The success of the learner in this setting depends on the effectiveness of the information transfer and on the informativeness of the set of labeled tasks. We analyzed two scenarios: a passive one, in which the set of labeled tasks is predefined, and the active task selection scenario, in which the learner decides for which tasks to query labels. Our main technical contribution is a generalization bound that quantifies the informativeness of the set of labeled tasks and the effectiveness of the information transfer. We demonstrated how the bound can be used to choose the labeled tasks (in the active scenario) and to transfer information between the tasks in a principled way. We also showed that the terms in the bound have intuitive interpretations and indicate under which assumptions on task relatedness the induced algorithm can be expected to work well. Our empirical evaluation demonstrated that the proposed methods also work well in practice.

For future work we plan to further exploit the idea of active learning in the multi-task setting. In particular, we are interested in whether allowing the learner to decide which tasks to label in an iterative way, rather than forcing it to choose all tasks at the same time, could lead to better learning guarantees as well as more effective learning methods.

Acknowledgments

We thank Alexander Zimin and Marius Kloft for useful discussions. This work was in parts funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no 308036.

References

Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Multi-task feature learning. In Conference on Neural Information Processing Systems (NIPS), 2007.
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 2008.

Arora, Sanjeev, Liang, Yingyu, and Ma, Tengyu. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR), 2017.

Arthur, David and Vassilvitskii, Sergei. k-means++: the advantages of careful seeding. In Symposium on Discrete Algorithms (SODA), 2007.

Bahmani, Sohail, Raj, Bhiksha, and Boufounos, Petros T. Greedy sparsity-constrained optimization. Journal of Machine Learning Research (JMLR), 14(1), 2013.

Barzilai, Aviad and Crammer, Koby. Convex multi-task learning by clustering. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

Ben-David, Shai, Blitzer, John, Crammer, Koby, and Pereira, Fernando. Analysis of representations for domain adaptation. In Conference on Neural Information Processing Systems (NIPS), 2007.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. Machine Learning, 2010.

Blanchard, Gilles, Lee, Gyemin, and Scott, Clayton. Generalizing from several related classification tasks to a new unlabeled sample. In Conference on Neural Information Processing Systems (NIPS), 2011.

Blitzer, John, Dredze, Mark, and Pereira, Fernando. Biographies, Bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In Annual Meeting of the Association for Computational Linguistics (ACL), 2007.

Caruana, Rich. Multitask learning. Machine Learning, 1997.

Chapelle, Olivier, Schölkopf, Bernhard, and Zien, Alexander. Semi-supervised learning. The MIT Press, 2006.

Crammer, Koby and Mansour, Yishay. Learning multiple tasks using shared hypotheses. In Conference on Neural Information Processing Systems (NIPS), 2012.

El-Yaniv, Ran and Pechyony, Dmitry. Transductive Rademacher complexity and its applications. In Conference on Learning Theory (COLT), 2007.

Evgeniou, Theodoros and Pontil, Massimiliano. Regularized multi-task learning. In Conference on Knowledge Discovery and Data Mining (KDD), 2004.

Gavves, Efstratios, Mensink, Thomas, Tommasi, Tatiana, Snoek, Cees, and Tuytelaars, Tinne. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In International Conference on Computer Vision (ICCV), 2015.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classification: A deep learning approach. In International Conference on Machine Learning (ICML), 2011.

Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.

Kang, Zhuoliang, Grauman, Kristen, and Sha, Fei. Learning with whom to share in multi-task feature learning. In International Conference on Machine Learning (ICML), 2011.

Khosla, Aditya, Zhou, Tinghui, Malisiewicz, Tomasz, Efros, Alexei, and Torralba, Antonio. Undoing the damage of dataset bias. In European Conference on Computer Vision (ECCV), 2012.

Kifer, Daniel, Ben-David, Shai, and Gehrke, Johannes. Detecting change in data streams. In VLDB, 2004.

Kumar, Abhishek and Daumé III, Hal. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning (ICML), 2012.

Lampert, Christoph H., Nickisch, Hannes, and Harmeling, Stefan. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2013.
Larochelle, Hugo, Erhan, Dumitru, and Bengio, Yoshua. Zero-data learning of new tasks. In Conference on Artificial Intelligence (AAAI), 2008.

Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.

Maurer, Andreas. Concentration inequalities for functions of independent variables. Random Structures and Algorithms, 2006.

McAuley, J. J., Pandey, R., and Leskovec, J. Inferring networks of substitutable and complementary products. In Conference on Knowledge Discovery and Data Mining (KDD), 2015a.

McAuley, J. J., Targett, C., Shi, Q., and van den Hengel, A. Image-based recommendations on styles and substitutes. In International Conference on Research and Development in Information Retrieval (SIGIR), 2015b.

McDiarmid, Colin. On the method of bounded differences. In Surveys in Combinatorics, 1989.

Palatucci, Mark, Pomerleau, Dean, Hinton, Geoffrey E, and Mitchell, Tom M. Zero-shot learning with semantic output codes. In Conference on Neural Information Processing Systems (NIPS), 2009.

Pan, Sinno Jialin, Tsang, Ivor W, Kwok, James T, and Yang, Qiang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks (T-NN), 2011.

Park, Hae-Sang and Jun, Chi-Hyuck. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2), 2009.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Pentina, Anastasia, Sharmanska, Viktoriia, and Lampert, Christoph H. Curriculum learning of multiple tasks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Reichart, Roi, Tomanek, Katrin, Hahn, Udo, and Rappoport, Ari. Multi-task active learning for linguistic annotations. In Annual Meeting of the Association for Computational Linguistics (ACL), 2008.

Ruvolo, Paul and Eaton, Eric. Active task selection for lifelong machine learning. In Conference on Artificial Intelligence (AAAI), 2013.

Saha, Avishek, Rai, Piyush, Daumé III, Hal, and Venkatasubramanian, Suresh. Active online multitask learning. In ICML Workshop on Budget Learning, 2010.

Saha, Avishek, Rai, Piyush, Daumé III, Hal, and Venkatasubramanian, Suresh. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 2000.

Tolstikhin, I., Blanchard, G., and Kloft, M. Localized complexities for transductive learning. In Conference on Learning Theory (COLT), 2014.

Xing, Dikan, Dai, Wenyuan, Xue, Gui-Rong, and Yu, Yong. Bridged refinement for transfer learning. In European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2007.

Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research (JMLR), 8:35-63, 2007.