Published as a conference paper at ICLR 2020

TARGET-EMBEDDING AUTOENCODERS FOR SUPERVISED REPRESENTATION LEARNING

Daniel Jarrett
Department of Mathematics, University of Cambridge, UK
daniel.jarrett@maths.cam.ac.uk

Mihaela van der Schaar
University of Cambridge, UK
University of California, Los Angeles, USA
mv472@cam.ac.uk, mihaela@ee.ucla.edu

ABSTRACT

Autoencoder-based learning has emerged as a staple for disciplining representations in unsupervised and semi-supervised settings. This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional. We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features and predictive of targets, encoding the prior that variations in targets are driven by a compact set of underlying factors. As our theoretical contribution, we provide a guarantee of generalization for linear TEAs by demonstrating uniform stability, interpreting the benefit of the auxiliary reconstruction task as a form of regularization. As our empirical contribution, we extend validation of this approach beyond existing static classification applications to multivariate sequence forecasting, verifying its advantage on both linear and nonlinear recurrent architectures, thereby underscoring the further generality of this framework beyond feedforward instantiations.

1 INTRODUCTION

Representation learning deals with uncovering useful underlying structures of data, and autoencoders (Hinton & Salakhutdinov, 2006) have been a staple in a variety of problems. While much research focuses on their use in unsupervised or semi-supervised settings, with such diverse objectives as sparsity (Ranzato et al., 2007), generation (Kingma & Welling, 2013), and disentanglement (Chen et al., 2018), autoencoders are also useful in purely supervised settings: in particular, adding an auxiliary feature-reconstruction task to supervised classification problems has been shown to empirically improve generalization (Le et al., 2018); in the linear case, the theoretically quantifiable benefit matches that of simplistic norm-based regularization (Bousquet & Elisseeff, 2002; Rosasco & Poggio, 2009).

In this paper, we consider the inverse problem setting, where the target space $Y$ is high-dimensional; for instance, consider the multi-label classification tasks of object tagging, text annotation, and image segmentation. This is in contrast to the vast majority of works designed to tackle a high-dimensional feature space $X$ (where commonly $|X| \gg |Y|$, such as in standard classification problems). In this setting, the usual (and universal) strategy of learning to reconstruct features (Weston et al., 2012; Kingma et al., 2014; Le et al., 2018) may not be most useful: learning latent representations that encapsulate the variation within $X$ does not directly address the more challenging problem of mapping back up to a higher-dimensional $Y$. Instead, we argue for leveraging intermediate representations that are compact and more easily predictable from features, yet simultaneously guaranteed to be predictive of targets.
In the process, we provide a unified theoretical perspective on recent applications of autoencoders to label-embedding in static, high-dimensional classification problems (Yu et al., 2014; Girdhar et al., 2016; Yeh et al., 2017). Extending into the temporal setting, we further empirically demonstrate the generality of target-embedding for recurrent, multivariate sequence forecasting.

Our contributions are three-fold. First, we motivate and formalize the target-embedding autoencoder (TEA) framework: a general approach applicable to any underlying architecture. Second, we provide a theoretical learning guarantee in the linear case by demonstrating uniform stability; specifically, we obtain an $O(1/N)$ bound on instability by analogizing the benefit of the auxiliary reconstruction task to a form of regularization, without incurring additional bias from explicit shrinkage. Finally, we extend empirical validation of this approach beyond the domain of static classification: using the task of multivariate disease trajectory forecasting as a case study, we experimentally validate the advantage that TEAs confer on both linear and nonlinear architectures, using real-world datasets with both continuous and discrete targets. To the best of our knowledge, we are the first to formalize and quantify the theoretical benefit of autoencoder-based target-representation learning in a purely supervised setting, and to extend its application to the domain of multivariate sequence forecasting.

[Figure 1: (a) Feature-embedding and (b) target-embedding autoencoders. Solid lines correspond to the (primary) prediction task; dashed lines to the (auxiliary) reconstruction task. Shared components are involved in both.]

2 TARGET-EMBEDDING AUTOENCODERS

Let $X$ and $Y$ be finite-dimensional vector spaces, and consider the supervised learning problem of predicting targets $y \in Y$ from features $x \in X$. With a finite batch of $N$ training instances $D = \{(x_n, y_n)\}_{n=1}^{N}$, the objective is to learn a mapping $h : X \to Y$ that generalizes well to new samples from the same distribution. The vast majority of existing work considers the setting (most commonly, classification) where $|X| \gg |Y|$; under this scenario, autoencoders are often used to first transform the input into some lower-dimensional representation $z \in Z$ amenable to the downstream task. Doing so involves adding an auxiliary reconstruction loss $\ell_r$ to the primary prediction loss $\ell_p$. Formally, solutions of this form in supervised and semi-supervised settings alike consist of a shared forward model $\phi : X \to Z$, a reconstruction function $r : Z \to X$, and a prediction function $d : Z \to Y$ during training (where the notation $d$ reflects the downstream nature of the prediction task). Denote $\tilde{x} = r(\phi(x))$ and $\hat{y} = d(\phi(x))$; then the complete loss function takes the following form:

$$\frac{1}{N} \sum_{n=1}^{N} \left[ \ell_p(\hat{y}_n, y_n) + \ell_r(\tilde{x}_n, x_n) \right] \tag{1}$$

In contrast, we focus on settings where the target space $Y$ is high-dimensional, and where possibly $|Y| > |X|$. In this case, we argue that learning to reconstruct the input is not necessarily most beneficial. In a simple classification problem, autoencoding inputs leverages the hypothesis that a reconstructive representation is also likely discriminative.
In our setting, however, the more immediate problem is the high-dimensional structure of $Y$; in particular, there is little guarantee that intermediate representations trained to encapsulate $x$ are easily mapped back up to higher-dimensional targets. Our goal is to make use of intermediate representations that are both predictable from features and predictive of targets.

A target-embedding autoencoder (TEA), versus what we shall term a feature-embedding autoencoder (FEA), flips the model architecture around by learning an embedding of target vectors instead, into which a predictor then learns a mapping. This involves an encoder $e : Y \to Z$, an upstream predictor $u : X \to Z$, and a shared forward model $\theta : Z \to Y$. Denote $\tilde{y} = \theta(e(y))$ and $\hat{y} = \theta(u(x))$; the complete loss function is now of the following form:

$$\frac{1}{N} \sum_{n=1}^{N} \left[ \ell_p(\hat{y}_n, y_n) + \ell_r(\tilde{y}_n, y_n) \right] \tag{2}$$

Abstractly, the general idea of target space reduction is not new; in particular, it has been present in various solutions in the domain of multi-label classification (see Section 4 and Appendix B for discussions of related work). Here we focus on target-embedding autoencoders; they leverage the assumption that variations in (high-dimensional) target space are driven by a compact and predictable set of factors. By construction, learning to reconstruct directly in output space ensures that latent representations are predictive of targets; at the same time, jointly training with the prediction loss ensures that latent representations are predictable from features. Instead of learning representations for mapping out of the latent space (downstream), here we learn representations for mapping into it (upstream); the shared forward model handles the rest. See Figure 1 for high-level diagrams of TEAs versus FEAs.

Training and Inference. Figure 2 gives block diagrams of component functions and objectives in (a) FEAs and (b) TEAs during training (see Algorithm 1 in Appendix C for pseudocode). Training occurs in three stages. First, the autoencoder is trained (to learn representations): the parameters of $e$ and $\theta$ are learned on the reconstruction loss. Second, the prediction arm is trained to regress the learned embeddings (generated by the encoder): the parameters of $u$ are learned (on the latent loss) while the autoencoder is frozen. Finally, all three components are jointly trained on both prediction and reconstruction losses (Equation 2): the parameters of the predictor, embedding, and shared forward model are trained simultaneously. Note that during training, the forward model receives two types of latents as input: encodings of true targets, as well as encodings predicted from features. At inference time, the target-embedding arm is dropped, leaving the learned hypothesis $h = \theta \circ u$ for prediction. Figure 4 (Appendix C) provides step-by-step block diagrams of both training and inference in greater detail.

[Figure 2: Functions and objectives in (a) FEAs and (b) TEAs. FEAs compute $\hat{z} = \phi(x; \Phi)$, $\hat{y} = d(\hat{z}; W_d)$, and $\tilde{x} = r(\hat{z}; W_r)$; TEAs compute $z = e(y; W_e)$, $\hat{z} = u(x; W_u)$, and $\hat{y} = \theta(\hat{z}; \Theta)$. Blue and red identify supervised and representation learning components. FEAs are parameterized by $(\Phi, W_d, W_r)$ of $(\phi, d, r)$, and TEAs by $(\Theta, W_u, W_e)$ of $(\theta, u, e)$. Solid lines indicate forward propagation of data; dashed lines indicate backpropagation of gradients.]
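To make the three stages concrete, below is a minimal NumPy sketch of a linear TEA trained by full-batch gradient descent on synthetic data satisfying the compact-factor prior. All sizes, step sizes, and variable names are illustrative assumptions; the paper's models are implemented in TensorFlow (Section 5), and Algorithm 1 in Appendix C gives the authoritative procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: |X| = 20, |Y| = 50, |Z| = 5, N = 1000 training instances.
dx, dy, dz, n = 20, 50, 5, 1000

# Synthetic data respecting the prior: targets driven by a few latent factors.
z_true = rng.normal(size=(n, dz))
x = z_true @ rng.normal(size=(dz, dx)) + 0.1 * rng.normal(size=(n, dx))
y = z_true @ rng.normal(size=(dz, dy)) + 0.1 * rng.normal(size=(n, dy))

# Linear TEA parameters: encoder W_e (e: Y -> Z), upstream predictor W_u
# (u: X -> Z), and shared forward model Theta (theta: Z -> Y).
W_e = 0.01 * rng.normal(size=(dy, dz))
W_u = 0.01 * rng.normal(size=(dx, dz))
Theta = 0.01 * rng.normal(size=(dz, dy))

lr = 1e-3 / n  # full-batch gradient descent step size (illustrative)

def stage1():
    # Stage 1: train the autoencoder (e, theta) on the reconstruction loss.
    global W_e, Theta
    z = y @ W_e
    err = z @ Theta - y                      # residual of l_r(theta(e(y)), y)
    W_e = W_e - lr * (y.T @ (err @ Theta.T))
    Theta = Theta - lr * (z.T @ err)

def stage2():
    # Stage 2: train the predictor u to regress the frozen embeddings.
    global W_u
    err = x @ W_u - y @ W_e                  # latent loss l_z(u(x), e(y))
    W_u = W_u - lr * (x.T @ err)

def stage3():
    # Stage 3: jointly train all three components on Equation 2.
    global W_e, W_u, Theta
    z_hat, z = x @ W_u, y @ W_e
    err_p = z_hat @ Theta - y                # prediction loss l_p
    err_r = z @ Theta - y                    # reconstruction loss l_r
    W_u = W_u - lr * (x.T @ (err_p @ Theta.T))
    W_e = W_e - lr * (y.T @ (err_r @ Theta.T))
    Theta = Theta - lr * (z_hat.T @ err_p + z.T @ err_r)

for stage in (stage1, stage2, stage3):
    for _ in range(2000):
        stage()

# Inference: the embedding arm is dropped; the hypothesis is h = theta o u.
y_hat = (x @ W_u) @ Theta
print("training MSE:", np.mean((y_hat - y) ** 2))
```

Note that, as in the description above, the shared forward model $\Theta$ receives both kinds of latents during the joint stage: encodings of true targets and encodings predicted from features.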
We emphasize that TEAs, as is the case with FEAs, specify a general framework independent of the implementation details of each component. For instance, the solutions to applications in Yu et al. (2014), Yeh et al. (2017), and Girdhar et al. (2016) can be abstractly regarded as linear and nonlinear instances of this framework, with domain-specific architectures (see Section 4 and Appendix B for more detailed discussions). Linear TEAs, which we study in greater detail in Section 3, involve parameterizations $(\Theta, W_u, W_e)$ of $(\theta, u, e)$ consisting of single hidden layers with linear activation.

3 STABILITY-BASED LEARNING GUARANTEE

Two questions are outstanding. The first is theoretical. We are motivated by the prior that variations in target space are driven by a lower-dimensional set of underlying factors. In this context, can we say something more rigorous about the benefit of TEAs? In this section, we take the first step in showing that jointly learning target representations improves generalization performance in the supervised setting. Specifically, we demonstrate that linear TEAs are characterized by uniform stability, from which theoretical guarantees are known to follow. The second question is empirical. We noted above that certain applications of label-embedding to classification can be interpreted through this framework. Does the benefit extend beyond its static, feedforward instantiations into the temporal setting, for multivariate sequence forecasting with both continuous and discrete targets? In Section 5, we first validate our theoretical findings with linear models and sensitivities, then extend our empirical analysis to the realm of recurrent, nonlinear models for both regression and classification.

Consider a linear TEA, where the upstream predictor is parameterized by $W_u \in \mathbb{R}^{|Z| \times |X|}$, the target embedding by $W_e \in \mathbb{R}^{|Z| \times |Y|}$, and the shared forward model by $\Theta \in \mathbb{R}^{|Y| \times |Z|}$, where $|Z| < |Y|$. Following Equation 2, the complete loss function is given by

$$L = \frac{1}{N} \sum_{n=1}^{N} \left[ \ell_p(\Theta W_u x_n, y_n) + \ell_r(\Theta W_e y_n, y_n) \right].$$

Interpreting the jointly learned autoencoding component as an auxiliary task, we show that the TEA algorithm for learning the shared forward model $\Theta$ is uniformly stable with respect to the domain of the supervised prediction task.
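For concreteness, under quadratic losses (an illustrative assumption on our part; the analysis below requires only admissibility and strong convexity), the linear TEA objective can be written in matrix form:

```latex
% Training instances stacked column-wise: X \in \mathbb{R}^{|X| \times N},
% Y \in \mathbb{R}^{|Y| \times N}.
L(\Theta, W_u, W_e)
  = \frac{1}{N} \Big( \lVert \Theta W_u X - Y \rVert_F^2
                    + \lVert \Theta W_e Y - Y \rVert_F^2 \Big)
% The first term is the prediction loss and the second the reconstruction
% loss of Equation 2; both share the forward model \Theta.
```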
To establish our notation, first recall the following:

Definition 1 (Generalization Bound). Given a learning algorithm $D \mapsto h_D$ that returns hypothesis $h_D$, let $R(h_D) = \int \ell(h_D(x), y) \, d\mu(x, y)$ denote the risk, and $\hat{R}(h_D) = \frac{1}{N} \sum_{n=1}^{N} \ell(h_D(x_n), y_n)$ denote the empirical risk, where $\ell$ is some loss function. A generalization bound is a probabilistic bound on the defect that takes the following form: $R(h_D) - \hat{R}(h_D) \leq \epsilon$ with some confidence $1 - \delta$.

Definition 2 (Uniform Stability). Let $D^i$ denote a modification of batch $D$ where the $i$-th training instance $(x_i, y_i)$ is replaced by an independent and identically distributed example $(x_i', y_i')$. A learning algorithm is said to be $\gamma$-uniformly stable with respect to the loss function $\ell$ if $\forall D \in (X \times Y)^N$, $\forall i \in \{1, ..., N\}$, $\forall (x, y), (x_i', y_i') \in X \times Y$: $|\ell(h_D(x), y) - \ell(h_{D^i}(x), y)| \leq \gamma$. Uniform stability holds if the minimum value of $\gamma$ converges to zero as batch size $N$ increases without limit.

Uniform stability can be used to derive algorithm-dependent generalization bounds. In particular, Bousquet & Elisseeff (2002) first showed that the defect $\epsilon$ of a $\gamma$-uniformly stable algorithm is less than $O\big((\gamma + 1/N)\sqrt{N \log(1/\delta)}\big)$ with probability $1 - \delta$. Feldman & Vondrak (2018) recently demonstrated an improved bound of $O\big(\sqrt{(\gamma + 1/N) \log(1/\delta)}\big)$. Here, we show uniform stability for linear TEAs, where $\gamma$ is $O(1/N)$, from which a tight generalization bound follows immediately. Before we begin, we introduce two additional tools: $c$-strong convexity and $\sigma$-admissibility. Note that these conditions are standard and easily satisfied, for instance by the quadratic loss function; for more context see, for example, Bousquet & Elisseeff (2002), Liu et al. (2016), and Mohri et al. (2018).

Definition 3 (c-Strong Convexity). A differentiable loss function $\ell$ is $c$-strongly convex if $\forall h, h' \in H$: $\langle h'(x) - h(x), \nabla\ell(h'(x), y) - \nabla\ell(h(x), y) \rangle \geq c \, \|h'(x) - h(x)\|_2^2$ for some $c \in \mathbb{R}^+$, where $\nabla\ell(h(x), y)$ denotes the gradient with respect to $h(x)$, and $\langle \cdot, \cdot \rangle$ denotes the dot product operation.

Definition 4 ($\sigma$-Admissibility). A loss function $\ell$ is $\sigma$-admissible with respect to the underlying hypothesis class $H$ if $\forall h, h' \in H$: $|\ell(h'(x), y) - \ell(h(x), y)| \leq \sigma \, \|h'(x) - h(x)\|_2$ for some $\sigma \in \mathbb{R}^+$.
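As a quick sanity check (our illustration, not part of the paper's analysis), the quadratic loss $\ell(h(x), y) = \|h(x) - y\|_2^2$ satisfies both definitions on bounded domains:

```latex
% Strong convexity (Definition 3): \nabla\ell(h(x), y) = 2(h(x) - y), so
\langle h'(x) - h(x),\; \nabla\ell(h'(x), y) - \nabla\ell(h(x), y) \rangle
  = 2\,\lVert h'(x) - h(x) \rVert_2^2,
% i.e. the condition holds with c = 2.
% Admissibility (Definition 4): by the difference of squares,
\lvert \ell(h'(x), y) - \ell(h(x), y) \rvert
  = \lvert \langle h'(x) + h(x) - 2y,\; h'(x) - h(x) \rangle \rvert
  \le \sigma\, \lVert h'(x) - h(x) \rVert_2,
% with \sigma = \sup \lVert h'(x) + h(x) - 2y \rVert_2, which is finite
% whenever model outputs and targets are bounded.
```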
To obtain uniform stability, we make two assumptions, both analogous to prior work arguing from the benefit of learning shared models between tasks. Liu et al. (2016) deals with learning multiple tasks in general, and Le et al. (2018) deals with reconstructing inputs in what we describe as FEAs. Now, in multi-task learning the separate tasks are usually chosen due to some prior relationship between them. In the case of Assumption 1 in Liu et al. (2016) and Assumption 5 in Le et al. (2018), this is assumed to come from similarities in feature structures across tasks; hence their assumptions of cross-representability are made in feature space. (Note that this restricts primary and auxiliary features to be elements of the same space.) Our setting is contrary: the inputs to primary and auxiliary tasks come from different spaces, but are trained to produce similar labels through a compact, shared latent space; hence our assumption of cross-representability will be made in this latent space instead.

Assumption 1 (Representative Vectors). There exists a representative subset of target vectors $B = \{b_1, ..., b_M\} \subseteq \{y_1, ..., y_N\}$ such that the latent representation of any individual $(x, y)$ can be linearly reconstructed from that of the representative subset with small error, i.e. $W_u x = \sum_{m=1}^{M} \alpha_m W_e b_m + \eta$ and $W_e y = \sum_{m=1}^{M} \beta_m W_e b_m + \eta$ for some coefficients $\alpha_m, \beta_m \in \mathbb{R}$, where $\sum_{m=1}^{M} \alpha_m^2 \leq r_\alpha^2$ and $\sum_{m=1}^{M} \beta_m^2 \leq r_\beta^2$ for some $r_\alpha, r_\beta \in \mathbb{R}^+$, and $\eta$ is a small error satisfying $\|\eta\|_2 \leq \varepsilon$.

Remark 1. This assumption is comparatively mild, even for $\varepsilon = 0$. Note that in Liu et al. (2016) the features for separate tasks come from different examples in general, and the similarity of their distributions within $X$ is simply assumed. Here, each pair of inputs to the prediction and reconstruction tasks comes from the same instance, and similarity within $Z$ is explicitly enforced through the (joint) training objective. In addition, observe that the assumption will hold with zero error as long as the number of independent latent vectors is at least $|Z|$. Furthermore, unlike the scenarios in Liu et al. (2016) and Le et al. (2018), we do not require that the input domains of the two tasks be identical. Therefore, for ease of exposition, we assume going forward that $\varepsilon = 0$ (see Remark 6 in Appendix A).

Remark 2. A comparison with Assumption 4 in Le et al. (2018) sheds additional light on why we expect TEAs to be beneficial where $|Y| > |Z|$, in contrast with the (more typical) scenario $|X| > |Z| \geq |Y|$. Critically, the technique in Le et al. (2018) banks on the fact that the prediction arm projects the latent into a lower-dimensional target space. Conversely, Assumption 1 here relies on the fact that the encoding arm maps into the latent space from a higher-dimensional target space (rendering cross-representability therein reasonable). The distinction is crucial: we certainly do not expect any benefit from autoencoding trivially low-dimensional vectors! Note also that here the representative vectors are taken from $Y$; to take them from $X$ instead would be unreasonable. For any compressive autoencoder, we generally expect that if some subset $\{b_1, ..., b_M\} \subseteq \{y_1, ..., y_N\}$ spans $Y$, then $\{W_e b_1, ..., W_e b_M\}$ also spans $Z$, in order to be maximally reconstructive. The same cannot be said of subsets $\{c_1, ..., c_M\} \subseteq \{x_1, ..., x_N\}$ that span $X$: for instance, take $|X| \ll |Z|$.

In addition to being representative in terms of latent values, the set of representative points also needs to be representative in terms of the reconstruction error. First, let $L'$ denote the counterpart to $L$ where the $i$-th sample $(x_i, y_i)$ is replaced by some new instance $(x_i', y_i') \in X \times Y$; that is,

$$L' = \frac{1}{N} \Big[ \ell_p(\Theta W_u x_i', y_i') + \ell_r(\Theta W_e y_i', y_i') + \sum_{n=1; n \neq i}^{N} \big( \ell_p(\Theta W_u x_n, y_n) + \ell_r(\Theta W_e y_n, y_n) \big) \Big].$$

Then, let $\Theta^*, \Theta'^*$ denote the optimal parameters corresponding to the two losses $L$ and $L'$.

Assumption 2 (Representative Errors). Let $L_r^{\setminus i}$ contain the reconstruction errors of the dataset without the $i$-th sample: $L_r^{\setminus i} = \frac{1}{N} \sum_{n=1; n \neq i}^{N} \ell_r(\Theta W_e y_n, y_n)$, and let $L_r^B$ denote the reconstruction error of the representative subset: $L_r^B = \frac{1}{M} \sum_{m=1}^{M} \ell_r(\Theta W_e b_m, b_m)$. Then there exists some $a > 0$ such that for any small $\kappa > 0$:

$$L_r^B(\Theta^*) - L_r^B(\kappa\Theta'^* + (1-\kappa)\Theta^*) + L_r^B(\Theta'^*) - L_r^B(\kappa\Theta^* + (1-\kappa)\Theta'^*) \leq a \left[ L_r^{\setminus i}(\Theta^*) - L_r^{\setminus i}(\kappa\Theta'^* + (1-\kappa)\Theta^*) \right] + a \left[ L_r^{\setminus i}(\Theta'^*) - L_r^{\setminus i}(\kappa\Theta^* + (1-\kappa)\Theta'^*) \right].$$

That is, the difference in reconstruction error $L_r^B$ between the two points $\Theta^*, \Theta'^*$ is upper bounded by some constant factor of the corresponding difference in reconstruction error $L_r^{\setminus i}$ at the two points. Importantly, note that this does not require that the values of the errors $L_r^{\setminus i}$ and $L_r^B$ themselves be similar, only that their differences be similar. This assumption is identical to Assumption 6 in Le et al. (2018), and plays an identical role: we make use of $L_r^B$, which is only dependent on $M$, to allow the bound to decay with $N$; this is in contrast with the generic multi-task analysis of Liu et al. (2016), which, if applied directly to TEAs (as with FEAs), would give a bound that does not decay with $N$.

Theorem 1 (Uniform Stability). Let $\ell_p$ and $\ell_r$ be $\sigma_p$-admissible and $\sigma_r$-admissible loss functions, and let $\ell_r$ be $c$-strongly convex. Then under Assumptions 1 and 2, the following inequality holds:

$$|\ell_p(\Theta^* W_u x, y) - \ell_p(\Theta'^* W_u x, y)| \leq \frac{2(\sigma_p^2 r_\alpha^2 + \sigma_p \sigma_r r_\alpha r_\beta) \, a M}{c N}$$

Proof. Appendix A.

Corollary 1 (Generalization Bound). Consider the same conditions as in Theorem 1; that is, let $\ell_p$ and $\ell_r$ be $\sigma_p$-admissible and $\sigma_r$-admissible losses, and let $\ell_r$ be $c$-strongly convex. Then under Assumptions 1 and 2, the defect $\epsilon$ is less than $O\big(\sqrt{(1/N)\log(1/\delta)}\big)$ with probability at least $1 - \delta$.

Proof. Follows immediately from Theorem 1 (above) and either of the following (similar results hold for both): Theorem 1.2 (Feldman & Vondrak, 2018) and Theorem 12 (Bousquet & Elisseeff, 2002).
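To spell out the step (our restatement of the arithmetic): Theorem 1 exhibits the stability constant, and substituting it into either cited bound gives the rate.

```latex
% Theorem 1 gives the uniform stability constant
\gamma = \frac{2(\sigma_p^2 r_\alpha^2 + \sigma_p \sigma_r r_\alpha r_\beta)\, a M}{c N}
       = O(1/N),
% since \sigma_p, \sigma_r, r_\alpha, r_\beta, a, M, c do not grow with N.
% Substituting into the bound of Feldman & Vondrak (2018):
\epsilon = O\Big( \sqrt{ (\gamma + 1/N) \log(1/\delta) } \Big)
         = O\Big( \sqrt{ (1/N) \log(1/\delta) } \Big).
```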
In supervised learning, it is often easy to make an argument on an intuitive level for the regularizing effect of additional loss terms. In contrast, this analysis allows us to unambiguously identify and quantify the benefit of the embedding component as a regularizer (see Remark 4 in Appendix A).

Remark 3. In the linear label space reduction framework of Yu et al. (2014), uniform convergence is also shown to hold via norm-based regularization. Specifically for uniform stability, a similar bound can also be achieved by adding a strongly convex term to the objective, such as Tikhonov and $\ell_2$ regularization (Bousquet & Elisseeff, 2002; Rosasco & Poggio, 2009; Shalev-Shwartz et al., 2010). Here, however, the joint reconstruction task leverages a different kind of bias: precisely, the assumption that there exist compact and predictable representations of targets. Therefore, the significance of this analysis is that we achieve an equivalent result independent of explicit regularization.

4 RELATED WORK

Our work straddles three threads of research: (1) supervised representation learning with autoencoders, (2) label space reduction for multi-label classification, and (3) stability-based learning guarantees. Appendix B provides a much expanded treatment, and presents summary tables for additional context.

Supervised representation learning. While a great deal of research is devoted to uncovering useful underlying structures of data through autoencoders with various properties, such as sparsity (Ranzato et al., 2007) and disentanglement (Chen et al., 2018), among many others (Tschannen et al., 2018), the goal of better representations is often for the benefit of downstream tasks. Semi-supervised autoencoders jointly optimized on partially-labeled data can obtain compact representations that improve prediction (Weston et al., 2012; Kingma et al., 2014; Ghifary et al., 2016). Furthermore, auxiliary reconstruction is also useful in a purely supervised setting: rather than focusing on how specific architectures better structure unlabeled data, Le et al. (2018) show the simple benefit of feature-reconstruction on supervised classification, a special case of what we describe as FEAs. In contrast, we focus on target-representation learning in the supervised setting, and analyze its benefit under the prior that high-dimensional targets are driven by a compact and predictable set of factors.

We take inspiration from the empirical study of Girdhar et al. (2016), where latent representations of 3D objects are jointly trained to be predictable from 2D images. Their setup can be viewed as a specific instance of TEAs with (nonlinear) convolutional components, with a minor variation in training: in the joint stage, predictors continue to regress the learned embeddings, and gradients only backpropagate from latent space (instead of target space). Unlike the symmetry of our losses (which we require for our analysis above), their common decoder is only shared indirectly (and predictions made indirectly). As it turns out, this does not appear to matter for performance (see Section 5). In Mostajabi et al. (2018), a two-stage procedure is used for semantic segmentation, loosely comparable to the first two stages in TEAs; in contrast to our emphasis on joint training, they study the benefit of a frozen embedding branch in parallel with direct prediction. More broadly related to target-embedding, Dalca et al. (2018) build anatomical priors for biomedical segmentation in unsupervised settings.
Multi-label classification. The general idea of target space dimension reduction has been explored for multi-label classification problems (commonly, annotation based on bags of features). These methods first derive a reduced label space, then subsequently associate inputs to it; methods include compressed sensing (Hsu et al., 2009), principal components (Tai & Lin, 2010), maximum-margin coding (Zhang & Schneider, 2012), and landmarking (Balasubramanian & Lebanon, 2012). Closer to our theme of joint learning, Chen & Lin (2012) first proposed simultaneously minimizing encoding and prediction errors via an SVD formulation. Using generic empirical risk minimization, Yu et al. (2014) formulate the problem as a linear model with a low-rank constraint. While this captures an intuition (similar to ours) of restricted latent factors, their error bounds require norm-based regularization (unlike ours). Recently, Yeh et al. (2017) generalized the label-embedding approach to autoencoders. This flexibly accommodates custom losses to exploit correlations, as well as deep learning for nonlinearities. Our work is related to this line of research, although we operate at a higher level of abstraction, with a significant difference in focus. Their problem is multi-label classification, and their starting point is binary relevance (i.e. label by label). During reduction, they worry about specific losses that capture dependencies within and among spaces. In contrast, we worry about autoencoding at all; that is, we focus on the effect of joint reconstruction on learning the prediction model. Problems can be of any form, classification or regression, and our starting point is direct prediction (i.e. no reconstruction).

Stability and learning guarantees. Generalizability via hypothesis stability was first studied in Rogers & Wagner (1978) and Devroye & Wagner (1979); unlike arguments based on the complexity of the search space (Vapnik & Chervonenkis, 1971; Pollard, 1984; Koltchinskii, 2001), these account for how the algorithm depends on the data. Bousquet & Elisseeff (2002) first formalized the notion of uniform stability sufficient for learnability, and Feldman & Vondrak (2018) use ideas related to differential privacy (Bassily et al., 2016) for further improvement. Separately, while there is a wealth of research on dimensionality reduction and autoencoders (Singh et al., 2009; Mohri et al., 2015; Gottlieb et al., 2016; Epstein & Meir, 2019), these works either operate in the semi-supervised setting, or focus on the benefit of feature representations (not targets), and also do not consider joint learning. The benefit of jointly learning multiple tasks through a common operator (Caruana, 1997) is explored with VC-based (Baxter, 2000) and Rademacher complexity-based (Maurer, 2006; Maurer et al., 2016) analyses. Recently, Liu et al. (2016) showed that the algorithm for learning the shared model in a multi-task setting is uniformly stable. While our argument is based on theirs, we are not interested in a generic bound for all tasks; closer to Le et al. (2018), we focus on the primary prediction task, and leverage the auxiliary reconstruction task for stability. Similarly, we arrive at an $O(1/N)$ bound on instability without an explicit regularization term as in Bousquet & Elisseeff (2002). Unlike them, however, the fundamental distinction of our setting is that $Y$ is high-dimensional (but where the underlying factors are assumed compact); in this sense our focus is the mirror opposite of theirs.
5 EXPERIMENTS AND DISCUSSION

So far, we have formalized a general target-autoencoding framework for supervised learning, and quantified the benefit via uniform stability. Our overall goal in this section is to explore this benefit in a simple controlled setting, such that we can identify and isolate its utility on the prediction task, and investigate any sensitivities of interest. By way of preface, we emphasize two observations from above: (1) In the static, multi-label classification setting, the gain from label-embedding has been studied, including the autoencoder approach of Yeh et al. (2017), which can be viewed as an instantiation of TEAs with sophisticated refinements. (2) The benefit of target-autoencoding is also demonstrated using nonlinear, convolutional architectures in Girdhar et al. (2016), which is also an instantiation of TEAs, also noting significant gains. Therefore a natural question of interest is: Does the utility of target-embedding extend to (nonlinear) recurrent models with sequential data for general, high-dimensional targets (i.e. regression and/or classification)?

Disease Trajectories. In this section, we take the first step in answering this question. As our empirical contribution, we extend validation of target-embedding autoencoders to the domain of multivariate sequence forecasting, exploring its utility on linear and nonlinear sequence-to-sequence architectures. What makes a good testbed? In particular, the progression of diseases (and their markers) is high-dimensional in presentation; at the same time, their evolution is often driven by latent biological dynamics (Szczesniak et al., 2017; Pascoal et al., 2017; Alaa & van der Schaar, 2019). With the increasing importance of early diagnosis and timely intervention in healthcare, the ability to forecast individual disease trajectories ($Y$) in the presence of limited windows of information ($X$) has become increasingly desirable (Donohue et al., 2014; Pham et al., 2017; Bhagwat et al., 2018).

Table 1: Dataset statistics and input/output dimensions used in experiments

| Dataset | Num. patients | Samp. freq. | Target type | Static dim. | Temp. dim. (history) | Temp. dim. (forecast) | Window (history) | Window (forecast) | Effective \|X\| | Effective \|Y\| |
|---|---|---|---|---|---|---|---|---|---|---|
| UKCF | 10,000 | 1 yr. | Binary | 11 | 43 | 34 | 3 | 4 | 140 | 136 |
| ADNI | 1,700 | 6 m. | Continuous | 11 | 26 | 24 | 4 | 8 | 115 | 192 |
| MIMIC | 22,000 | 4 hr. | Mixed | 26 | 361 | 361 | 5 | 5 | 1,831 | 1,805 |

The effective input dimension \|X\| is computed as the dimension of static data plus the product of the width of the historical window (of temporal information) with its dimension (e.g. for UKCF: 11 + 3 × 43 = 140); the effective target dimension \|Y\| is similarly computed as the product of the width of the forecast window (of temporal information) with its dimension (e.g. for UKCF: 4 × 34 = 136).

Datasets. We use three datasets in our experiments. The first consists of a cohort of patients enrolled in the UK Cystic Fibrosis registry (UKCF), which records follow-up trajectories for over 10,000 patients. We are interested in forecasting future trajectories for the 11 possible infections and 23 possible comorbidities (all binary variables) recorded at each follow-up, using past trajectories and basic demographics as input. The second consists of patients in the Alzheimer's Disease Neuroimaging Initiative study (ADNI), which tracks disease progression for over 1,700 patients.
We are interested in forecasting the evolution of the 8 primary biomarkers and 16 cognitive tests (all continuous variables) measured at each visit, using past measurements and basic demographics as input. The third consists of a cohort of patients in intensive care units from the Medical Information Mart for Intensive Care (MIMIC), which records physiological data streams for over 22,000 patients. Likewise, we are interested in forecasting future trajectories for the 361 most frequently measured variables, such as vital signs and lab tests (both binary and continuous variables), again using past measurements and basic demographics as input. See Appendix D for more information on datasets.

Experimental Setup. In each instance, the prediction input is a precedent window of up to width $w_x$, and the prediction target is the succedent window of width $w_y$. For UKCF $(w_x, w_y) = (3, 4)$ at 1-year resolution, for ADNI $(4, 8)$ at 6-month resolution, and for MIMIC $(5, 5)$ at 4-hour (resampled) resolution. All models are implemented in TensorFlow. Linear models consist of a single hidden layer with no nonlinearity; for the nonlinear case, we implement an RNN model for each component using GRUs. See Appendix D for additional detail on model implementation and configuration. For evaluation, we measure the mean squared error (MSE) for continuous targets (averaged across variables), and the area under the precision-recall curve (PRC) and area under the receiver operating characteristic curve (ROC) for binary targets (averaged across variables). We use cross-validation on the training set for hyperparameter tuning, selecting the setting that gives the lowest validation loss averaged across folds. For each model and dataset, we report the average and standard error of each performance metric across 10 different experiment runs, each with a different random train-test split.

Note that forecasting high-dimensional disease trajectories is challenging, and input information is deliberately limited (as is often the case in medical practice); the desired targets are of similar or higher dimension than the inputs (see Table 1). This obviously results in an inherently difficult prediction problem, but one which makes for a good candidate setting to test the utility of target-representation learning. RNN autoencoders have previously been proposed for learning representations of inputs (i.e. FEAs instantiated with RNNs) to improve classification (Dai & Le, 2015), prediction (Lyu et al., 2018), generation (Srivastava et al., 2015), and clustering (Baytas et al., 2017); similarly, their mission is not in excessively optimizing specific architectural novelties to match the state of the art, but rather in exploring the benefit of the autoencoding framework. Here, we learn representations of targets.
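To make the evaluation protocol concrete, the following is a minimal sketch of per-variable metric averaging using standard scikit-learn calls; the function name and array layout are our assumptions, not the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def averaged_metrics(y_true, y_score, binary_cols, continuous_cols):
    """Average each metric across target variables (columns).

    y_true, y_score: arrays of shape (num_samples, num_target_variables),
    with the forecast window flattened across columns. Binary columns are
    assumed to contain both classes (otherwise PRC/ROC are undefined).
    """
    prc = np.mean([average_precision_score(y_true[:, j], y_score[:, j])
                   for j in binary_cols])
    roc = np.mean([roc_auc_score(y_true[:, j], y_score[:, j])
                   for j in binary_cols])
    mse = np.mean([np.mean((y_true[:, j] - y_score[:, j]) ** 2)
                   for j in continuous_cols])
    return {"PRC": prc, "ROC": roc, "MSE": mse}
```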
Table 2: Summary results for TEA and comparators on linear model with UKCF (Bold indicates best)

| | Base | REG | FEA | TEA | F/TEA |
|---|---|---|---|---|---|
| PRC(I) | 0.322 ± 0.099* | 0.347 ± 0.085* | 0.351 ± 0.079* | **0.450 ± 0.035** | 0.414 ± 0.028* |
| PRC(C) | 0.416 ± 0.100* | 0.433 ± 0.083* | 0.455 ± 0.087* | **0.559 ± 0.060** | 0.520 ± 0.052 |
| ROC(I) | 0.689 ± 0.089* | 0.710 ± 0.072* | 0.720 ± 0.073 | **0.767 ± 0.026** | 0.766 ± 0.023 |
| ROC(C) | 0.679 ± 0.091* | 0.700 ± 0.075* | 0.713 ± 0.075 | **0.767 ± 0.042** | 0.755 ± 0.037 |

The two-sample t-test for difference in means is conducted on the results. An asterisk indicates a statistically significant difference in means (p-value < 0.05) relative to the TEA result. PRC and ROC metrics are reported separately for variables representing infections (I) and comorbidities (C). See Tables 9–10 for extended results.

Table 3: Summary results for TEA and comparators on nonlinear (RNN) model (Bold indicates best)

| | UKCF PRC(I) | UKCF PRC(C) | ADNI MSE(B) | ADNI MSE(C) | MIMIC PRC | MIMIC MSE |
|---|---|---|---|---|---|---|
| Base | 0.411 ± 0.035* | 0.497 ± 0.057* | 0.105 ± 0.018* | 0.361 ± 0.064 | 0.142 ± 0.028* | 0.153 ± 0.011 |
| REG | 0.415 ± 0.030* | 0.518 ± 0.052* | 0.096 ± 0.014* | 0.360 ± 0.066 | 0.143 ± 0.019* | 0.152 ± 0.010 |
| FEA | 0.410 ± 0.033* | 0.521 ± 0.054* | 0.092 ± 0.012* | 0.356 ± 0.068 | 0.144 ± 0.030* | 0.152 ± 0.012 |
| TEA | **0.483 ± 0.045** | **0.583 ± 0.072** | **0.063 ± 0.010** | **0.330 ± 0.066** | **0.239 ± 0.039** | **0.150 ± 0.012** |
| F/TEA | 0.457 ± 0.037 | 0.576 ± 0.071 | 0.073 ± 0.010* | 0.338 ± 0.067 | 0.166 ± 0.023* | 0.154 ± 0.011 |

UKCF has binary targets, ADNI continuous targets, and MIMIC mixed targets. The two-sample t-test for difference in means is conducted on the results. An asterisk indicates a statistically significant difference in means (p-value < 0.05) relative to the TEA result. For UKCF, only PRC metrics for infections (I) and comorbidities (C) are shown due to space limitations; for ADNI, MSE metrics are reported separately for targets representing biomarkers (B) and cognitive tests (C). See Tables 13–14, 17–18, and 21–22.

5.1 MAIN RESULTS

Overall Benefit. First, we examine the overall utility of TEAs. To verify the linear case first, Table 2 summarizes the performance of TEA and alternate setups on UKCF. The temporal axis is flattened to simulate ordinary static multi-label classification, and the base case is direct prediction (Base), that is, absent any auxiliary representation learning or regularization. Next, we allow for $\ell_2$-regularization over direct prediction (REG), as well as over all other methods. FEAs differ only by the added feature-reconstruction, and TEAs only by the target-reconstruction; as an additional sensitivity, we also implement a combined approach (F/TEA). More generally, we also wish to examine the benefit of TEA for the nonlinear case: Table 3 summarizes analogous results where component functions are implemented with GRU networks; results are shown for all datasets. Ceteris paribus, we observe that target-representation learning has a notable positive impact on performance. Interestingly, learning representations of inputs does not yield significant benefit, and the hybrid approach (F/TEA) is worse than TEA; this suggests that forcing the intermediate representation to encode both features and targets may be overly constraining. (Note that for the linear model, the instances are restricted to those for which the full input window is available; as a consequence, the results for linear and nonlinear cases are not directly comparable.) Figures 4 (Appendix C) and 5 (Appendix D) give training diagrams for all comparators. Additional experiment results (by model, timestep, and metric) are in Appendix E.1-2.

Source of Gain. There are two (related) interpretations of TEAs. First, we studied the regularization view in Section 3; this concerns the benefit of joint training using both prediction and reconstruction losses. Ceteris paribus, we expect performance to improve purely by dint of the jointly trained TEA objective. Second, the reduction view says that TEAs decompose the (difficult) prediction problem into two (smaller) tasks: the autoencoder learns a compact representation $z$ of $y$, and the predictor learns to map $x$ to $z$. This makes the potential benefit of staged training (Section 2 and Appendix C) intuitively clear, and suggests an alternative: that of simply training the autoencoder and predictor arms in two stages à la Mostajabi et al. (2018).
As a general framework, TEA is a combination of both ideas: all three components are jointly trained in a third stage à la Girdhar et al. (2016). We now account for the improvement in performance due to these two sources of benefit; Table 4 does so for the linear case (on UKCF), and Table 5 for the more general nonlinear case (on all datasets). The No Joint setting isolates the benefit from staged training only. This is analogous to basic unsupervised pretraining (though using targets), and corresponds to omitting the final joint training stage in Algorithm 1. The No Staged setting isolates the benefit from joint training only (without pretraining the autoencoder or predictor), and corresponds to omitting the first two training stages in Algorithm 1. The Neither setting is equivalent to vanilla prediction (REG), without leveraging either of the advantages. We observe that while both sources of benefit are individually important, neither setting performs quite as well as when both are combined. See Appendix E.1-2 for extended results.

Table 4: Summary source of gain and TEA variants on linear model with UKCF (Bold indicates best)

| | Neither | No Joint | No Staged | TEA | TEA(L) | TEA(LP) |
|---|---|---|---|---|---|---|
| PRC(I) | 0.347 ± 0.085 | 0.402 ± 0.026 | 0.431 ± 0.031 | 0.450 ± 0.035 | 0.435 ± 0.031 | **0.454 ± 0.036** |
| PRC(C) | 0.433 ± 0.083 | 0.507 ± 0.040 | 0.543 ± 0.054 | 0.559 ± 0.060 | 0.544 ± 0.053 | **0.560 ± 0.061** |
| ROC(I) | 0.710 ± 0.072 | 0.747 ± 0.022 | 0.764 ± 0.022 | 0.767 ± 0.026 | 0.759 ± 0.025 | **0.768 ± 0.028** |
| ROC(C) | 0.700 ± 0.075 | 0.744 ± 0.038 | 0.766 ± 0.038 | **0.767 ± 0.042** | 0.760 ± 0.042 | **0.767 ± 0.042** |

No Joint omits final joint training, and No Staged skips the (pre-)training stages. PRC and ROC metrics are reported separately for targets representing infections (I) and comorbidities (C). See Tables 11–12 for extended results.

Table 5: Summary source of gain and TEA variants on nonlinear (RNN) model (Bold indicates best)

| | UKCF PRC(I) | UKCF PRC(C) | ADNI MSE(B) | ADNI MSE(C) | MIMIC PRC | MIMIC MSE |
|---|---|---|---|---|---|---|
| Neither | 0.415 ± 0.030 | 0.518 ± 0.052 | 0.096 ± 0.014 | 0.360 ± 0.066 | 0.143 ± 0.019 | 0.152 ± 0.010 |
| No Joint | 0.455 ± 0.039 | 0.574 ± 0.069 | 0.092 ± 0.014 | 0.353 ± 0.070 | 0.183 ± 0.038 | 0.151 ± 0.011 |
| No Staged | 0.424 ± 0.031 | 0.543 ± 0.061 | 0.106 ± 0.022 | 0.363 ± 0.067 | 0.167 ± 0.022 | 0.150 ± 0.012 |
| TEA | **0.483 ± 0.045** | **0.583 ± 0.072** | 0.063 ± 0.010 | 0.330 ± 0.066 | 0.239 ± 0.039 | 0.150 ± 0.012 |
| TEA(L) | **0.483 ± 0.047** | 0.581 ± 0.074 | **0.058 ± 0.012** | 0.330 ± 0.076 | **0.249 ± 0.049** | **0.149 ± 0.012** |
| TEA(LP) | 0.480 ± 0.044 | **0.583 ± 0.072** | 0.064 ± 0.012 | **0.329 ± 0.068** | 0.229 ± 0.039 | 0.151 ± 0.011 |

No Joint omits final joint training, and No Staged skips the (pre-)training stages. UKCF has binary targets, ADNI continuous targets, and MIMIC mixed targets. For UKCF, only PRC metrics for infections (I) and comorbidities (C) are shown due to space limitations; for ADNI, MSE metrics are reported separately for targets representing biomarkers (B) and cognitive tests (C). See Tables 15–16, 19–20, and 23–24.

Variations. Having established the utility of target-embedding, we can ask whether variations on the same theme perform similarly. In particular, the embeddings in the empirical studies of Girdhar et al. (2016) and Yeh et al. (2017) are jointly learned via the reconstruction loss $\ell_r$ and latent loss $\ell_z$; that is, the prediction arm continues to regress learned embeddings during the joint training stage (Figure 4(d), in Appendix D). The principle is similar, although (as noted in Section 4) the primary task is therefore learned indirectly; this is in contrast to the vanilla TEA setup, where the primary task is learned directly via the prediction loss $\ell_p$.
Tables 4 and 5 also compare the performance of vanilla TEAs with this indirect variant (TEA(L)), as well as a hybrid variant (TEA(LP)) for which both latent and prediction losses are trained jointly with the reconstruction loss (Figure 4(e)). Perhaps as expected, we observe that performance across all three variants is more or less identical, affirming the general benefit of target-representation learning. Again, see Appendix E.1-2 for extended results.

5.2 SENSITIVITIES

Regularization. Of course, target-representation learning is not a replacement for other regularization strategies; it is an additional tool that can be used in parallel where appropriate. Figure 3(a) shows the performance of TEA and REG with various coefficients $\nu$ on $\ell_2$-regularization. By itself, introducing $\ell_2$-regularization does improve performance up to a certain point, beyond which the additional shrinkage bias incurred begins to be counterproductive; this is not surprising. Interestingly, introducing target-representation learning appears to leverage an orthogonal bias: it consistently improves prediction performance regardless of the level of shrinkage. This is a practical result of the theoretical observation in Remark 3: while prior works obtain stability through explicit $\ell_2$-regularization, the benefit from target-embedding relies on a different bias entirely, which allows us to combine them. While increasing the strength of either form of regularization reduces variability in results (see also below), excessive bias of either alone degrades performance. See Appendix E.3 for full results.

Strength of Prior. Target-embedding attempts to leverage the assumption that there exist compact and predictable representations of targets. Realistically (e.g. due to measurement noise), of course, this will not hold perfectly. In our experiments, we set the ratio of prediction and reconstruction losses to be 1:1 for TEA (as well as FEA and F/TEA); that is, the strength-of-prior coefficient $\lambda$ on $\ell_r$ is 0.5. In order to isolate the effect of $\lambda$ during joint training, we observe the performance of TEAs with joint training only (i.e. removing the confounding effect of staged training). For large values of $\lambda$, we expect the reconstruction task to dominate in priority, which is (under an imperfect prior) not beneficial for the ultimate prediction task (in general, a hidden representation that is most reconstructive is not necessarily also what is most predictable). For small values of $\lambda$, the setup begins to resemble direct prediction. Figure 3(b) verifies our intuition. Note that in the extreme case of $\lambda = 1$, predictions are no better than random (see ROC ≈ 0.5). See Appendix E.3 for full results.
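One natural way to write the interpolated objective just described (our formalization, consistent with $\lambda = 0.5$ recovering the 1:1 loss ratio of Equation 2 up to scaling):

```latex
% Strength-of-prior coefficient \lambda on the reconstruction loss:
L(\lambda) = \frac{1}{N} \sum_{n=1}^{N}
  \big[ (1 - \lambda)\, \ell_p(\hat{y}_n, y_n)
      + \lambda\, \ell_r(\tilde{y}_n, y_n) \big],
\qquad \lambda \in [0, 1].
% \lambda = 0.5 weighs the two losses 1:1 (Equation 2, up to a constant);
% \lambda \to 0 approaches direct prediction; \lambda = 1 trains the
% reconstruction task only, so predictions are no better than random.
```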
[Figure 3: Sensitivities on (a) the $\ell_2$-regularization coefficient $\nu$, (b) the strength-of-prior coefficient $\lambda$, and (c) the training size $N$, for direct prediction (REG) and with target-embedding (TEA) on the linear model with UKCF. For sensitivities on $\lambda$, we perform joint training only, so that we isolate the effect of the joint reconstruction task (i.e. removing the confounding effect of staged training). Standard errors are indicated with shaded regions. For full results, see Tables 25–28 for sensitivities on $\nu$, Tables 29–30 for sensitivities on $\lambda$, and Tables 31–34 for sensitivities on $N$.]

Sample Complexity. Figure 3(c) shows the performance of TEA and REG under various levels of data scarcity. The benefit conferred by TEAs is significant, especially when the amount of training data $N$ is limited. Importantly, note that we are operating strictly within the context of supervised learning: unlike in semi-supervised settings, here we are not just restricting access to paired data; we are restricting access to data per se. (Though beyond the scope of this paper, we expect that extending TEAs to semi-supervised learning with additional unpaired data would yield larger gains.) Here, without the luxury of learning from unpaired data, we highlight the comparative advantage purely from the addition of target-representation learning. Again, see Appendix E.3 for full results.

5.3 DISCUSSION

By way of conclusion, we emphasize the importance of our central assumption: that there exist compact and predictable representations of the (high-dimensional) targets. This is critical: target-embedding is not useful where this is not true. Now obviously, learning representations of targets is unnecessary if the output dimension is trivially small (e.g. if the target is a single classification label), or if the problem itself is trivially easy (e.g. if direct prediction is already perfect). Also obvious is the situation where representations cannot possibly be compact (e.g. if all output dimensions are independent of each other), in which case any model with a compressive (bottleneck) representation as an intermediate target may make little sense to begin with. Perhaps less obvious is that we cannot assume that the goals of prediction and reconstruction are always aligned. Just as in learning feature-embeddings (for downstream classification), what is most reconstructive may not necessarily encode what is most discriminative; so too in learning target-embeddings (for upstream prediction), what is most reconstructive may not necessarily encode what is most predictable. In the case of disease trajectories, it is medical knowledge that permits this assumption with some confidence. Appendix E.4 gives an extreme (synthetic) counterexample where this prior is outright false, i.e. where prediction and reconstruction are directly at odds. While certainly contrived, it serves as a caveat about assumptions.

Using the deliberately challenging setting of disease trajectory forecasting with limited information, we have illustrated the nontrivial utility of target-representation learning in a controlled setting with baseline models. While we appreciate that component models in the wild may be more tailored, this setting better allows us to identify and isolate the potential utility of target-autoencoding per se. In addition to verifying our intuitions for the linear case, we have extended empirical validation of target-autoencoding to (nonlinear) sequence-to-sequence recurrent architectures; along the way, we explored the sources of gain from joint and staged training, as well as various sensitivities of interest. Where the prior holds, target-embedding autoencoders are potentially applicable to any high-dimensional prediction task beyond static classification and imaging applications, and exploring their utility for other specific domain architectures may be a practical direction for future research.

ACKNOWLEDGMENTS

This work was supported by Alzheimer's Research UK (ARUK), the US Office of Naval Research (ONR), and the National Science Foundation (NSF): grant numbers ECCS1462245, ECCS1533983, and ECCS1407712.
We thank the UK Cystic Fibrosis Trust, the Alzheimer s Disease Neuroimaging Initiative, and the MIT Lab for Computational Physiology respectively for making the UKCF, ADNI, and MIMIC datasets available for research. We thank the reviewers for their helpful comments. Ahmed M. Alaa and Mihaela van der Schaar. Attentive state-space modeling of disease progression. In 2019 Conference on Neural Information Processing Systems, 2019. Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. International Conference on Machine Learning, 2018. Krishnakumar Balasubramanian and Guy Lebanon. The landmark selection method for multiple output prediction. Proceedings of the 29th International Conference on Machine Learning, 2012. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240 6249, 2017. Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 1046 1059. ACM, 2016. Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12: 149 198, 2000. Inci M Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K Jain, and Jiayu Zhou. Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 65 74. ACM, 2017. Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines, pp. 567 580. Springer, 2003. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE trans. on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. Nikhil Bhagwat, Joseph D Viviano, Aristotle N Voineskos, M Mallar Chakravarty, Alzheimer s Disease Neuroimaging Initiative, et al. Modeling and prediction of clinical symptom trajectories in alzheimer s disease using longitudinal data. PLo S computational biology, 14(9):e1006376, 2018. Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730 738, 2015. Kush Bhatia, Kunal Dahiya, Himanshu Jain, Yashoteja Prabhu, and Manik Varma. The extreme classification repository, 2019. URL http://manikvarma.org/downloads/XC/ XMLRepository.html. Wei Bi and James Kwok. Efficient multi-label classification with many labels. In International Conference on Machine Learning, pp. 405 413, 2013. Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, et al. Domain separation networks. In Advances in neural information processing systems, pp. 343 351, 2016. Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499 526, 2002. Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. Learning multi-label scene classification. Pattern recognition, 37(9):1757 1771, 2004. Published as a conference paper at ICLR 2020 Rich Caruana. Multitask learning. Machine learning, 28(1):41 75, 1997. Rui M. Castro and Robert D. Nowak. General Bounds for Bounded Losses, in Statistical Learning Theory. 2018. Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 
Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610 2620, 2018. Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pp. 1529 1537, 2012. Moustapha M Cisse, Nicolas Usunier, Thierry Artieres, and Patrick Gallinari. Robust bloom filters for large multilabel classification tasks. In Advances in Neural Information Processing Systems, pp. 1851 1859, 2013. Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079 3087, 2015. Adrian V Dalca, John Guttag, and Mert R Sabuncu. Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9290 9299, 2018. Luc Devroye and Terry Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202 207, 1979. Michael C Donohue, Hélène Jacqmin-Gadda, Mélanie Le Goff, et al. Estimating long-term multivariate progression from short-term data. Alzheimer s & Dementia, 10(5):S400 S410, 2014. Baruch Epstein and Ron Meir. Generalization bounds for unsupervised and semi-supervised learning with autoencoders. ar Xiv preprint ar Xiv:1902.01449, 2019. Vitaly Feldman and Jan Vondrak. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems, pp. 9747 9757, 2018. Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. Multilabel classification via calibrated label ranking. Machine learning, 73(2):133 153, 2008. Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 597 613. Springer, 2016. Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision, pp. 484 499. Springer, 2016. Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105 118, 2016. Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504 507, 2006. Daniel J Hsu, Sham M Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In Advances in neural information processing systems, pp. 772 780, 2009. Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using bayesian compressed sensing. In Advances in Neural Information Processing Systems, pp. 2645 2653, 2012. Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581 3589, 2014. Published as a conference paper at ICLR 2020 Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902 1914, 2001. Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In Advances in Neural Information Processing Systems, pp. 
107 117, 2018. Xin Li and Yuhong Guo. Multi-label classification with feature-aware non-linear label space transformation. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015. Zijia Lin, Guiguang Ding, Mingqing Hu, and Jianmin Wang. Multi-label classification via featureaware implicit label space encoding. In International conference on machine learning, 2014. Tongliang Liu, Dacheng Tao, Mingli Song, and Stephen J Maybank. Algorithm-dependent generalization bounds for multi-task learning. IEEE transactions on pattern analysis and machine intelligence, 39(2):227 241, 2016. Gábor Lugosi and Miroslaw Pawlak. On the posterior-probability estimate of the error rate of nonparametric classification rules. IEEE Trans. on Information Theory, 40(2):475 481, 1994. Xinrui Lyu, Matthias Hüser, Stephanie L Hyland, George Zerveas, and Gunnar Rätsch. Improving clinical predictions through unsupervised time series representation learning. Neur IPS 2018 Machine Learning for Health (ML4H) workshop, 2018. Andreas Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7 (Jan):117 139, 2006. Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853 2884, 2016. Colin Mc Diarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148 188, 1989. Mehryar Mohri, Afshin Rostamizadeh, and Dmitry Storcheus. Generalization bounds for supervised dimensionality reduction. In Feature Extraction: Modern Questions and Challenges, pp. 226 241, 2015. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018. Mohammadreza Mostajabi, Michael Maire, and Gregory Shakhnarovich. Regularizing deep networks by modeling and predicting label structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5629 5638, 2018. Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161 193, 2006. Siddharth Narayanaswamy, T Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pp. 5925 5935, 2017. Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrallynormalized margin bounds for neural networks. International Conference on Learning Representations 2018, 2017. Ozan Oktay, Enzo Ferrante, Konstantinos Kamnitsas, Mattias Heinrich, Wenjia Bai, Jose Caballero, Stuart A Cook, Antonio De Marvao, Timothy Dawes, Declan P O Regan, et al. Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation. IEEE transactions on medical imaging, 37(2):384 395, 2017. Published as a conference paper at ICLR 2020 Tharick A Pascoal, Sulantha Mathotaarachchi, Monica Shin, Andrea L Benedet, Sara Mohades, Seqian Wang, Tom Beaudry, Min Su Kang, Jean-Paul Soucy, Aurelie Labbe, et al. Synergistic interaction between amyloid and tau predicts the progression to dementia. Alzheimer s & Dementia, 13(6):644 653, 2017. Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. 
David Pollard. Convergence of Stochastic Processes. Springer Science & Business Media, 1984.

Piyush Rai, Changwei Hu, Ricardo Henao, and Lawrence Carin. Large-scale bayesian multi-label learning via topic-based label embeddings. In Advances in Neural Information Processing Systems, pp. 3222-3230, 2015.

Marc'Aurelio Ranzato and Martin Szummer. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 792-799. ACM, 2008.

Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pp. 1137-1144, 2007.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546-3554, 2015.

Philippe Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8(Jul):1369-1392, 2007.

William H Rogers and Terry J Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pp. 506-514, 1978.

Lorenzo Rosasco and Tomaso Poggio. Stability of Tikhonov regularization. MIT 9.520 Statistical Machine Learning, 2009.

M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for ANC, Edinburgh, 2000.

Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635-2670, 2010.

Aarti Singh, Robert Nowak, and Jerry Zhu. Unlabeled data: Now it helps, now it doesn't. In Advances in Neural Information Processing Systems, pp. 1513-1520, 2009.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843-852, 2015.

Rhonda D Szczesniak, Dan Li, Weiji Su, Cole Brokamp, John Pestian, Michael Seid, and John P Clancy. Phenotypes of rapid cystic fibrosis lung disease progression during adolescence and young adulthood. American Journal of Respiratory and Critical Care Medicine, 196(4):471-478, 2017.

Farbound Tai and Hsuan-Tien Lin. Multilabel classification with principal label space transformation. International Conference on Machine Learning workshop on Learning from Multi-Label Data, 2010.

Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. NeurIPS 2018 Bayesian Deep Learning workshop, 2018.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pp. 667-685. Springer, 2009.

Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306-6315, 2017.

Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 17:264-280, 1971. doi: 10.1137/1116025.

Kaixiang Wang, Ming Yang, Wanqi Yang, and Yilong Yin. Deep correlation structure preserved label space embedding for multi-label classification. In Asian Conference on Machine Learning, pp. 1-16, 2018.
Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639-655. Springer, 2012.

Baoyuan Wu, Zhilei Liu, Shangfei Wang, Bao-Gang Hu, and Qiang Ji. Multi-label learning with missing labels. In 22nd International Conference on Pattern Recognition, pp. 1964-1968. IEEE, 2014.

Yan Yan, Glenn Fung, Jennifer G Dy, and Romer Rosales. Medical coding classification by leveraging inter-code relationships. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 193-202. ACM, 2010.

Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. Learning deep latent space for multi-label classification. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In International Conference on Machine Learning, pp. 593-601, 2014.

Yi Zhang and Jeff Schneider. Multi-label output codes using canonical correlation analysis. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 873-882, 2011.

Yi Zhang and Jeff Schneider. Maximum margin output coding. Proceedings of the 29th International Conference on Machine Learning, 2012.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from deep generative models. In Proceedings of the 34th International Conference on Machine Learning, pp. 4091-4099. JMLR.org, 2017.

Wen-Ji Zhou, Yang Yu, and Min-Ling Zhang. Binary linear compression for multi-label classification. In IJCAI, pp. 3546-3552, 2017.

Fuzhen Zhuang, Xiaohu Cheng, Ping Luo, Sinno Jialin Pan, and Qing He. Supervised representation learning: Transfer learning with deep autoencoders. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

A PROOF OF THEOREM 1

Definition 5 (Bregman Distance) Let $\ell$ be some differentiable, strictly convex loss. For any $h, h' \in \mathcal{H}$, the Bregman distance associated with $\ell$ is given by $D_\ell(h' \,\|\, h) = \ell(h') - \ell(h) - \langle h' - h, \nabla \ell(h) \rangle$. Additivity and non-negativity are easy to see; that is, $D_\ell(h' \,\|\, h) \geq 0$ for any $h, h' \in \mathcal{H}$, and if $\ell = \ell_1 + \ell_2$ where both $\ell_1$ and $\ell_2$ are convex functions, then $D_\ell(h' \,\|\, h) = D_{\ell_1}(h' \,\|\, h) + D_{\ell_2}(h' \,\|\, h)$.

Theorem 1 (Uniform Stability) Let $\ell_p$ and $\ell_r$ be $\sigma_p$-admissible and $\sigma_r$-admissible loss functions, and let $\ell_r$ be $c$-strongly convex. Then under Assumptions 1 and 2, the following inequality holds (where $\Theta$ and $\Theta'$ denote the optimal models for the losses $L$ and $L'$, respectively):

$$|\ell_p(\Theta W_u x, y) - \ell_p(\Theta' W_u x, y)| \;\leq\; \frac{2(\sigma_p^2 r_\alpha^2 + \sigma_p \sigma_r r_\alpha r_\beta)\, aM}{cN}$$

Proof. The overall strategy consists of three steps in sequence. First, we bound the delta in prediction loss under $\Theta$ and $\Theta'$ using the set of representative vectors $B$. Second, we bound the resulting expression in terms of the Bregman divergence of the complete loss functions $L$ and $L'$ under $\Theta$ and $\Theta'$. Third, we express the divergence back in terms of the original expression itself (consisting of representative vectors), which allows us to solve for a bound on that expression. Finally, combining the results from the three steps completes the proof.
We begin with the left-hand term,

$$|\ell_p(\Theta W_u x, y) - \ell_p(\Theta' W_u x, y)| \;\leq\; \sigma_p \|(\Theta - \Theta') W_u x\|_2 \;=\; \sigma_p \Big\| \sum_{m=1}^{M} \alpha_m (\Theta - \Theta') W_e b_m \Big\|_2 \;\leq\; \sigma_p r_\alpha \sqrt{\sum_{m=1}^{M} \|(\Theta - \Theta') W_e b_m\|_2^2} \qquad (3)$$

where the first inequality follows from the fact that $\ell_p$ is $\sigma_p$-admissible, the equality follows from Assumption 1 for some coefficients $\alpha_m \in \mathbb{R}$, and the second inequality follows from the Cauchy-Schwarz inequality. As our second step, the goal is to upper-bound the term under the square root,

$$\sum_{m=1}^{M} \|(\Theta - \Theta') W_e b_m\|_2^2 \;\leq\; \frac{1}{c} \sum_{m=1}^{M} \big\langle (\Theta - \Theta') W_e b_m,\; \nabla\ell_r(\Theta W_e b_m, b_m) - \nabla\ell_r(\Theta' W_e b_m, b_m) \big\rangle$$
$$=\; \frac{1}{c} \Big\langle \Theta - \Theta',\; \sum_{m=1}^{M} \big[ \nabla\ell_r(\Theta W_e b_m, b_m)\, b_m^\top W_e^\top - \nabla\ell_r(\Theta' W_e b_m, b_m)\, b_m^\top W_e^\top \big] \Big\rangle \;=\; \frac{M}{c} \big[ D_{L_r^B}(\Theta' \,\|\, \Theta) + D_{L_r^B}(\Theta \,\|\, \Theta') \big] \qquad (4)$$

where the first inequality follows from the fact that $\ell_r$ is $c$-strongly convex, and the final equality follows from the definition of the Bregman distance (the standalone loss terms cancel). We want an expression in terms of the loss functions $L$ and $L'$, which will subsequently allow us to obtain a bound expressed back in terms of the set of representative vectors. Focusing on the term in the brackets,

$$D_{L_r^B}(\Theta' \,\|\, \Theta) + D_{L_r^B}(\Theta \,\|\, \Theta') \;=\; \langle \Theta - \Theta', \nabla L_r^B(\Theta) \rangle - \langle \Theta - \Theta', \nabla L_r^B(\Theta') \rangle$$
$$=\; \lim_{\kappa \to 0^+} \frac{1}{\kappa} \big[ L_r^B(\Theta) - L_r^B(\kappa\Theta' + (1-\kappa)\Theta) + L_r^B(\Theta') - L_r^B(\kappa\Theta + (1-\kappa)\Theta') \big]$$
$$\leq\; \lim_{\kappa \to 0^+} \frac{a}{\kappa} \big[ L_r(\Theta) - L_r(\kappa\Theta' + (1-\kappa)\Theta) + L_r'(\Theta') - L_r'(\kappa\Theta + (1-\kappa)\Theta') \big]$$
$$=\; a \big[ \langle \Theta - \Theta', \nabla L_r(\Theta) \rangle - \langle \Theta - \Theta', \nabla L_r'(\Theta') \rangle \big] \;=\; a \big[ D_{L_r}(\Theta' \,\|\, \Theta) + D_{L_r'}(\Theta \,\|\, \Theta') \big] \;\leq\; a \big[ D_L(\Theta' \,\|\, \Theta) + D_{L'}(\Theta \,\|\, \Theta') \big] \qquad (5)$$

where the first and last equalities follow from the definition of the Bregman distance, and the second equality from the definition of directional derivatives. The first inequality follows from Assumption 2; for the second inequality, note that $L_r$ consists of a strict subset of the set of strictly convex losses that $L$ consists of, and similarly for $L_r'$ and $L'$. Therefore, by additivity and non-negativity of the Bregman distance, we have that $D_{L_r}(\Theta' \,\|\, \Theta) \leq D_L(\Theta' \,\|\, \Theta)$ and $D_{L_r'}(\Theta \,\|\, \Theta') \leq D_{L'}(\Theta \,\|\, \Theta')$.

Our third and final step is to go back and bound this term again using the set of representative vectors,

$$D_L(\Theta' \,\|\, \Theta) + D_{L'}(\Theta \,\|\, \Theta') \;=\; L(\Theta') - L(\Theta) + L'(\Theta) - L'(\Theta') \;=\; [L(\Theta') - L'(\Theta')] - [L(\Theta) - L'(\Theta)]$$
$$=\; \frac{1}{N} \big[ \ell_p(\Theta' W_u x_i, y_i) + \ell_r(\Theta' W_e y_i, y_i) \big] - \frac{1}{N} \big[ \ell_p(\Theta' W_u x_i', y_i') + \ell_r(\Theta' W_e y_i', y_i') \big]$$
$$\quad\; - \frac{1}{N} \big[ \ell_p(\Theta W_u x_i, y_i) + \ell_r(\Theta W_e y_i, y_i) \big] + \frac{1}{N} \big[ \ell_p(\Theta W_u x_i', y_i') + \ell_r(\Theta W_e y_i', y_i') \big]$$
$$\leq\; \frac{\sigma_p}{N} \big( \|(\Theta - \Theta') W_u x_i\|_2 + \|(\Theta - \Theta') W_u x_i'\|_2 \big) + \frac{\sigma_r}{N} \big( \|(\Theta - \Theta') W_e y_i\|_2 + \|(\Theta - \Theta') W_e y_i'\|_2 \big)$$
$$\leq\; \frac{2(\sigma_p r_\alpha + \sigma_r r_\beta)}{N} \sqrt{\sum_{m=1}^{M} \|(\Theta - \Theta') W_e b_m\|_2^2} \qquad (6)$$

where for the first equality note that by construction the gradients of the losses $L$ and $L'$ are zero at the respective optimal models $\Theta$ and $\Theta'$ (and that $L$ and $L'$ differ only in the $i$-th training pair). The first inequality follows from the fact that $\ell_p$ and $\ell_r$ are $\sigma_p$-admissible and $\sigma_r$-admissible respectively, and the second inequality follows from Assumption 1 and the Cauchy-Schwarz inequality. Now, combining Equations 4, 5, and 6 allows us to write

$$\sum_{m=1}^{M} \|(\Theta - \Theta') W_e b_m\|_2^2 \;\leq\; \frac{2(\sigma_p r_\alpha + \sigma_r r_\beta)\, aM}{cN} \sqrt{\sum_{m=1}^{M} \|(\Theta - \Theta') W_e b_m\|_2^2},$$

hence $\sqrt{\sum_{m=1}^{M} \|(\Theta - \Theta') W_e b_m\|_2^2} \leq 2(\sigma_p r_\alpha + \sigma_r r_\beta)\, aM / (cN)$, which by substitution into Equation 3 completes the proof.

Remark 4 Formulating the autoencoding component as an auxiliary task allows us to unambiguously interpret its benefit as a regularizer. Specifically, the complete loss can be summarized and rewritten as $L(\Theta) = L_p(\Theta) + R_1(\Theta) + R_2(\Theta)$; that is, the TEA objective is a combination of the primary prediction loss $L_p(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \ell_p(\Theta W_u x_n, y_n)$ plus additional regularization, where $R_1(\Theta) = \frac{1}{N} \sum_{m=1}^{M} \ell_r(\Theta W_e b_m, b_m) = \frac{M}{N} L_r^B(\Theta)$ and $R_2(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \ell_r(\Theta W_e y_n, y_n) - \frac{M}{N} L_r^B(\Theta)$. In particular, the proof of Theorem 1 relies on $R_1(\Theta)$ to upper-bound instability (Appendix A). This precisely identifies the regularizer in question, while Theorem 1 quantifies its generalization benefit.
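The two properties of the Bregman distance invoked above (non-negativity for convex losses, and additivity over sums of convex losses) are elementary to check numerically. The following minimal Python sketch, which is our illustration rather than part of the original analysis, verifies both for a pair of convex quadratic losses:

```python
import numpy as np

def bregman(loss, grad, h2, h1):
    """D_loss(h2 || h1) = loss(h2) - loss(h1) - <h2 - h1, grad(h1)> (Definition 5)."""
    return loss(h2) - loss(h1) - np.dot(h2 - h1, grad(h1))

y = np.array([1.0, -2.0, 0.5])
l1, g1 = lambda h: 0.5 * np.sum((h - y) ** 2), lambda h: h - y    # strictly convex
l2, g2 = lambda h: 0.1 * np.sum(h ** 2),       lambda h: 0.2 * h  # strictly convex
ls, gs = lambda h: l1(h) + l2(h),              lambda h: g1(h) + g2(h)

rng = np.random.default_rng(0)
for _ in range(100):
    h2, h1 = rng.normal(size=3), rng.normal(size=3)
    d1, d2 = bregman(l1, g1, h2, h1), bregman(l2, g2, h2, h1)
    assert d1 >= 0 and d2 >= 0                            # non-negativity
    assert np.isclose(bregman(ls, gs, h2, h1), d1 + d2)   # additivity
```

Non-negativity is what lets the proof discard the prediction-loss terms when passing from $D_{L_r}$ to $D_L$ in Equation 5; additivity is what makes that comparison well-defined in the first place.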
Remark 5 Technicality: Moving from uniform stability (Theorem 1) to generalization bounds (Corollary 1) requires that the loss function not take on arbitrarily large values (see e.g. Bousquet & Elisseeff (2002)). In practice, the label space itself is often bounded (see e.g. Assumption 1 in Le et al. (2018)); then the problem effectively reduces to one with a bounded loss function. For example, consider a regression setting using the quadratic loss function, where the target data lie within $[-U, U]$; since $|\hat{y} - y| \leq 2U$, the loss function is bounded to be within $[0, 4U^2]$. See Castro & Nowak (2018).

Remark 6 Technicality: Earlier we assumed $\varepsilon = 0$ for ease of exposition; carrying around the extra $O(\varepsilon)$ term (from Equations 3 and 6) is not particularly illuminating. Generalizing to $\varepsilon \neq 0$ can be done in a similar manner to Le et al. (2018), with the additional assumptions of bounded spaces and that $\varepsilon$ decreases as $1/N$ (which they note is reasonable, since the more samples in the data, the more likely the cross-representativity assumption will hold with low error). Again, note that $\varepsilon = 0$ holds as long as the number of independent latent vectors is at least $|\mathcal{Z}|$. Similarly, Liu et al. (2016) consider $\varepsilon \neq 0$, noting in any case that they can increase $N$ to obtain a small $\eta$; see their analysis for detail.

B EXPANDED RELATED WORK

In this paper, we motivate and analyze a general autoencoder-based target-representation learning technique in the supervised setting, quantifying the generalization benefit via an argument from uniform stability, as well as verifying its practical utility. As such, our work lies at the intersection of three threads of research: (1) supervised representation learning using autoencoders, (2) label space reduction for multi-label classification, as well as (3) algorithmic stability-based learning guarantees.

B.1 Supervised Representation Learning using Autoencoders

Table 6: Autoencoder-Based Supervised Representation Learning

Work | Contribution | Setting | Type | Embedding
Weston et al. (2012) | Jointly optimized classification and embedding | Semi-supervised learning | Deterministic | Features
Kingma et al. (2014) | Variational inference for generative modeling | Semi-supervised learning | Probabilistic | Features
Narayanaswamy et al. (2017) | Disentangled latent representations | Semi-supervised learning | Probabilistic | Features
Zhuang et al. (2015) | Input- and output-encoding for transfer learning | Semi-supervised learning | Deterministic | Features
Bousmalis et al. (2016) | Paired autoencoders for transfer learning | Semi-supervised learning | Deterministic | Features
Ghifary et al. (2016) | Jointly optimized feature-embedding | Semi-supervised learning | Deterministic | Features
Le et al. (2018) | Jointly optimized feature-embedding | Supervised learning | Deterministic | Features
Dalca et al. (2018) | Generative model using learned target priors | Unsupervised, Unpaired | Probabilistic | Targets
Girdhar et al. (2016) | Jointly optimized target-embedding (Indirect) | Supervised learning | Deterministic | Targets
(Ours) | Jointly optimized target-embedding (Direct) | Supervised learning | Deterministic | Targets

Autoencoder-based representation learning (Hinton & Salakhutdinov, 2006) has long played an important role in unsupervised and semi-supervised settings. Various inductive biases have been proposed to promote representations that are sparse (Ranzato et al., 2007), discrete (van den Oord et al., 2017), factorized (Chen et al., 2018), or hierarchical (Zhao et al., 2017), among others.
For a more thorough overview of various methods, we refer the reader to Bengio et al. (2013) and Tschannen et al. (2018). The goal of better representations is often for the benefit of downstream tasks. Semi-supervised autoencoders trained on partially-labeled data can be jointly optimized to obtain compact representations that improve generalization on supervised tasks (Ranzato & Szummer, 2008; Weston et al., 2012). This naturally extends to representations that are generative (Kingma et al., 2014), disentangled (Narayanaswamy et al., 2017), or hierarchical (Rasmus et al., 2015). In addition, semi-supervised autoencoders enable transfer learning across different domains via jointly-trained reconstruction-classification networks (Ghifary et al., 2016), private-shared partitioned representations (Bousmalis et al., 2016), or by augmenting models with label-encoding layers (Zhuang et al., 2015).

Although less studied, more closely related to our work is the use of autoencoders in a purely supervised setting: rather than focusing on how specific architectural novelties may better structure unlabeled data, Le et al. (2018) instead study the generalization benefit of the simple addition of reconstruction to the supervised classification task, a special case of what we describe as FEAs. Now, all aforementioned studies operate on the basis of autoencoding features (for an explicit or implicit downstream prediction task). In this paper, we instead focus on autoencoder-based target-representation learning (using TEAs) in the supervised setting, and importantly analyze the theoretical and empirical benefits of the approach. Unlike in simple classification (for which FEAs make sense), we are motivated by problems with high-dimensional output spaces, but where we operate under the assumption of a more compact and predictable set of underlying factors. We take inspiration from the empirical investigation of Girdhar et al. (2016), where latent representations of 3D objects (targets) are jointly trained to be predictable from 2D images (features); Oktay et al. (2017) deploy a similar approach for medical image segmentation.

Table 7: Label Space Dimension Reduction via Label Embedding

Work | Contribution | Learning | Problem
Hsu et al. (2009) | Coding with compressed sensing (random projections) | Separate | Multi-label classification
Tai & Lin (2010) | Coding with principal label space transformation (PC-based projections) | Separate | Multi-label classification
Zhang & Schneider (2011) | Coding with maximum margin (between prediction distances) | Separate | Multi-label classification
Balasubramanian & Lebanon (2012) | Landmark selection of labels via group-sparse learning | Separate | Multi-label classification
Bi & Kwok (2013) | Landmark selection of labels via randomized sampling | Separate | Multi-label classification
Chen & Lin (2012) | Feature-aware principal label space transformation for embedding | Joint | Multi-label classification
Yu et al. (2014) | Generic empirical risk minimization formulation and bounds | Joint | Multi-label classification
Yeh et al. (2017) | Autoencoder embedding with canonical correlation analysis | Joint | Multi-label classification
Mostajabi et al. (2018) | Autoencoder component as (Implicit) regularization for learning predictor | Separate | General; Semantic segmentation
(Ours) | Autoencoder component as (Explicit) regularization for learning predictor | Joint | General; Sequence forecasting
On the other hand, in both cases the supervised task is truncated: the predictors are trained to regress the unsupervised embeddings (instead of ground-truth targets), and gradients only backpropagate from the latent space (instead of the target space). This means that their common decoder function is only shared indirectly (and predictions made indirectly), versus the symmetric and simultaneously optimized forward model proposed for TEAs, an important distinction that our analysis relies on to obtain uniform stability. In Mostajabi et al. (2018), a two-stage procedure is used for semantic segmentation, loosely comparable to the first two stages in TEAs; in contrast to our emphasis on joint training, they study the benefit of a frozen embedding branch in parallel with direct prediction. More broadly related to target-embedding, Dalca et al. (2018) build anatomical priors for biomedical segmentation in unsupervised settings.

B.2 Label Space Reduction for Multi-Label Classification

Label space dimension reduction comprises techniques that focus specifically on multi-label classification. Early approaches to multi-label classification employ simplistic transformations such as label power-sets (Boutell et al., 2004), binary relevance (Tsoumakas et al., 2009), and label rankings (Fürnkranz et al., 2008); these are computationally inefficient, and do not capture interdependencies between labels. In contrast, label-embedding methods first derive a latent label space with reduced dimensionality, and subsequently associate inputs to that latent space instead. Encodings have been obtained via random projections (Hsu et al., 2009), principal components-based projections (Tai & Lin, 2010), canonical correlation analysis (Zhang & Schneider, 2011), as well as maximum-margin coding (Zhang & Schneider, 2012). A parallel thread of research has focused on selecting representative and reconstructive subsets of labels through group-sparse learning (Balasubramanian & Lebanon, 2012) and randomized sampling (Bi & Kwok, 2013). Various extensions of label-embedding techniques abound, such as using bloom filters (Cisse et al., 2013), nearest-neighbors (Bhatia et al., 2015), handling missing data (Wu et al., 2014), as well as using binary compression (Zhou et al., 2017).

Closer to our theme of joint learning utilizing both features and targets, Chen & Lin (2012) first proposed simultaneously minimizing the encoding error (from labels) and prediction error (from features) through an SVD formulation. Towards more flexible learning, Lin et al. (2014) did away with explicitly specified encoding functions, proposing to learn code matrices directly, making no assumptions whatsoever. Unifying several prior methods, Yu et al. (2014) cast label-embedding within the generic empirical risk minimization framework as learning a linear model with a low-rank constraint; this perspective captures the generic intuition of a restricted number of latent factors, and admits generalization bounds based on norm-based regularization. Recently, Yeh et al. (2017) generalized the label-embedding approach to autoencoders.
This formulation flexibly allows the addition of specific losses to exploit correlations, a tactic also used in Wang et al. (2018) with multi-dimensional scaling. Furthermore, nonlinearities can be handled by deep learning in component functions, unlike earlier approaches limited to kernel methods (Lin et al., 2014; Li & Guo, 2015).

Our work is related to this general autoencoder approach to label-embedding, although there are significant differences in focus. In particular, we operate at a higher level of abstraction. Label-embedding techniques worry about label reduction, and about specific loss functions that aim to preserve dependencies within and among spaces; their problem is one of multi-label classification, and their baseline is binary relevance. In contrast, we worry about autoencoding at all; that is, we focus on the regularizing effect of the reconstruction loss on learning the prediction model; our baseline is direct prediction, and the output can be of any form (classification or regression). In light of the sizable performance improvement of the autoencoder-based model of Yeh et al. (2017) over comparators using direct prediction, our work can be regarded as a more generalized analysis of the contribution of the autoencoding component. Moreover, unlike the uniform convergence-based analysis in Yu et al. (2014), our bound does not rely on explicit norm-based regularization; instead, we interpret the embedding task itself as an intrinsic form of regularization to derive our stability-based guarantee.

Finally, also worth mentioning is the field of extreme multi-label classification (Bhatia et al., 2015), for which probabilistic methods such as Rai et al. (2015) and Kapoor et al. (2012) present sophisticated approaches to extremely high-dimensional classification problems with advantages in performance and use cases. In light of the medical relevance of our experimental setting, we point out the application of Yan et al. (2010) to medical coding. See Bhatia et al. (2019) for a more detailed overview.

Table 8: Generalizability, Multiple Tasks, and Algorithmic Stability

Work | Contribution | Setting | Focus | Learning
Bousquet & Elisseeff (2002) | Uniform stability for generalization | Supervised, general | Single task | -
Feldman & Vondrak (2018) | Improve on Bousquet & Elisseeff's bound | Supervised, general | Single task | -
Baxter (2000) | VC-dimension for generalization | Multi-task learning | All tasks | Jointly
Maurer (2006) | Rademacher complexity for generalization | Multi-task learning | All tasks | Jointly
Liu et al. (2016) | Uniform stability for generalization | Multi-task learning | All tasks | Jointly
Mohri et al. (2015) | Rademacher complexity for generalization | Feature-embedding | Primary task | Separately
Epstein & Meir (2019) | Non-contractiveness and semi-supervision | Feature-embedding | Primary task | Separately
Le et al. (2018) | Uniform stability for generalization | Feature-embedding | Primary task | Jointly
(Ours) | Uniform stability for generalization | Target-embedding | Primary task | Jointly

B.3 Algorithmic Stability and Learning Guarantees

Generalizability is central to machine learning, and its analysis via hypothesis stability is first studied in Rogers & Wagner (1978) and Devroye & Wagner (1979). Unlike arguments based on the complexity of the search space (Vapnik & Chervonenkis, 1971; Pollard, 1984; Koltchinskii, 2001), stability-based approaches account for how the model produced by the algorithm depends on the data.
Based on concentration inequalities (McDiarmid, 1989), improved bounds are developed in Lugosi & Pawlak (1994) by estimating posterior error probabilities. The landmark work of Bousquet & Elisseeff (2002) first formalizes the notion of uniform stability sufficient for learnability, obtaining relatively strong bounds for several regularization algorithms, and Feldman & Vondrak (2018) recently use ideas related to differential privacy (Bassily et al., 2016) for further improved bounds without additional assumptions. For further context, see Mukherjee et al. (2006) and Shalev-Shwartz et al. (2010).

For semi-supervised representation learning, Rigollet (2007) first introduces the notion of cluster excess-risk and convergence, formalizing the clustering criterion for unlabeled features to be useful (Seeger, 2000). Based on the clustering assumption, Singh et al. (2009) develop a finite sample analysis to quantify the performance improvement from unlabeled features. Focusing on autoencoders, Epstein & Meir (2019) adapt recent margin, norm, and compression-based results for deep networks (Bartlett et al., 2017; Neyshabur et al., 2017; Arora et al., 2018), and relate generalization of feature reconstructions to the benefit of additional unlabeled features for the primary classification task. In the context of supervised problems, Mohri et al. (2015) and Gottlieb et al. (2016) analyze the generalization properties of dimensionality reduction techniques for features with respect to a downstream task; however, rather than joint training, the primary task is optimized subsequently over the learned representations.

Taking a joint, multi-task approach (Caruana, 1997), Baxter (2000) first leverages the inductive bias of a common optimal hypothesis class to obtain a VC-based generalization bound. Maurer (2006) and Maurer et al. (2016) argue from Rademacher complexity to illustrate the benefit of the common operator; however, they only consider the task-averaged benefit, whereas we want to focus specifically on the primary task. There has been some work on generalization for each task (Ben-David & Schuller, 2003), but limited to binary classification, contrary to our setting. Arguing from stability, our approach is related to Liu et al. (2016) in showing that the algorithm for learning the shared model in a multi-task setting is uniformly stable. Our analysis also resembles Le et al. (2018) in the more specific setting where the bound for the primary prediction task is obtained with assistance from the auxiliary reconstruction loss; unlike Liu et al. (2016), we are not interested in a generic bound for all tasks. Again, however, the fundamental (and motivating) difference of our work stems from the (inverted) problem setting and resulting framework. In the vast majority of works, the primary task is one of classification (or more generally |X| ≫ |Y|), where feature-embeddings make sense to learn. Instead, we attend to the setting in which Y is high-dimensional (but where the underlying factors are assumed to be compact). In this setting, we argue (theoretically and empirically) that target-embeddings make more sense to learn in an auxiliary reconstruction task.

C EXPANDED ALGORITHM DETAIL

In the following, Algorithm 1 gives pseudocode for TEA training. Figure 4 gives detailed block diagrams of component functions and objectives corresponding to each training stage (and variant).
Algorithm 1 Pseudocode for TEA Training
Input: D = {(x_n, y_n)}_{n=1}^N, learning rate ψ, minibatch size N_s
Output: Parameters Θ, W_u, W_e of components θ, u, e
 1: Initialize: Θ, W_u, W_e
 2: while not converged do                              ▷ Stage 1: Learn Target-Embedding
 3:   Sample {(x_n, y_n)}_{n=1}^{N_s} i.i.d. from D
 4:   for n ∈ {1, ..., N_s} do
 5:     z_n ← e(y_n; W_e)                               ▷ Encode
 6:     ỹ_n ← θ(z_n; Θ)                                 ▷ Decode
 7:   end for
 8:   L_r ← (1/N_s) Σ_{n=1}^{N_s} ℓ_r(ỹ_n, y_n)
 9:   W_e ← W_e − ψ ∇_{W_e} L_r
10:   Θ ← Θ − ψ ∇_Θ L_r
11: end while
12: while not converged do                              ▷ Stage 2: Regress Embeddings
13:   Sample {(x_n, y_n)}_{n=1}^{N_s} i.i.d. from D
14:   for n ∈ {1, ..., N_s} do
15:     ẑ_n ← u(x_n; W_u)                               ▷ Predict
16:     z_n ← e(y_n; W_e)                               ▷ Encode
17:   end for
18:   L_z ← (1/N_s) Σ_{n=1}^{N_s} ℓ_z(ẑ_n, z_n)
19:   W_u ← W_u − ψ ∇_{W_u} L_z
20: end while
21: while not converged do                              ▷ Stage 3: Joint Training
22:   Sample {(x_n, y_n)}_{n=1}^{N_s} i.i.d. from D
23:   for n ∈ {1, ..., N_s} do
24:     ẑ_n ← u(x_n; W_u)                               ▷ Predict
25:     z_n ← e(y_n; W_e)                               ▷ Encode
26:     ŷ_n ← θ(ẑ_n; Θ)                                 ▷ Decode
27:     ỹ_n ← θ(z_n; Θ)                                 ▷ Decode
28:   end for
29:   L_p ← (1/N_s) Σ_{n=1}^{N_s} ℓ_p(ŷ_n, y_n)
30:   L_r ← (1/N_s) Σ_{n=1}^{N_s} ℓ_r(ỹ_n, y_n)
31:   W_u ← W_u − ψ ∇_{W_u} L_p
32:   W_e ← W_e − ψ ∇_{W_e} L_r
33:   Θ ← Θ − ψ ∇_Θ [L_p + L_r]
34: end while
35: return Θ, W_u, W_e

[Figure 4: block diagrams omitted; panels show component functions and objectives for (a) Training (Stage 1), (b) Training (Stage 2), (c) Training (Stage 3), TEA, (d) Training (Stage 3), TEA(L), (e) Training (Stage 3), TEA(LP), and (f) Inference Time.]

Figure 4: TEAs consist of a shared forward model θ, upstream predictor u, and target-embedding function e, parameterized by (Θ, W_u, W_e). Blue and red respectively identify the supervised and representation learning components in each arrangement. Solid lines indicate forward propagation of data; dashed lines indicate backpropagation of gradients. (a) First, the autoencoding components e, θ are trained to learn target representations. (b) Next, using the inputs, the prediction arm u is trained to regress the learned embeddings generated by the encoder. (c) Finally, all three components are jointly trained on both prediction and reconstruction losses. (d) In the indirect variant (Girdhar et al., 2016), the predictor continues to regress the learned embeddings, and the latent loss backpropagates through both u and e (TEA(L)). (e) The TEA(LP) variant combines the previous two: both the latent loss and prediction loss are trained jointly together with the reconstruction loss. (f) At inference time, the target-embedding arm is dropped, leaving the hypothesis h = θ ∘ u for prediction.

D ADDITIONAL EXPERIMENT DETAIL

The UK Cystic Fibrosis registry records follow-up trajectories for over 10,000 patients over the period from 2008 to 2015, with a total of over 60,000 hospital visits. Each patient is associated with 90 variables over time, including data on treatments and diagnoses for 23 possible comorbidities (e.g. ABPA, diabetes, hypertension, pancreatitis), 11 possible infections (e.g. aspergillus, burkholderia cepacia, klebsiella pneumoniae), as well as static demographic information (e.g. gender, genetics, smoking status).
Using both static and temporal information in a precedent window, we forecast the future trajectories for the diagnoses of infections and comorbidities (all binary variables) recorded at each follow-up.

The Alzheimer's Disease Neuroimaging Initiative (ADNI) study tracks disease progression for over 1,700 patients over the period from 2004 to 2016, with a total of over 10,000 (bi-annual) clinical visits. We focus on the 8 primary quantitative biomarkers (e.g. entorhinal cortex, fusiform gyrus, hippocampus), 16 cognitive tests (e.g. ADAS11, CDR sum of boxes, mini mental state exam), as well as static demographic information (e.g. apolipoprotein E4, education level, ethnicity); we omit the remaining variables, for which the rate of missingness is over 50%. Using a precedent window, we forecast the future evolution of the primary quantitative biomarkers and cognitive test results (all continuous variables) measured at each visit.

The Medical Information Mart for Intensive Care (MIMIC) records physiological data streams for patients admitted to intensive care units after 2008. We use over 22,000 patients with a total of over 500,000 measurements (resampled at 4-hour intervals). We focus on the most frequently measured vital signs and lab tests (e.g. heart rate, oxygen saturation, respiratory rate) recorded over time (with categorical variables binarized, this gives a total of 361 variables), as well as static demographic information (e.g. admission type, gender, location, marital status); we omit the remaining variables, for which the rate of missingness is over 50%. Using a precedent window, we forecast the subsequent window of those variables.

[Figure 5: block diagrams omitted; panels show component functions and training objectives for the comparators.]

Figure 5: Component functions and training objectives for comparators in experiments. Blue and red respectively identify the supervised and representation learning components in each arrangement. Solid lines indicate forward propagation of data; dashed lines indicate backpropagation of gradients. (a) The baseline is direct prediction (with (REG) and without (Base) ℓ2-regularization), which simply corresponds to removing the autoencoder; here we explicitly identify some intermediate hidden layer to preserve visual correspondence with the autoencoder models, but note that the latent code is, strictly speaking, a misnomer as nothing is being encoded here. (b) FEAs consist of a shared forward model φ, downstream predictor d, and feature reconstructor r, parameterized by (Φ, W_d, W_r). (c) TEAs consist of a shared forward model θ, upstream predictor u, and target-embedding function e, parameterized by (Θ, W_u, W_e). (d) As an additional sensitivity, F/TEAs combine the previous two, forcing intermediate representations to encode both features and targets.

For each dataset, sequences are randomized at the patient level in order to obtain splits for training (and validation) and testing. We implement all models using TensorFlow.
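To make the staged procedure of Algorithm 1 concrete, the following is a minimal, self-contained NumPy sketch of the linear, quadratic-loss case; it is our illustration under synthetic placeholder data and dimensions, not the TensorFlow implementation used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dx, dy, dz = 256, 10, 50, 5           # toy sizes: samples, |X|, |Y|, |Z|
psi, iters = 1e-2, 500                   # learning rate, steps per stage
X = rng.normal(size=(N, dx))             # features
Y = X @ rng.normal(size=(dx, dy)) + 0.1 * rng.normal(size=(N, dy))  # targets

We = 0.1 * rng.normal(size=(dz, dy))     # target-embedding encoder e
Wu = 0.1 * rng.normal(size=(dz, dx))     # upstream predictor u
Th = 0.1 * rng.normal(size=(dy, dz))     # shared decoder theta

for _ in range(iters):                   # Stage 1: learn target embedding
    Z = Y @ We.T                         # z = e(y)
    G = 2.0 * (Z @ Th.T - Y) / N         # grad of L_r w.r.t. reconstruction
    dTh, dWe = G.T @ Z, (G @ Th).T @ Y
    Th -= psi * dTh
    We -= psi * dWe

for _ in range(iters):                   # Stage 2: regress the embeddings
    Gz = 2.0 * (X @ Wu.T - Y @ We.T) / N # grad of latent loss L_z
    Wu -= psi * (Gz.T @ X)

for _ in range(iters):                   # Stage 3: joint training
    Zh, Z = X @ Wu.T, Y @ We.T
    Gp = 2.0 * (Zh @ Th.T - Y) / N       # prediction loss L_p
    Gr = 2.0 * (Z @ Th.T - Y) / N        # reconstruction loss L_r
    dWu, dWe = (Gp @ Th).T @ X, (Gr @ Th).T @ Y
    dTh = Gp.T @ Zh + Gr.T @ Z           # both losses drive shared theta
    Wu -= psi * dWu
    We -= psi * dWe
    Th -= psi * dTh

# Inference drops the embedding arm entirely: h = theta(u(x)).
print("train MSE:", np.mean((X @ Wu.T @ Th.T - Y) ** 2))
```

Note how Stage 3 mirrors lines 31-33 of Algorithm 1: the prediction loss updates the predictor, the reconstruction loss updates the encoder, and the shared decoder receives gradients from both.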
For the linear model, each component (encoder, decoder, and predictor) consists of a single layer, no bias term, and linear activation; static (demographic) and temporal data are concatenated and flattened for both features and targets. For the nonlinear case, we implement each component as an RNN using GRUs with the number of hidden layers ζ ∈ {1, 2}, where the number of hidden units is equal to the temporal feature dimension and tanh is used for activation; the dimension of the latent space is (therefore) equal to the hidden state dimension. (For even larger hidden capacities, the increased number of parameters rapidly degrades performance.) Static (demographic) information is incorporated as a mapping into the initial state for recurrent cells.

Training is performed using the Adam optimizer with a learning rate of ψ ∈ {3e-5, 3e-4, 3e-3, 3e-2}. Models are trained until convergence, up to a maximum of 10,000 iterations, with a minibatch size of N_s ∈ {32, 64, 128}; the empirical loss is computed on the validation set every 50 iterations of training, and convergence is determined on the basis of that error. Checkpointing is implemented every 50 iterations, and the best model parameters are restored (upon convergence) for use on the testing set. For all models except Base, we allow the opportunity to select among the ℓ2-regularization coefficients ν ∈ {0, 3e-5, 3e-4, 3e-3, 3e-2}. We set the strength-of-prior coefficient λ = 0.5 for FEA, F/TEA, as well as all variants of TEA (however, we do provide sensitivities on λ for TEA in our experiments). For hyperparameter tuning (ζ, ψ, ν, N_s), we use cross-validation on the training set using 20 iterations of random search, selecting the setting that gives the lowest validation loss averaged across folds. For fair comparison (so as to isolate the effect of supervised representation learning over and above direct prediction), we apply the same setting chosen for REG to FEA, F/TEA, and all variants of TEA; the only difference is therefore the presence or absence of each autoencoding component. For each model and dataset, the experiment is repeated a total of 10 times (each with a different random split of data into training and held-out testing sets); all results are reported as means and standard errors of each performance metric across runs.
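For reference, the tuning loop just described amounts to 20 iterations of random search over the stated grid. A minimal sketch is below; the evaluate_cv callable, which should return the cross-validated loss for a candidate configuration, is a hypothetical placeholder rather than part of the released code:

```python
import random

GRID = {
    "zeta": [1, 2],                       # hidden layers
    "psi":  [3e-5, 3e-4, 3e-3, 3e-2],     # learning rate
    "nu":   [0, 3e-5, 3e-4, 3e-3, 3e-2],  # l2-regularization coefficient
    "Ns":   [32, 64, 128],                # minibatch size
}

def random_search(evaluate_cv, n_iter=20, seed=0):
    """Return the configuration with the lowest validation loss
    averaged across folds, over n_iter random draws from GRID."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_iter):
        cfg = {k: rng.choice(v) for k, v in GRID.items()}
        loss = evaluate_cv(cfg)  # user-supplied cross-validation routine
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg
```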
E ADDITIONAL EXPERIMENT RESULTS

E.1 Results for Linear Models

Table 9: Extended results for TEA and comparators on linear model with UKCF (Bold indicates best)

         τ   Base           REG            FEA            TEA           F/TEA
PRC(I)   1   0.340±0.118*   0.370±0.101*   0.374±0.091*   0.497±0.016   0.454±0.010*
         2   0.325±0.099*   0.350±0.085*   0.353±0.077*   0.459±0.017   0.419±0.012*
         3   0.314±0.088*   0.337±0.076*   0.342±0.071*   0.432±0.015   0.399±0.013*
         4   0.307±0.082*   0.328±0.071*   0.333±0.067*   0.413±0.010   0.386±0.009*
PRC(C)   1   0.445±0.131*   0.467±0.106*   0.499±0.110*   0.653±0.009   0.599±0.019*
         2   0.418±0.099*   0.435±0.081*   0.457±0.083*   0.566±0.010   0.524±0.017*
         3   0.403±0.082*   0.418±0.067*   0.436±0.069*   0.521±0.008   0.487±0.015*
         4   0.398±0.073*   0.412±0.060*   0.426±0.061*   0.498±0.009   0.471±0.013*
ROC(I)   1   0.713±0.104*   0.737±0.081*   0.750±0.078    0.806±0.007   0.801±0.007
         2   0.688±0.089*   0.709±0.070*   0.718±0.068*   0.771±0.007   0.765±0.009
         3   0.677±0.080*   0.697±0.065*   0.706±0.066    0.750±0.007   0.749±0.009
         4   0.677±0.076*   0.696±0.063    0.707±0.068    0.740±0.008   0.748±0.006*
ROC(C)   1   0.713±0.110*   0.741±0.088*   0.756±0.084*   0.829±0.006   0.810±0.008*
         2   0.686±0.091*   0.707±0.072*   0.719±0.071*   0.775±0.008   0.762±0.008*
         3   0.668±0.077*   0.686±0.061*   0.699±0.063    0.743±0.005   0.735±0.007*
         4   0.651±0.070*   0.667±0.056*   0.679±0.057*   0.720±0.006   0.712±0.009*

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result.

Table 10: Summary results for TEA and comparators on linear model with UKCF (Bold indicates best)

         Base           REG            FEA            TEA           F/TEA
PRC(I)   0.322±0.099*   0.347±0.085*   0.351±0.079*   0.450±0.035   0.414±0.028*
PRC(C)   0.416±0.100*   0.433±0.083*   0.455±0.087*   0.559±0.060   0.520±0.052
ROC(I)   0.689±0.089*   0.710±0.072*   0.720±0.073    0.767±0.026   0.766±0.023
ROC(C)   0.679±0.091*   0.700±0.075*   0.713±0.075    0.767±0.042   0.755±0.037

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result. Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
Table 11: Extended source of gain and variants on linear model with UKCF (Bold indicates best)

         τ   No Joint      No Staged     TEA           TEA(L)        TEA(LP)
PRC(I)   1   0.434±0.017   0.472±0.016   0.497±0.016   0.474±0.018   0.502±0.015
         2   0.408±0.016   0.439±0.016   0.459±0.017   0.443±0.018   0.463±0.016
         3   0.390±0.015   0.415±0.015   0.432±0.015   0.420±0.014   0.435±0.013
         4   0.377±0.014   0.399±0.011   0.413±0.010   0.402±0.010   0.415±0.009
PRC(C)   1   0.563±0.025   0.620±0.030   0.653±0.009   0.626±0.012   0.655±0.008
         2   0.511±0.018   0.550±0.022   0.566±0.010   0.550±0.008   0.566±0.010
         3   0.484±0.014   0.512±0.017   0.521±0.008   0.511±0.006   0.521±0.008
         4   0.469±0.014   0.491±0.015   0.498±0.009   0.490±0.009   0.499±0.009
ROC(I)   1   0.779±0.008   0.799±0.007   0.806±0.007   0.797±0.007   0.809±0.005
         2   0.750±0.008   0.765±0.009   0.771±0.007   0.763±0.008   0.772±0.007
         3   0.733±0.008   0.750±0.008   0.750±0.007   0.744±0.007   0.750±0.008
         4   0.725±0.007   0.744±0.005   0.740±0.008   0.734±0.006   0.740±0.007
ROC(C)   1   0.799±0.012   0.821±0.011   0.829±0.006   0.823±0.005   0.830±0.005
         2   0.751±0.012   0.774±0.009   0.775±0.008   0.769±0.006   0.775±0.006
         3   0.723±0.010   0.746±0.007   0.743±0.005   0.737±0.005   0.742±0.005
         4   0.702±0.011   0.722±0.007   0.720±0.006   0.713±0.008   0.720±0.006

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining).

Table 12: Summary source of gain and variants on linear model with UKCF (Bold indicates best)

         No Joint      No Staged     TEA           TEA(L)        TEA(LP)
PRC(I)   0.402±0.026   0.431±0.031   0.450±0.035   0.435±0.031   0.454±0.036
PRC(C)   0.507±0.040   0.543±0.054   0.559±0.060   0.544±0.053   0.560±0.061
ROC(I)   0.747±0.022   0.764±0.022   0.767±0.026   0.759±0.025   0.768±0.028
ROC(C)   0.744±0.038   0.766±0.038   0.767±0.042   0.760±0.042   0.767±0.042

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
E.2 Results for Recurrent Models

Table 13: Extended results for TEA and comparators on RNN model with UKCF (Bold indicates best)

         τ   Base           REG            FEA            TEA           F/TEA
PRC(I)   1   0.451±0.027*   0.456±0.016*   0.448±0.026*   0.549±0.014   0.509±0.014*
         2   0.417±0.024*   0.420±0.013*   0.416±0.022*   0.490±0.012   0.463±0.011*
         3   0.395±0.019*   0.400±0.013*   0.395±0.020*   0.457±0.010   0.437±0.013*
         4   0.380±0.017*   0.385±0.010*   0.380±0.017*   0.434±0.007   0.417±0.013*
PRC(C)   1   0.561±0.056*   0.592±0.029*   0.598±0.029*   0.695±0.010   0.685±0.015
         2   0.504±0.039*   0.523±0.022*   0.527±0.021*   0.591±0.014   0.584±0.018
         3   0.471±0.028*   0.488±0.018*   0.489±0.017*   0.537±0.007   0.530±0.017
         4   0.453±0.023*   0.469±0.017*   0.470±0.016*   0.510±0.007   0.504±0.015
ROC(I)   1   0.788±0.018*   0.791±0.009*   0.794±0.014*   0.827±0.007   0.818±0.006*
         2   0.753±0.015*   0.757±0.011*   0.758±0.017*   0.783±0.008   0.778±0.009
         3   0.736±0.013*   0.741±0.012*   0.740±0.016*   0.760±0.007   0.757±0.010
         4   0.725±0.012*   0.731±0.011*   0.727±0.014*   0.748±0.008   0.744±0.010
ROC(C)   1   0.794±0.022*   0.809±0.015*   0.808±0.012*   0.838±0.007   0.834±0.007
         2   0.750±0.017*   0.761±0.010*   0.761±0.009*   0.782±0.007   0.781±0.008
         3   0.723±0.013*   0.733±0.007*   0.735±0.010*   0.752±0.006   0.751±0.009
         4   0.699±0.009*   0.709±0.009*   0.711±0.010*   0.726±0.006   0.724±0.008

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result.

Table 14: Summary results for TEA and comparators on RNN model with UKCF (Bold indicates best)

         Base           REG            FEA            TEA           F/TEA
PRC(I)   0.411±0.035*   0.415±0.030*   0.410±0.033*   0.483±0.045   0.457±0.037
PRC(C)   0.497±0.057*   0.518±0.052*   0.521±0.054*   0.583±0.072   0.576±0.071
ROC(I)   0.750±0.028*   0.755±0.025    0.755±0.029    0.779±0.031   0.774±0.030
ROC(C)   0.742±0.038    0.753±0.039    0.754±0.037    0.774±0.042   0.772±0.042

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result. Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
Table 15: Extended source of gain and variants on RNN model with UKCF (Bold indicates best)

         τ   No Joint      No Staged     TEA           TEA(L)        TEA(LP)
PRC(I)   1   0.511±0.014   0.468±0.015   0.549±0.014   0.553±0.009   0.545±0.011
         2   0.461±0.013   0.429±0.011   0.490±0.012   0.492±0.013   0.487±0.012
         3   0.434±0.016   0.407±0.011   0.457±0.010   0.457±0.013   0.455±0.014
         4   0.414±0.009   0.392±0.011   0.434±0.007   0.432±0.008   0.433±0.009
PRC(C)   1   0.682±0.011   0.633±0.032   0.695±0.010   0.697±0.010   0.695±0.012
         2   0.581±0.012   0.549±0.025   0.591±0.014   0.589±0.013   0.592±0.011
         3   0.530±0.010   0.506±0.018   0.537±0.007   0.534±0.009   0.538±0.009
         4   0.504±0.009   0.484±0.015   0.510±0.007   0.505±0.008   0.509±0.010
ROC(I)   1   0.816±0.004   0.795±0.008   0.827±0.007   0.825±0.005   0.822±0.010
         2   0.774±0.010   0.759±0.007   0.783±0.008   0.782±0.008   0.778±0.007
         3   0.749±0.010   0.744±0.008   0.760±0.007   0.758±0.006   0.758±0.006
         4   0.732±0.008   0.735±0.008   0.748±0.008   0.739±0.007   0.745±0.004
ROC(C)   1   0.830±0.008   0.816±0.011   0.838±0.007   0.839±0.008   0.839±0.007
         2   0.775±0.010   0.767±0.009   0.782±0.007   0.784±0.010   0.782±0.005
         3   0.743±0.009   0.735±0.007   0.752±0.006   0.751±0.010   0.751±0.007
         4   0.721±0.006   0.712±0.007   0.726±0.006   0.724±0.007   0.726±0.006

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining).

Table 16: Summary source of gain and variants on RNN model with UKCF (Bold indicates best)

         No Joint      No Staged     TEA           TEA(L)        TEA(LP)
PRC(I)   0.455±0.039   0.424±0.031   0.483±0.045   0.483±0.047   0.480±0.044
PRC(C)   0.574±0.069   0.543±0.061   0.583±0.072   0.581±0.074   0.583±0.072
ROC(I)   0.768±0.033   0.758±0.024   0.779±0.031   0.776±0.033   0.776±0.030
ROC(C)   0.767±0.042   0.758±0.040   0.774±0.042   0.774±0.044   0.774±0.043

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
Table 17: Extended results for TEA and comparators on RNN model with ADNI (Bold indicates best)

         τ   Base           REG            FEA            TEA           F/TEA
MSE(B)   1   0.095±0.014*   0.088±0.010*   0.082±0.007*   0.057±0.007   0.065±0.008*
         2   0.097±0.015*   0.089±0.010*   0.084±0.008*   0.057±0.007   0.066±0.007*
         3   0.100±0.015*   0.092±0.010*   0.087±0.008*   0.059±0.007   0.068±0.007*
         4   0.104±0.016*   0.095±0.011*   0.091±0.008*   0.061±0.007   0.071±0.008*
         5   0.105±0.017*   0.097±0.012*   0.093±0.008*   0.062±0.008   0.073±0.008*
         6   0.109±0.017*   0.100±0.013*   0.097±0.009*   0.065±0.008   0.076±0.009*
         7   0.112±0.019*   0.103±0.014*   0.101±0.011*   0.068±0.009   0.080±0.009*
         8   0.115±0.021*   0.106±0.016*   0.105±0.013*   0.072±0.011   0.083±0.010*
MSE(C)   1   0.275±0.013*   0.270±0.013*   0.265±0.011*   0.239±0.015   0.243±0.013
         2   0.300±0.015*   0.295±0.013*   0.290±0.012*   0.265±0.014   0.273±0.014
         3   0.323±0.018*   0.320±0.015*   0.314±0.013*   0.287±0.014   0.297±0.015
         4   0.358±0.019*   0.354±0.018*   0.352±0.017*   0.322±0.015   0.333±0.018
         5   0.371±0.024*   0.370±0.023*   0.367±0.023*   0.341±0.019   0.350±0.021
         6   0.393±0.033    0.393±0.032    0.391±0.034    0.366±0.026   0.374±0.028
         7   0.417±0.043    0.419±0.040    0.417±0.044    0.394±0.035   0.399±0.038
         8   0.453±0.058    0.455±0.054    0.454±0.062    0.430±0.048   0.435±0.057

MSE evaluations reported separately for targets representing quantitative biomarkers (B) and cognitive tests (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result.

Table 18: Summary results for TEA and comparators on RNN model with ADNI (Bold indicates best)

         Base           REG            FEA            TEA           F/TEA
MSE(B)   0.105±0.018*   0.096±0.014*   0.092±0.012*   0.063±0.010   0.073±0.010*
MSE(C)   0.361±0.064    0.360±0.066    0.356±0.068    0.330±0.066   0.338±0.067

MSE evaluations reported separately for targets representing quantitative biomarkers (B) and cognitive tests (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result. Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

Table 19: Extended source of gain and variants on RNN model with ADNI (Bold indicates best)

         τ   No Joint      No Staged     TEA           TEA(L)        TEA(LP)
MSE(B)   1   0.081±0.011   0.098±0.018   0.057±0.007   0.049±0.009   0.057±0.009
         2   0.084±0.011   0.098±0.018   0.057±0.007   0.051±0.010   0.058±0.008
         3   0.087±0.011   0.101±0.019   0.059±0.007   0.054±0.011   0.059±0.008
         4   0.090±0.011   0.105±0.019   0.061±0.007   0.056±0.011   0.062±0.008
         5   0.092±0.011   0.106±0.021   0.062±0.008   0.059±0.011   0.064±0.009
         6   0.097±0.012   0.110±0.023   0.065±0.008   0.063±0.011   0.068±0.010
         7   0.100±0.013   0.113±0.025   0.068±0.009   0.066±0.010   0.072±0.011
         8   0.104±0.014   0.117±0.027   0.072±0.011   0.070±0.011   0.076±0.013
MSE(C)   1   0.258±0.016   0.274±0.017   0.239±0.015   0.231±0.020   0.241±0.016
         2   0.285±0.016   0.297±0.017   0.265±0.014   0.258±0.021   0.266±0.018
         3   0.311±0.017   0.321±0.018   0.287±0.014   0.282±0.021   0.287±0.018
         4   0.346±0.019   0.356±0.020   0.322±0.015   0.319±0.021   0.321±0.022
         5   0.363±0.024   0.373±0.024   0.341±0.019   0.337±0.024   0.338±0.026
         6   0.389±0.033   0.397±0.031   0.366±0.026   0.366±0.030   0.362±0.033
         7   0.416±0.041   0.424±0.040   0.394±0.035   0.401±0.043   0.390±0.044
         8   0.454±0.059   0.462±0.053   0.430±0.048   0.447±0.064   0.427±0.063

MSE evaluations reported separately for targets representing quantitative biomarkers (B) and cognitive tests (C).
The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining).

Table 20: Summary source of gain and variants on RNN model with ADNI (Bold indicates best)

         No Joint      No Staged     TEA           TEA(L)        TEA(LP)
MSE(B)   0.092±0.014   0.106±0.022   0.063±0.010   0.058±0.012   0.064±0.012
MSE(C)   0.353±0.070   0.363±0.067   0.330±0.066   0.330±0.076   0.329±0.068

MSE evaluations reported separately for targets representing quantitative biomarkers (B) and cognitive tests (C). The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

Table 21: Extended results for TEA and comparators on RNN model with MIMIC (Bold indicates best)

      τ   Base           REG            FEA            TEA           F/TEA
PRC   1   0.159±0.034*   0.159±0.022*   0.162±0.036*   0.293±0.031   0.193±0.023*
      2   0.148±0.028*   0.149±0.018*   0.150±0.030*   0.254±0.025   0.174±0.018*
      3   0.139±0.024*   0.141±0.015*   0.142±0.027*   0.230±0.021   0.162±0.015*
      4   0.133±0.021*   0.135±0.012*   0.135±0.024*   0.214±0.018   0.153±0.012*
      5   0.129±0.019*   0.130±0.011*   0.130±0.022*   0.203±0.015   0.147±0.011*
ROC   1   0.699±0.049*   0.704±0.028*   0.709±0.060*   0.801±0.018   0.745±0.021*
      2   0.701±0.044*   0.707±0.025*   0.705±0.050*   0.778±0.015   0.740±0.018*
      3   0.690±0.041*   0.696±0.024*   0.693±0.046*   0.758±0.016   0.726±0.019*
      4   0.681±0.038*   0.688±0.023*   0.684±0.042*   0.745±0.015   0.715±0.019*
      5   0.679±0.037*   0.685±0.023*   0.680±0.043*   0.736±0.012   0.713±0.019*
MSE   1   0.141±0.007    0.140±0.006    0.138±0.010    0.137±0.008   0.139±0.007
      2   0.159±0.010    0.159±0.007    0.160±0.008    0.154±0.009   0.162±0.007*
      3   0.156±0.009    0.155±0.007    0.156±0.008    0.158±0.008   0.158±0.008
      4   0.154±0.008    0.153±0.008    0.153±0.008    0.153±0.009   0.155±0.009
      5   0.154±0.010    0.152±0.010    0.152±0.010    0.150±0.011   0.155±0.010

The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result.

Table 22: Summary results for TEA and comparators on RNN model with MIMIC (Bold indicates best)

      Base           REG            FEA            TEA           F/TEA
PRC   0.142±0.028*   0.143±0.019*   0.144±0.030*   0.239±0.039   0.166±0.023*
ROC   0.690±0.043*   0.696±0.026*   0.694±0.050*   0.763±0.028   0.728±0.023*
MSE   0.153±0.011    0.152±0.010    0.152±0.012    0.150±0.012   0.154±0.011

The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result. Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
Table 23: Extended source of gain and variants on RNN model with MIMIC (Bold indicates best)

      τ   No Joint      No Staged     TEA           TEA(L)        TEA(LP)
PRC   1   0.216±0.044   0.194±0.020   0.293±0.031   0.310±0.047   0.280±0.033
      2   0.194±0.035   0.175±0.017   0.254±0.025   0.265±0.035   0.242±0.026
      3   0.178±0.030   0.163±0.013   0.230±0.021   0.239±0.030   0.221±0.022
      4   0.168±0.027   0.154±0.011   0.214±0.018   0.222±0.026   0.206±0.020
      5   0.160±0.024   0.148±0.010   0.203±0.015   0.210±0.023   0.195±0.018
ROC   1   0.756±0.042   0.742±0.022   0.801±0.018   0.807±0.025   0.791±0.018
      2   0.741±0.031   0.738±0.021   0.778±0.015   0.783±0.017   0.773±0.012
      3   0.726±0.031   0.724±0.019   0.758±0.016   0.761±0.019   0.756±0.013
      4   0.715±0.030   0.715±0.019   0.745±0.015   0.747±0.017   0.742±0.011
      5   0.710±0.031   0.711±0.018   0.736±0.012   0.741±0.019   0.736±0.014
MSE   1   0.138±0.008   0.137±0.007   0.137±0.008   0.136±0.006   0.137±0.008
      2   0.158±0.007   0.156±0.012   0.154±0.009   0.153±0.007   0.154±0.007
      3   0.155±0.009   0.156±0.010   0.158±0.008   0.156±0.010   0.158±0.008
      4   0.152±0.008   0.153±0.009   0.153±0.009   0.151±0.011   0.154±0.008
      5   0.151±0.009   0.150±0.009   0.150±0.011   0.147±0.011   0.150±0.008

The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining).

Table 24: Summary source of gain and variants on RNN model with MIMIC (Bold indicates best)

      No Joint      No Staged     TEA           TEA(L)        TEA(LP)
PRC   0.183±0.038   0.167±0.022   0.239±0.039   0.249±0.049   0.229±0.039
ROC   0.730±0.038   0.726±0.023   0.763±0.028   0.768±0.031   0.759±0.025
MSE   0.151±0.011   0.150±0.012   0.150±0.012   0.149±0.012   0.151±0.011

The "No Joint" setting isolates the benefit from staged training only (analogous to basic unsupervised pretraining, though using targets); the "No Staged" setting isolates the benefit from joint training only (without pretraining). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

E.3 Results for Sensitivities

Table 25: Extended ν-Sensitivities for REG on linear model with UKCF (Bold indicates best)

         τ   ν=0           ν=3e-5        ν=3e-4        ν=3e-3        ν=3e-2
PRC(I)   1   0.340±0.118   0.355±0.114   0.370±0.101   0.320±0.034   0.163±0.003
         2   0.325±0.099   0.338±0.096   0.350±0.085   0.309±0.026   0.176±0.004
         3   0.314±0.088   0.327±0.086   0.337±0.076   0.300±0.024   0.182±0.005
         4   0.307±0.082   0.318±0.080   0.328±0.071   0.293±0.023   0.184±0.004
PRC(C)   1   0.445±0.131   0.460±0.125   0.467±0.106   0.426±0.034   0.240±0.004
         2   0.418±0.099   0.428±0.095   0.435±0.081   0.409±0.025   0.260±0.005
         3   0.403±0.082   0.412±0.078   0.418±0.067   0.399±0.021   0.272±0.006
         4   0.398±0.073   0.406±0.070   0.412±0.060   0.397±0.019   0.281±0.007
ROC(I)   1   0.713±0.104   0.724±0.100   0.737±0.081   0.715±0.022   0.527±0.006
         2   0.688±0.089   0.697±0.086   0.709±0.070   0.693±0.022   0.532±0.008
         3   0.677±0.080   0.686±0.078   0.697±0.065   0.683±0.021   0.543±0.007
         4   0.677±0.076   0.686±0.074   0.696±0.063   0.681±0.019   0.555±0.007
ROC(C)   1   0.713±0.110   0.727±0.105   0.741±0.088   0.716±0.025   0.512±0.010
         2   0.686±0.091   0.696±0.086   0.707±0.072   0.686±0.022   0.523±0.008
         3   0.668±0.077   0.678±0.074   0.686±0.061   0.668±0.021   0.530±0.013
         4   0.651±0.070   0.660±0.066   0.667±0.056   0.652±0.018   0.527±0.015

The ν coefficient controls the strength of ℓ2-regularization applied on top of the original loss function minimized. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C).
Table 26: Summary ν-Sensitivities for REG on linear model with UKCF (Bold indicates best)

         ν=0           ν=3e-5        ν=3e-4        ν=3e-3        ν=3e-2
PRC(I)   0.322±0.099   0.335±0.096   0.347±0.085   0.305±0.029   0.176±0.009
PRC(C)   0.416±0.100   0.426±0.097   0.433±0.083   0.408±0.028   0.263±0.016
ROC(I)   0.689±0.089   0.698±0.087   0.710±0.072   0.693±0.025   0.540±0.013
ROC(C)   0.679±0.091   0.690±0.087   0.700±0.075   0.681±0.032   0.523±0.013

The ν coefficient controls the strength of ℓ2-regularization applied on top of the original loss function minimized. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

Table 27: Extended ν-Sensitivities for TEA on linear model with UKCF (Bold indicates best)

         τ   ν=0           ν=3e-5        ν=3e-4        ν=3e-3        ν=3e-2
PRC(I)   1   0.484±0.020   0.489±0.019   0.497±0.016   0.442±0.005   0.174±0.004
         2   0.450±0.016   0.453±0.016   0.459±0.017   0.414±0.007   0.186±0.005
         3   0.424±0.014   0.426±0.013   0.432±0.015   0.394±0.009   0.192±0.005
         4   0.405±0.012   0.407±0.010   0.413±0.010   0.381±0.008   0.193±0.004
PRC(C)   1   0.641±0.021   0.644±0.019   0.653±0.009   0.612±0.009   0.276±0.007
         2   0.561±0.013   0.562±0.011   0.566±0.010   0.544±0.007   0.293±0.008
         3   0.519±0.008   0.519±0.007   0.521±0.008   0.508±0.005   0.302±0.008
         4   0.495±0.008   0.496±0.007   0.498±0.009   0.489±0.007   0.309±0.009
ROC(I)   1   0.800±0.015   0.803±0.015   0.806±0.007   0.779±0.007   0.555±0.005
         2   0.765±0.009   0.767±0.011   0.771±0.007   0.751±0.007   0.557±0.007
         3   0.746±0.008   0.747±0.008   0.750±0.007   0.735±0.007   0.565±0.006
         4   0.736±0.006   0.737±0.007   0.740±0.008   0.727±0.005   0.575±0.006
ROC(C)   1   0.825±0.011   0.826±0.010   0.829±0.006   0.819±0.006   0.560±0.011
         2   0.772±0.010   0.774±0.009   0.775±0.008   0.771±0.006   0.564±0.009
         3   0.742±0.004   0.744±0.004   0.743±0.005   0.741±0.004   0.566±0.012
         4   0.718±0.006   0.720±0.006   0.720±0.006   0.717±0.008   0.561±0.015

The ν coefficient controls the strength of ℓ2-regularization applied on top of the original loss function minimized. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C).

Table 28: Summary ν-Sensitivities for TEA on linear model with UKCF (Bold indicates best)

         ν=0           ν=3e-5        ν=3e-4        ν=3e-3        ν=3e-2
PRC(I)   0.441±0.033   0.444±0.034   0.450±0.035   0.408±0.024   0.186±0.009
PRC(C)   0.554±0.057   0.555±0.058   0.559±0.060   0.538±0.047   0.295±0.015
ROC(I)   0.762±0.026   0.763±0.028   0.767±0.026   0.748±0.021   0.563±0.010
ROC(C)   0.764±0.041   0.766±0.041   0.767±0.042   0.762±0.039   0.563±0.012

The ν coefficient controls the strength of ℓ2-regularization applied on top of the original loss function minimized. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
Table 29: Extended λ-Sensitivities for TEA on linear model with UKCF (Bold indicates best)

            τ   λ=0            λ=0.01         λ=0.1          λ=0.5          λ=0.9          λ=0.99         λ=1
  PRC(I)    1   0.370±0.101    0.383±0.106    0.412±0.091    0.472±0.016    0.461±0.012    0.327±0.011    0.150±0.008
            2   0.350±0.085    0.361±0.090    0.386±0.076    0.439±0.016    0.431±0.012    0.323±0.011    0.162±0.008
            3   0.337±0.076    0.347±0.081    0.368±0.068    0.415±0.015    0.410±0.009    0.316±0.012    0.168±0.007
            4   0.328±0.071    0.337±0.075    0.357±0.064    0.399±0.011    0.395±0.008    0.307±0.011    0.170±0.007
  PRC(C)    1   0.467±0.106    0.481±0.110    0.528±0.104    0.620±0.030    0.620±0.016    0.433±0.012    0.236±0.012
            2   0.435±0.081    0.445±0.084    0.481±0.079    0.550±0.022    0.553±0.013    0.427±0.010    0.249±0.010
            3   0.418±0.067    0.427±0.070    0.456±0.064    0.512±0.017    0.516±0.009    0.421±0.009    0.259±0.011
            4   0.412±0.060    0.420±0.062    0.445±0.057    0.491±0.015    0.494±0.009    0.415±0.009    0.266±0.011
  ROC(I)    1   0.737±0.081    0.742±0.089    0.764±0.067    0.799±0.007    0.791±0.004    0.708±0.008    0.499±0.013
            2   0.709±0.070    0.713±0.077    0.733±0.058    0.765±0.009    0.760±0.008    0.694±0.007    0.502±0.014
            3   0.697±0.065    0.701±0.071    0.719±0.054    0.750±0.008    0.746±0.008    0.690±0.006    0.500±0.014
            4   0.696±0.063    0.699±0.068    0.715±0.053    0.744±0.005    0.741±0.006    0.690±0.007    0.501±0.015
  ROC(C)    1   0.741±0.088    0.747±0.092    0.775±0.075    0.821±0.011    0.819±0.006    0.725±0.014    0.493±0.034
            2   0.707±0.072    0.712±0.076    0.735±0.061    0.774±0.009    0.774±0.007    0.704±0.012    0.496±0.027
            3   0.686±0.061    0.691±0.067    0.711±0.052    0.746±0.007    0.745±0.003    0.690±0.011    0.497±0.028
            4   0.667±0.056    0.672±0.059    0.689±0.048    0.722±0.007    0.721±0.005    0.675±0.014    0.497±0.025

The λ coefficient controls the strength of the prior, i.e. the tradeoff between the prediction and reconstruction losses. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C).

Table 30: Summary λ-Sensitivities for TEA on linear model with UKCF (Bold indicates best)

            λ=0            λ=0.01         λ=0.1          λ=0.5          λ=0.9          λ=0.99         λ=1
  PRC(I)    0.347±0.085    0.357±0.090    0.381±0.078    0.431±0.031    0.424±0.027    0.318±0.013    0.162±0.011
  PRC(C)    0.433±0.083    0.443±0.087    0.477±0.084    0.543±0.054    0.546±0.050    0.424±0.012    0.252±0.016
  ROC(I)    0.710±0.072    0.714±0.078    0.733±0.061    0.764±0.022    0.759±0.020    0.695±0.010    0.501±0.014
  ROC(C)    0.700±0.075    0.705±0.080    0.727±0.068    0.766±0.038    0.765±0.037    0.698±0.022    0.496±0.029

The λ coefficient controls the strength of the prior, i.e. the tradeoff between the prediction and reconstruction losses. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
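As a concrete reading of the λ column headers, here is a minimal sketch of a joint objective in which λ mixes the two losses as a convex combination. This particular weighting is our illustrative assumption (it is consistent with λ = 1 degenerating to pure reconstruction and near-random ROC in Table 29), not necessarily the exact form used in Algorithm 1.

```python
import numpy as np

def tea_joint_loss(y_pred, y_recon, y, lam):
    # Assumed convex combination: lam=0 is pure prediction, lam=1 pure
    # target reconstruction (matching the degenerate ROC ~ 0.5 at
    # lambda = 1 above). The paper's exact weighting may differ.
    l_pred = np.mean((y_pred - y) ** 2)    # primary prediction loss
    l_recon = np.mean((y_recon - y) ** 2)  # auxiliary reconstruction loss
    return (1 - lam) * l_pred + lam * l_recon
```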
Table 31: Extended N-Sensitivities for REG on linear model with UKCF (Bold indicates best)

            τ   N 1%           N 5%           N 20%          N 50%          N 100%
  PRC(I)    1   0.160±0.010    0.176±0.024    0.199±0.044    0.325±0.114    0.370±0.101
            2   0.172±0.009    0.187±0.021    0.207±0.037    0.312±0.095    0.350±0.085
            3   0.180±0.011    0.192±0.020    0.209±0.032    0.303±0.085    0.337±0.076
            4   0.182±0.009    0.193±0.020    0.208±0.028    0.295±0.078    0.328±0.071
  PRC(C)    1   0.246±0.011    0.268±0.034    0.293±0.046    0.421±0.119    0.467±0.106
            2   0.263±0.011    0.283±0.027    0.304±0.036    0.401±0.091    0.435±0.081
            3   0.275±0.013    0.292±0.028    0.313±0.032    0.390±0.076    0.418±0.067
            4   0.286±0.013    0.302±0.025    0.319±0.029    0.384±0.066    0.412±0.060
  ROC(I)    1   0.512±0.024    0.553±0.046    0.598±0.057    0.705±0.090    0.737±0.081
            2   0.516±0.025    0.557±0.040    0.594±0.049    0.681±0.077    0.709±0.070
            3   0.521±0.026    0.549±0.037    0.590±0.045    0.671±0.071    0.697±0.065
            4   0.519±0.024    0.546±0.037    0.590±0.039    0.669±0.068    0.696±0.063
  ROC(C)    1   0.507±0.026    0.539±0.049    0.591±0.054    0.702±0.101    0.741±0.088
            2   0.520±0.024    0.546±0.040    0.587±0.041    0.675±0.082    0.707±0.072
            3   0.526±0.023    0.549±0.038    0.586±0.036    0.659±0.070    0.686±0.061
            4   0.528±0.027    0.550±0.039    0.581±0.034    0.642±0.062    0.667±0.056

The proportion of data N used is randomly restricted, showing performance under various levels of data scarcity. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C).

Table 32: Summary N-Sensitivities for REG on linear model with UKCF (Bold indicates best)

            N 1%           N 5%           N 20%          N 50%          N 100%
  PRC(I)    0.173±0.013    0.187±0.022    0.206±0.036    0.309±0.094    0.347±0.085
  PRC(C)    0.267±0.019    0.286±0.031    0.307±0.038    0.399±0.092    0.433±0.083
  ROC(I)    0.517±0.025    0.551±0.041    0.593±0.048    0.682±0.078    0.710±0.072
  ROC(C)    0.520±0.026    0.546±0.042    0.586±0.042    0.669±0.083    0.700±0.075

The proportion of data N used is randomly restricted, showing performance under various levels of data scarcity. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

Table 33: Extended N-Sensitivities for TEA on linear model with UKCF (Bold indicates best)

            τ   N 1%           N 5%           N 20%          N 50%          N 100%
  PRC(I)    1   0.199±0.016    0.342±0.040    0.453±0.020    0.486±0.017    0.497±0.016
            2   0.205±0.013    0.336±0.031    0.421±0.017    0.448±0.015    0.459±0.017
            3   0.212±0.018    0.322±0.027    0.398±0.015    0.420±0.014    0.432±0.015
            4   0.210±0.016    0.308±0.025    0.381±0.011    0.402±0.011    0.413±0.010
  PRC(C)    1   0.287±0.021    0.470±0.060    0.610±0.017    0.642±0.012    0.653±0.009
            2   0.292±0.018    0.445±0.041    0.535±0.014    0.557±0.012    0.566±0.010
            3   0.302±0.014    0.424±0.034    0.495±0.011    0.514±0.009    0.521±0.008
            4   0.311±0.017    0.417±0.030    0.478±0.014    0.493±0.010    0.498±0.009
  ROC(I)    1   0.570±0.022    0.698±0.027    0.776±0.008    0.797±0.007    0.806±0.007
            2   0.565±0.027    0.684±0.019    0.744±0.010    0.763±0.007    0.771±0.007
            3   0.568±0.027    0.663±0.021    0.723±0.011    0.740±0.009    0.750±0.007
            4   0.564±0.026    0.652±0.020    0.707±0.010    0.729±0.010    0.740±0.008
  ROC(C)    1   0.581±0.017    0.715±0.032    0.795±0.009    0.822±0.007    0.829±0.006
            2   0.565±0.015    0.684±0.019    0.745±0.006    0.768±0.007    0.775±0.008
            3   0.570±0.015    0.668±0.018    0.718±0.007    0.737±0.005    0.743±0.005
            4   0.570±0.020    0.653±0.015    0.698±0.008    0.715±0.008    0.720±0.006

The proportion of data N used is randomly restricted, showing performance under various levels of data scarcity. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C).
Table 34: Summary N-Sensitivities for TEA on linear model with UKCF (Bold indicates best)

            N 1%           N 5%           N 20%          N 50%          N 100%
  PRC(I)    0.207±0.017    0.327±0.034    0.413±0.031    0.439±0.035    0.450±0.035
  PRC(C)    0.298±0.020    0.439±0.047    0.530±0.053    0.551±0.058    0.559±0.060
  ROC(I)    0.567±0.026    0.674±0.029    0.738±0.027    0.757±0.027    0.767±0.026
  ROC(C)    0.572±0.018    0.680±0.032    0.739±0.038    0.760±0.041    0.767±0.042

The proportion of data N used is randomly restricted, showing performance under various levels of data scarcity. PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

E.4 Contrived Negative Example

Here we give a contrived example where the goals of prediction and reconstruction are directly at odds with each other. Specifically, we show a setup where the "and" in compact and predictable representations is impossible to satisfy; that is, a compact target-representation that is more reconstructive ends up being less predictable. Consider the following (true) data generating process: latent variables p, u each consist of 10 independent dimensions, where p_i ~ N(0, 1) and u_i ~ N(0, 1); feature vectors x are also of length 10, and are linear in p; target vectors y = [y^P, y^U] are of length 50, where y^P is of length 10 and linear in p, and y^U is of length 40 and linear in u. See Figure 6 (a code sketch of this process is also given below). Note that the input features are inadequate for predicting all targets; our choice of lettering denotes elements that are in principle predictable (P), and those that are in principle unpredictable (U).

Figure 6: Data generating process for synthetic example. Latent vectors p, u (each of dimension 10) linearly generate feature vectors x and target vectors y. In this situation, y^P is in principle predictable from x, while y^U is impossible to predict.

Suppose this data generating process is unknown to us. First, consider what happens with direct prediction: a linear model would learn to predict y^P well, while predictions of y^U would be no better than random. So far, so good. Now, given the feature and target dimensions, consider the (not unreasonable) choice of a TEA with a latent dimension of 10. This is an obvious problem: during reconstruction, we naturally get more bang for our buck by encoding more of the (highly compressible) y^U instead of y^P; yet y^U is entirely useless to encode, as it is not predictable from the inputs anyway. Reconstructing well is therefore directly at odds with predicting well. This is certainly an extremely contrived scenario; nevertheless, without sufficient domain knowledge, it serves as a caveat that, as with feature-embedding paradigms, target-embedding is only as good as its assumptions.

Figure 7: Synthetic scenario where prediction is directly at odds with reconstruction. The prior (that we can leverage compact and predictable representations of targets) is hugely incorrect in this case; as a result, not only is target-autoencoding not beneficial, it is positively harmful. For TEAs, we observe that as the strength-of-prior coefficient λ increases, the overall prediction error actually increases (while the reconstruction error decreases).
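For concreteness, the following is a minimal sketch of this generating process. The specific random linear maps A, B, C are our own arbitrary choices, since the text fixes only the dimensions and the linearity.

```python
import numpy as np

# Sketch of the contrived data generating process of Section E.4.
rng = np.random.default_rng(0)
n = 1000
p = rng.normal(size=(n, 10))    # latent factors behind x and y_P
u = rng.normal(size=(n, 10))    # latent factors behind y_U only
A = rng.normal(size=(10, 10))   # p -> x    (features, dim 10)
B = rng.normal(size=(10, 10))   # p -> y_P  (predictable targets, dim 10)
C = rng.normal(size=(10, 40))   # u -> y_U  (unpredictable targets, dim 40)
x = p @ A
y = np.concatenate([p @ B, u @ C], axis=1)  # y = [y_P, y_U], dim 50
```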
Table 35: Results for TEA and comparators on linear model for negative example (Bold indicates best)

            Base           REG            FEA            TEA            F/TEA
  MSE       0.266±0.045    0.266±0.045    0.266±0.045    0.285±0.047    0.280±0.045
  MSE(U)    0.333±0.056    0.333±0.056    0.333±0.056    0.334±0.056    0.334±0.056
  MSE(P)    0.000±0.000    0.000±0.000    0.000±0.000    0.087±0.028    0.061±0.022

MSE metrics are further reported separately for targets that are in principle predictable (P) and unpredictable (U).

Table 36: λ-Sensitivities for TEA on linear model for negative example (Bold indicates best)

            λ=0            λ=0.01         λ=0.1          λ=0.5          λ=0.9
  MSE       0.266±0.045    0.267±0.045    0.271±0.046    0.285±0.047    0.292±0.051
  MSE(U)    0.333±0.056    0.333±0.056    0.334±0.056    0.334±0.056    0.334±0.056
  MSE(P)    0.000±0.000    0.003±0.005    0.022±0.009    0.087±0.028    0.125±0.063

MSE metrics are further reported separately for targets that are in principle predictable (P) and unpredictable (U).

E.5 Results from Open Discussion

One can also ask the (purely empirical) question of how much each model degrades on out-of-distribution data, without additional training to fine-tune the model to the new data. In this context, we actually have no reason to expect TEAs to degrade any more or less than comparators. For thoroughness, we show an additional experiment as an example of such a sensitivity analysis (using UKCF): each model is trained (only) on male patients and tested (only) on female patients, and vice versa. The average results on held-out samples from in-distribution and out-of-distribution data then allow us to compute the net degradation (i.e. the negative difference), which is reported below. While TEAs individually perform better overall on both in-distribution and out-of-distribution samples, none of the differences in the amounts of degradation between models are statistically significant:

Table 37: Performance degradation for TEA and comparators on linear model with UKCF (Bold indicates best)

            Base           REG            FEA            TEA            F/TEA
  ROC(I)    0.019±0.015    0.020±0.014    0.019±0.015    0.020±0.016    0.017±0.014
  ROC(C)    0.025±0.015    0.029±0.015    0.024±0.014    0.013±0.020    0.019±0.018
  PRC(I)    0.022±0.020    0.018±0.021    0.021±0.022    0.033±0.022    0.027±0.022
  PRC(C)    0.026±0.021    0.029±0.018    0.026±0.019    0.018±0.023    0.021±0.019

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result. Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.

Finally, given the staged training in Algorithm 1, it should be clear that the stages cannot be arbitrarily reordered. Stage 2 requires the encoder to already be trained in order to provide the requisite embeddings, so it must be preceded by Stage 1. Therefore the only relevant possibilities are:

(1) Stages 1-2 by themselves, without Stage 3; this is simply the "No Joint" setting.
(2) Stage 3 by itself, without Stages 1-2; this is simply the "No Staged" setting.
(3) None of the stages; this is simply the "Neither" setting.
(4) Stages 1, 2, and 3 in order; this is simply Algorithm 1 itself.

The only remaining possibility is to have Stage 3 precede Stages 1-2. This makes little sense, since training the reconstruction loss by itself afterwards is likely to undo the result of joint training (a sketch of the three stages is given below).
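To fix ideas about the stage orderings compared here, the following is a self-contained linear sketch of the three stages, closed-form where convenient. The roles of the stages follow the discussion above, while the toy data, dimensions, and update rules are illustrative assumptions rather than Algorithm 1 itself; reordering the stage blocks is what the "1-2", "3", "1-2-3", and "3-1-2" settings in Table 38 vary.

```python
import numpy as np

# Linear sketch of the three training stages (toy data; not Algorithm 1's
# actual updates).
rng = np.random.default_rng(0)
n, k, lam, lr = 500, 5, 0.5, 1e-3
X = rng.normal(size=(n, 10))
Y = X @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(n, 50))

# Stage 1: target autoencoder -- encoder E and decoder D (here via SVD of Y).
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
E, D = Vt[:k].T.copy(), Vt[:k].copy()

# Stage 2: regress features onto the pretrained target embeddings Z = Y E.
F = np.linalg.lstsq(X, Y @ E, rcond=None)[0]

# Stage 3: joint fine-tuning of the prediction and reconstruction losses.
for _ in range(200):
    R_pred = X @ F @ D - Y   # prediction residual
    R_rec = Y @ E @ D - Y    # reconstruction residual
    F -= lr * (1 - lam) * 2 * X.T @ R_pred @ D.T / n
    E -= lr * lam * 2 * Y.T @ R_rec @ D.T / n
    D -= lr * 2 * ((1 - lam) * (X @ F).T @ R_pred + lam * (Y @ E).T @ R_rec) / n
print("prediction MSE:", np.mean((X @ F @ D - Y) ** 2))
```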
For thoroughness, we run an additional sensitivity experiment (using UKCF) to confirm this. The following corresponds to the left half of Table 4, with an additional column on the right (and the other columns relabeled to reflect the training stages executed). Verifying our intuitions, the setting "3-1-2" behaves almost identically to the setting "1-2":

Table 38: Performance by training stages for TEA on linear model with UKCF (Bold indicates best); column headers indicate the sequence of training stages executed (note that 1-2-3 simply corresponds to Algorithm 1)

            None           1-2            3              1-2-3          3-1-2
  ROC(I)    0.710±0.072    0.747±0.022    0.764±0.022    0.767±0.026    0.749±0.022
  ROC(C)    0.700±0.075    0.744±0.038    0.766±0.038    0.767±0.042    0.747±0.037
  PRC(I)    0.347±0.085    0.402±0.026    0.431±0.031    0.450±0.035    0.404±0.027
  PRC(C)    0.433±0.083    0.507±0.040    0.543±0.054    0.559±0.060    0.512±0.042

PRC and ROC evaluations are reported separately for targets representing infections (I) and comorbidities (C). The two-sample t-test for a difference in means is conducted on the results. An asterisk next to the comparator result is used to indicate a statistically significant difference in means (p-value < 0.05) relative to the TEA result. Results are grouped over the temporal axis; note that the variance between splits is an artifact of this grouping.
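For reference, the significance test cited in the footnotes of Tables 37-38 can be sketched as follows; the per-split scores here are toy numbers for illustration, not the actual experimental results.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two-sample t-test for a difference in means between a comparator's
# per-split results and the TEA results (toy numbers only).
tea_scores = np.array([0.767, 0.771, 0.760, 0.769, 0.765])
comp_scores = np.array([0.749, 0.751, 0.744, 0.752, 0.748])
t_stat, p_value = ttest_ind(comp_scores, tea_scores)
print(f"t={t_stat:.2f}, p={p_value:.3f}  (asterisk if p < 0.05)")
```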