# Multitask Spectral Learning of Weighted Automata

Guillaume Rabusseau (McGill University, guillaume.rabusseau@mail.mcgill.ca), Borja Balle (Amazon Research Cambridge, pigem@amazon.co.uk), Joelle Pineau (McGill University, jpineau@cs.mcgill.ca)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

We consider the problem of estimating multiple related functions computed by weighted automata (WFAs). We first present a natural notion of relatedness between WFAs by considering to which extent several WFAs can share a common underlying representation. We then introduce the novel model of vector-valued WFA, which conveniently helps us formalize this notion of relatedness. Finally, we propose a spectral learning algorithm for vector-valued WFAs to tackle the multitask learning problem. By jointly learning multiple tasks in the form of a vector-valued WFA, our algorithm enforces the discovery of a representation space shared between tasks. The benefits of the proposed multitask approach are theoretically motivated and showcased through experiments on both synthetic and real world datasets.

1 Introduction

One common task in machine learning consists in estimating an unknown function $f : X \to Y$ from a training sample of input-output data $\{(x_i, y_i)\}_{i=1}^N$, where each $y_i \simeq f(x_i)$ is a (possibly noisy) estimate of $f(x_i)$. In multitask learning, the learner is given several such learning tasks $f_1, \dots, f_m$. It has been shown, both experimentally and theoretically, that learning related tasks simultaneously can lead to better performance than learning each task independently (see e.g. [1, 7] and references therein). Multitask learning has proven particularly useful when few data points are available for each task, or when it is difficult or costly to collect data for a target task while much data is available for related tasks (see e.g. [28] for an example in healthcare).

In this paper, we propose a multitask learning algorithm for the case where the input space $X$ consists of sequence data. Many tasks in natural language processing, computational biology, or reinforcement learning rely on estimating functions mapping sequences of observations to real numbers: e.g. inferring probability distributions over sentences in language modeling, or learning the dynamics of a model of the environment in reinforcement learning. In this case, the function $f$ to infer from training data is defined over the set $\Sigma^*$ of strings built on a finite alphabet $\Sigma$. Weighted finite automata (WFAs) are finite state machines that allow one to succinctly represent such functions. In particular, WFAs can compute any probability distribution defined by a hidden Markov model (HMM) [13] and can model the transition and observation behavior of partially observable Markov decision processes [26]. A recent line of work has led to the development of spectral methods for learning HMMs [17], WFAs [2, 4] and related models, offering an alternative to EM-based algorithms with the benefits of being computationally efficient and providing consistent estimators. Spectral learning algorithms have led to competitive results in the fields of natural language processing [12, 3] and robotics [8].

We consider the problem of multitask learning for WFAs.
As a motivational example, consider a natural language modeling task where one needs to make predictions in different contexts (e.g. online chat vs. newspaper articles) and has access to datasets in each of them; it is natural to expect that basic grammar is shared across the datasets and that one could benefit from learning these tasks simultaneously. The notion of relatedness between tasks can be expressed in different ways; one common assumption in multitask learning is that the multiple tasks share a common underlying representation [6, 11]. In this paper, we present a natural notion of shared representation between functions defined over strings, and we propose a learning algorithm that encourages the discovery of this shared representation. Intuitively, our notion of relatedness captures to which extent several functions can be computed by WFAs sharing a joint forward feature map. In order to formalize this notion of relatedness, we introduce the novel model of vector-valued WFA (vv-WFA), which generalizes WFAs to vector-valued functions and offers a natural framework in which to formalize the multitask learning problem. Given $m$ tasks $f_1, \dots, f_m : \Sigma^* \to \mathbb{R}$, we consider the function $\tilde{f} = [f_1, \dots, f_m] : \Sigma^* \to \mathbb{R}^m$ whose output for a given input string $x$ is the $m$-dimensional vector with entries $f_i(x)$ for $i = 1, \dots, m$. We show that the notion of a minimal vv-WFA computing $\tilde{f}$ exactly captures our notion of relatedness between tasks, and we prove that the dimension of such a minimal representation is equal to the rank of a flattening of the Hankel tensor of $\tilde{f}$ (Theorem 3). Leveraging this result, we design a spectral learning algorithm for vv-WFAs which constitutes a sound multitask learning algorithm for WFAs: by learning $\tilde{f}$ in the form of a vv-WFA, rather than independently learning a WFA for each task $f_i$, we implicitly enforce the discovery of a joint feature space shared among all tasks. After giving theoretical insight into the benefits of this multitask approach (by leveraging a recent result on asymmetric bounds for singular subspace estimation [9]), we conclude by showcasing these benefits with experiments on both synthetic and real world data.

Related work. Multitask learning for sequence data has previously received limited attention. In [16], mixtures of Markov chains are used to model dynamic user profiles. Tackling the multitask problem with nonparametric Bayesian methods is investigated in [15] to model related time series with Beta processes, and in [23] to discover relationships between related datasets using nested Dirichlet processes and infinite HMMs. Extending recurrent neural networks to the multitask setting has also recently received some interest (see e.g. [21, 22]). To the best of our knowledge, this paper constitutes the first attempt to tackle the multitask problem for the class of functions computed by general WFAs.

2 Preliminaries

We first review weighted automata, spectral learning of weighted automata, and tensors, starting with some notation. We denote by $\Sigma^*$ the set of strings on a finite alphabet $\Sigma$. The empty string is denoted by $\lambda$ and the length of a string $x$ by $|x|$. For any integer $k$ we let $[k] = \{1, 2, \dots, k\}$. We use lower case bold letters for vectors (e.g. $\mathbf{v} \in \mathbb{R}^{d_1}$), upper case bold letters for matrices (e.g. $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$) and bold calligraphic letters for higher order tensors (e.g. $\boldsymbol{\mathcal{T}} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$). The $i$th row (resp. column) of a matrix $\mathbf{M}$ is denoted by $\mathbf{M}_{i,:}$ (resp. $\mathbf{M}_{:,i}$); this notation is extended to slices of a tensor in the straightforward way.
Given a matrix $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$, we denote by $\mathbf{M}^\dagger$ its Moore-Penrose pseudo-inverse and by $\mathrm{vec}(\mathbf{M}) \in \mathbb{R}^{d_1 d_2}$ its vectorization.

Weighted finite automaton. A weighted finite automaton (WFA) with $n$ states is a tuple $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\omega})$ where $\boldsymbol{\alpha}, \boldsymbol{\omega} \in \mathbb{R}^n$ are the initial and final weight vectors respectively, and $\mathbf{A}^\sigma \in \mathbb{R}^{n \times n}$ is the transition matrix for each symbol $\sigma \in \Sigma$. A WFA computes a function $f_A : \Sigma^* \to \mathbb{R}$ defined for each word $x = x_1 x_2 \cdots x_k \in \Sigma^*$ by $f_A(x) = \boldsymbol{\alpha}^\top \mathbf{A}^{x_1} \mathbf{A}^{x_2} \cdots \mathbf{A}^{x_k} \boldsymbol{\omega}$. By letting $\mathbf{A}^x = \mathbf{A}^{x_1} \mathbf{A}^{x_2} \cdots \mathbf{A}^{x_k}$ for any word $x = x_1 x_2 \cdots x_k \in \Sigma^*$, we will often use the shorter notation $f_A(x) = \boldsymbol{\alpha}^\top \mathbf{A}^x \boldsymbol{\omega}$. A WFA $A$ with $n$ states is minimal if its number of states is minimal, i.e. any WFA $B$ such that $f_A = f_B$ has at least $n$ states. A function $f : \Sigma^* \to \mathbb{R}$ is recognizable if it can be computed by a WFA. In this case the rank of $f$ is the number of states of a minimal WFA computing $f$; if $f$ is not recognizable we let $\mathrm{rank}(f) = \infty$.

Hankel matrix. The Hankel matrix $\mathbf{H}_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ associated with a function $f : \Sigma^* \to \mathbb{R}$ is the bi-infinite matrix with entries $(\mathbf{H}_f)_{u,v} = f(uv)$ for all $u, v \in \Sigma^*$. The spectral learning algorithm for WFAs relies on the following fundamental relation between the rank of $f$ and the rank of $\mathbf{H}_f$.

Theorem 1 ([10, 14]). For any function $f : \Sigma^* \to \mathbb{R}$, $\mathrm{rank}(f) = \mathrm{rank}(\mathbf{H}_f)$.

Spectral learning. Showing that the rank of the Hankel matrix is upper bounded by the rank of $f$ is easy: given a WFA $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\omega})$ with $n$ states computing $f$, we have the rank $n$ factorization $\mathbf{H}_f = \mathbf{P}\mathbf{S}$, where the matrices $\mathbf{P} \in \mathbb{R}^{\Sigma^* \times n}$ and $\mathbf{S} \in \mathbb{R}^{n \times \Sigma^*}$ are defined by $\mathbf{P}_{u,:} = \boldsymbol{\alpha}^\top \mathbf{A}^u$ and $\mathbf{S}_{:,v} = \mathbf{A}^v \boldsymbol{\omega}$ for all $u, v \in \Sigma^*$. The converse is more tedious to show, but its proof is constructive in the sense that it allows one to build a WFA computing $f$ from any rank $n$ factorization of $\mathbf{H}_f$. This construction is the cornerstone of the spectral learning algorithm and is given in the following corollary.

Corollary 2 ([4, Lemma 4.1]). Let $f : \Sigma^* \to \mathbb{R}$ be a recognizable function of rank $n$, let $\mathbf{H} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ be its Hankel matrix, and for each $\sigma \in \Sigma$ let $\mathbf{H}^\sigma \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ be defined by $\mathbf{H}^\sigma_{u,v} = f(u\sigma v)$ for all $u, v \in \Sigma^*$. Then, for any $\mathbf{P} \in \mathbb{R}^{\Sigma^* \times n}$ and $\mathbf{S} \in \mathbb{R}^{n \times \Sigma^*}$ such that $\mathbf{H} = \mathbf{P}\mathbf{S}$, the WFA $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\omega})$ with $\boldsymbol{\alpha}^\top = \mathbf{P}_{\lambda,:}$, $\boldsymbol{\omega} = \mathbf{S}_{:,\lambda}$, and $\mathbf{A}^\sigma = \mathbf{P}^\dagger \mathbf{H}^\sigma \mathbf{S}^\dagger$ is a minimal WFA computing $f$.

In practice, finite sub-blocks of the Hankel matrices are used. Given finite sets of prefixes and suffixes $\mathcal{P}, \mathcal{S} \subset \Sigma^*$, let $\mathbf{H}_{\mathcal{P},\mathcal{S}}$ and $\{\mathbf{H}^\sigma_{\mathcal{P},\mathcal{S}}\}_{\sigma \in \Sigma}$ be the finite sub-blocks of $\mathbf{H}$ whose rows (resp. columns) are indexed by prefixes in $\mathcal{P}$ (resp. suffixes in $\mathcal{S}$). One can show that if $\mathcal{P}$ and $\mathcal{S}$ are such that $\lambda \in \mathcal{P} \cap \mathcal{S}$ and $\mathrm{rank}(\mathbf{H}) = \mathrm{rank}(\mathbf{H}_{\mathcal{P},\mathcal{S}})$, then the previous corollary still holds, i.e. a minimal WFA computing $f$ can be recovered from any rank $n$ factorization of $\mathbf{H}_{\mathcal{P},\mathcal{S}}$. The spectral method thus consists in estimating the matrices $\mathbf{H}_{\mathcal{P},\mathcal{S}}$ and $\mathbf{H}^\sigma_{\mathcal{P},\mathcal{S}}$ from training data (using e.g. empirical frequencies if $f$ is stochastic), finding a low-rank factorization of $\mathbf{H}_{\mathcal{P},\mathcal{S}}$ (using e.g. SVD), and constructing a WFA approximating $f$ using Corollary 2.

Tensors. We make only sporadic use of tensors in this paper, so we introduce just the necessary definitions and notations; more details can be found in [18]. A 3rd order tensor $\boldsymbol{\mathcal{T}} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ can be seen as a multidimensional array $(\boldsymbol{\mathcal{T}}_{i_1,i_2,i_3} : i_1 \in [d_1], i_2 \in [d_2], i_3 \in [d_3])$. The mode-$n$ fibers of $\boldsymbol{\mathcal{T}}$ are the vectors obtained by fixing all indices except the $n$th one, e.g. $\boldsymbol{\mathcal{T}}_{:,i_2,i_3} \in \mathbb{R}^{d_1}$. The mode-$n$ flattening of $\boldsymbol{\mathcal{T}}$ is the matrix having the mode-$n$ fibers of $\boldsymbol{\mathcal{T}}$ as columns, e.g. $\boldsymbol{\mathcal{T}}_{(1)} \in \mathbb{R}^{d_1 \times d_2 d_3}$. The mode-1 matrix product of a tensor $\boldsymbol{\mathcal{T}} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ and a matrix $\mathbf{X} \in \mathbb{R}^{m \times d_1}$ is the tensor of size $m \times d_2 \times d_3$ denoted by $\boldsymbol{\mathcal{T}} \times_1 \mathbf{X}$ and defined by the relation $\boldsymbol{\mathcal{Y}} = \boldsymbol{\mathcal{T}} \times_1 \mathbf{X} \Leftrightarrow \boldsymbol{\mathcal{Y}}_{(1)} = \mathbf{X} \boldsymbol{\mathcal{T}}_{(1)}$; the mode-$n$ product for $n = 2, 3$ is defined similarly.
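To make the preliminaries concrete, the following is a minimal numpy sketch of the computation of a WFA and of the recovery step of Corollary 2 applied to finite Hankel blocks. The function names and the convention that the empty prefix/suffix $\lambda$ is indexed first are illustrative assumptions of the sketch, not details from the paper.

```python
import numpy as np

def wfa_value(alpha, A, omega, word):
    """Evaluate f_A(x) = alpha^T A^{x_1} ... A^{x_k} omega for a word given as an iterable of symbols."""
    v = alpha
    for s in word:
        v = v @ A[s]              # compositional forward map: phi(x sigma) = phi(x) A^sigma
    return float(v @ omega)

def spectral_wfa(H, H_sigma, n):
    """Recover a rank-n WFA from finite Hankel blocks, as in Corollary 2.

    H       : |P| x |S| array with H[u, v] = f(uv); row/column 0 are assumed to index lambda.
    H_sigma : dict mapping each symbol to a |P| x |S| array with entries f(u sigma v).
    n       : number of states (the rank of f, or a chosen truncation rank).
    """
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    P = U[:, :n] * d[:n]          # rank-n factorization H ~ P S obtained from the (truncated) SVD
    S = Vt[:n, :]
    alpha = P[0, :]               # alpha^T = P_{lambda,:}
    omega = S[:, 0]               # omega   = S_{:,lambda}
    A = {s: np.linalg.pinv(P) @ Hs @ np.linalg.pinv(S) for s, Hs in H_sigma.items()}
    return alpha, A, omega
```

In the noisy setting described above, H would be an empirical estimate and n a truncation rank chosen on validation data.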
3 Vector-Valued WFAs for Multitask Learning

In this section, we present a notion of relatedness between WFAs, which we formalize by introducing the novel model of vector-valued weighted automaton. We then propose a multitask learning algorithm for WFAs by designing a spectral learning algorithm for vector-valued WFAs.

A notion of relatedness between WFAs. The basic idea behind our approach emerges from interpreting the computation of a WFA as a linear model in some feature space. Indeed, the computation of a WFA $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\omega})$ with $n$ states on a word $x \in \Sigma^*$ can be seen as first mapping $x$ to an $n$-dimensional feature vector through a compositional feature map $\phi : \Sigma^* \to \mathbb{R}^n$, and then applying a linear form in the feature space to obtain the final value $f_A(x) = \langle \phi(x), \boldsymbol{\omega} \rangle$. The feature map is defined by $\phi(x)^\top = \boldsymbol{\alpha}^\top \mathbf{A}^x$ for all $x \in \Sigma^*$, and it is compositional in the sense that for any $x \in \Sigma^*$ and any $\sigma \in \Sigma$ we have $\phi(x\sigma)^\top = \phi(x)^\top \mathbf{A}^\sigma$. We say that such a feature map is minimal if the linear space $V \subseteq \mathbb{R}^n$ spanned by the vectors $\{\phi(x)\}_{x \in \Sigma^*}$ is of dimension $n$. Theorem 1 implies that the dimension of $V$ is actually equal to the rank of $f_A$, showing that the notion of minimal feature map naturally coincides with the notion of minimal WFA.

A notion of relatedness between WFAs naturally arises by considering to which extent two (or more) WFAs can share a joint feature map $\phi$. More precisely, consider two recognizable functions $f_1, f_2 : \Sigma^* \to \mathbb{R}$ of ranks $n_1$ and $n_2$ respectively, with corresponding feature maps $\phi_1 : \Sigma^* \to \mathbb{R}^{n_1}$ and $\phi_2 : \Sigma^* \to \mathbb{R}^{n_2}$. A joint feature map for $f_1$ and $f_2$ always exists and is obtained by considering the direct sum $\phi_1 \oplus \phi_2 : \Sigma^* \to \mathbb{R}^{n_1 + n_2}$ that simply concatenates the feature vectors $\phi_1(x)$ and $\phi_2(x)$ for any $x \in \Sigma^*$. However, this feature map may not be minimal, i.e. there may exist another joint feature map of dimension $n < n_1 + n_2$. Intuitively, the smaller this minimal dimension $n$ is, the more related the two tasks are, with the two extremes being on the one hand $n = n_1 + n_2$, where the two tasks are independent, and on the other hand e.g. $n = n_1$, where one of the (minimal) feature maps $\phi_1, \phi_2$ is sufficient to predict both tasks.

Vector-valued WFA. We now introduce a computational model for vector-valued functions on strings that will help formalize this notion of relatedness between WFAs.

Definition 1. A $d$-dimensional vector-valued weighted finite automaton (vv-WFA) with $n$ states is a tuple $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\Omega})$ where $\boldsymbol{\alpha} \in \mathbb{R}^n$ is the initial weight vector, $\boldsymbol{\Omega} \in \mathbb{R}^{n \times d}$ is the matrix of final weights, and $\mathbf{A}^\sigma \in \mathbb{R}^{n \times n}$ is the transition matrix for each symbol $\sigma \in \Sigma$. A vv-WFA computes a function $f_A : \Sigma^* \to \mathbb{R}^d$ defined by $f_A(x) = \boldsymbol{\alpha}^\top \mathbf{A}^{x_1} \mathbf{A}^{x_2} \cdots \mathbf{A}^{x_k} \boldsymbol{\Omega}$ for each word $x = x_1 x_2 \cdots x_k \in \Sigma^*$.

We extend the notions of recognizability, minimality and rank to vv-WFAs in the straightforward way: a function $f : \Sigma^* \to \mathbb{R}^d$ is recognizable if it can be computed by a vv-WFA, a vv-WFA is minimal if its number of states is minimal, and the rank of $f$ is the number of states of a minimal vv-WFA computing $f$. A $d$-dimensional vv-WFA can be seen as a collection of $d$ WFAs that all share their initial vectors and transition matrices but have different final vectors. Alternatively, one could take a dual approach and define vv-WFAs as a collection of WFAs sharing transitions and final vectors (both definitions performed similarly in multitask experiments on the dataset used in Section 5.2; we thus chose multiple final vectors as a convention).
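As a concrete reading of Definition 1, before returning to relatedness, the sketch below evaluates a vv-WFA on a word; the toy dimensions and random parameters are purely illustrative and not taken from the paper.

```python
import numpy as np

def vv_wfa_value(alpha, A, Omega, word):
    """f_A(x) = alpha^T A^{x_1} ... A^{x_k} Omega, an R^d-valued output as in Definition 1."""
    v = alpha                      # shared forward state, common to all d tasks
    for s in word:
        v = v @ A[s]               # shared transition matrices
    return v @ Omega               # task-specific final weights: entry i is f_i(x)

# a 2-dimensional vv-WFA with 3 states over the alphabet {'a', 'b'} (toy numbers)
rng = np.random.default_rng(0)
alpha = rng.standard_normal(3)
A = {'a': rng.standard_normal((3, 3)), 'b': rng.standard_normal((3, 3))}
Omega = rng.standard_normal((3, 2))
print(vv_wfa_value(alpha, A, Omega, "ab"))   # entries are f_1("ab") and f_2("ab")
```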
vv-WFAs and relatedness between WFAs. We now show how the vv-WFA model naturally captures the notion of relatedness presented above. Recall that this notion intends to capture to which extent two recognizable functions $f_1, f_2 : \Sigma^* \to \mathbb{R}$, of ranks $n_1$ and $n_2$ respectively, can share a joint forward feature map $\phi : \Sigma^* \to \mathbb{R}^n$ satisfying $f_1(x) = \langle \phi(x), \boldsymbol{\omega}_1 \rangle$ and $f_2(x) = \langle \phi(x), \boldsymbol{\omega}_2 \rangle$ for all $x \in \Sigma^*$, for some $\boldsymbol{\omega}_1, \boldsymbol{\omega}_2 \in \mathbb{R}^n$. Consider the vector-valued function $\tilde{f} = [f_1, f_2] : \Sigma^* \to \mathbb{R}^2$ defined by $\tilde{f}(x) = [f_1(x), f_2(x)]$ for all $x \in \Sigma^*$. It can easily be seen that the minimal dimension of a shared forward feature map between $f_1$ and $f_2$ is exactly the rank of $\tilde{f}$, i.e. the number of states of a minimal vv-WFA computing $\tilde{f}$. This notion of relatedness can be generalized to more than two functions by considering $\tilde{f} = [f_1, \dots, f_m]$ for $m$ recognizable functions $f_1, \dots, f_m$ of respective ranks $n_1, \dots, n_m$. In this setting, it is easy to check that the rank of $\tilde{f}$ lies between $\max(n_1, \dots, n_m)$ and $n_1 + \dots + n_m$; a smaller value of this rank leads to a smaller dimension of the minimal forward feature map and thus, intuitively, to more closely related tasks. We now formalize this measure of relatedness between recognizable functions.

Definition 2. Given $m$ recognizable functions $f_1, \dots, f_m$, we define their relatedness measure by
$$\tau(f_1, \dots, f_m) = 1 - \frac{\mathrm{rank}(\tilde{f}) - \max_i \mathrm{rank}(f_i)}{\sum_i \mathrm{rank}(f_i)}, \quad \text{where } \tilde{f} = [f_1, \dots, f_m].$$

One can check that this measure of relatedness takes its values in $(0, 1]$. We say that tasks are maximally related when their relatedness measure is 1 and independent when it is minimal. Observe that the rank $R$ of a vv-WFA does not give enough information to determine whether one set of tasks is more related than another: the degree of relatedness depends on the relation between $R$ and the ranks of the individual tasks. The relatedness parameter $\tau$ circumvents this issue by measuring where $R$ stands between the maximum rank over the different tasks and the sum of their ranks.

Example 1. Let $\Sigma = \{a, b, c\}$ and let $|x|_\sigma$ denote the number of occurrences of $\sigma$ in $x$ for any $\sigma \in \Sigma$. Consider the functions defined by $f_1(x) = 0.5|x|_a + 0.5|x|_b$, $f_2(x) = 0.3|x|_b - 0.6|x|_c$ and $f_3(x) = |x|_c$ for all $x \in \Sigma^*$. It is easy to check that $\mathrm{rank}(f_1) = \mathrm{rank}(f_2) = 4$ and $\mathrm{rank}(f_3) = 2$. Moreover, $f_2$ and $f_3$ are maximally related (indeed $\mathrm{rank}([f_2, f_3]) = 4 = \mathrm{rank}(f_2)$, thus $\tau(f_2, f_3) = 1$), $f_1$ and $f_3$ are independent (indeed $\tau(f_1, f_3) = 2/3$ is minimal since $\mathrm{rank}([f_1, f_3]) = 6 = \mathrm{rank}(f_1) + \mathrm{rank}(f_3)$), and $f_1$ and $f_2$ are related but not maximally related (since $4 = \mathrm{rank}(f_1) = \mathrm{rank}(f_2) < \mathrm{rank}([f_1, f_2]) = 6 < \mathrm{rank}(f_1) + \mathrm{rank}(f_2) = 8$).

Spectral learning of vv-WFAs. We now design a spectral learning algorithm for vv-WFAs. Given a function $f : \Sigma^* \to \mathbb{R}^d$, we define its Hankel tensor $\boldsymbol{\mathcal{H}} \in \mathbb{R}^{\Sigma^* \times d \times \Sigma^*}$ by $\boldsymbol{\mathcal{H}}_{u,:,v} = f(uv)$ for all $u, v \in \Sigma^*$. We first show in Theorem 3 (whose proof can be found in the supplementary material) that the fundamental relation between the rank of a function and the rank of its Hankel matrix naturally extends to the vector-valued case. Compared with Theorem 1, the Hankel matrix is now replaced by the mode-1 flattening $\boldsymbol{\mathcal{H}}_{(1)}$ of the Hankel tensor (which can be obtained by concatenating the matrices $\boldsymbol{\mathcal{H}}_{:,i,:}$ along the horizontal axis).

Theorem 3 (Vector-valued Fliess theorem). Let $f : \Sigma^* \to \mathbb{R}^d$ and let $\boldsymbol{\mathcal{H}}$ be its Hankel tensor. Then $\mathrm{rank}(f) = \mathrm{rank}(\boldsymbol{\mathcal{H}}_{(1)})$.
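To make the flattening in Theorem 3 concrete, the toy sketch below builds a finite block of the Hankel tensor of $\tilde{f}$ from per-task Hankel blocks that share a column space (mimicking maximally related tasks) and checks that its mode-1 flattening is the horizontal stacking of the per-task blocks. All dimensions and the random construction are assumptions of the illustration; the exact column order of the flattening is a convention and does not affect the rank.

```python
import numpy as np

rng = np.random.default_rng(1)
m, nP, nS, r = 3, 50, 50, 4

# m maximally related tasks: their Hankel blocks share the same column space (same forward map)
P_shared = rng.standard_normal((nP, r))
H_blocks = [P_shared @ rng.standard_normal((r, nS)) for _ in range(m)]

# finite block of the Hankel tensor H in R^{|P| x m x |S|}, with H[u, i, v] = f_i(uv)
H = np.stack(H_blocks, axis=1)

# mode-1 flattening: concatenate the slices H[:, i, :] along the horizontal axis
H1 = H.reshape(nP, m * nS)
assert np.allclose(H1, np.hstack([H[:, i, :] for i in range(m)]))

# Theorem 3 (toy check): the rank stays r rather than m*r because the tasks share a feature map
print(np.linalg.matrix_rank(H1))   # -> 4
```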
Similarly to the scalar-valued case, this theorem can be leveraged to design a spectral learning algorithm for vv-WFAs. The following corollary (whose proof can be found in the supplementary material) shows how a vv-WFA computing a recognizable function $f : \Sigma^* \to \mathbb{R}^d$ of rank $n$ can be recovered from any rank $n$ factorization of its Hankel tensor.

Corollary 4. Let $f : \Sigma^* \to \mathbb{R}^d$ be a recognizable function of rank $n$, let $\boldsymbol{\mathcal{H}} \in \mathbb{R}^{\Sigma^* \times d \times \Sigma^*}$ be its Hankel tensor, and for each $\sigma \in \Sigma$ let $\boldsymbol{\mathcal{H}}^\sigma \in \mathbb{R}^{\Sigma^* \times d \times \Sigma^*}$ be defined by $\boldsymbol{\mathcal{H}}^\sigma_{u,:,v} = f(u\sigma v)$ for all $u, v \in \Sigma^*$. Then, for any $\mathbf{P} \in \mathbb{R}^{\Sigma^* \times n}$ and $\boldsymbol{\mathcal{S}} \in \mathbb{R}^{n \times d \times \Sigma^*}$ such that $\boldsymbol{\mathcal{H}} = \boldsymbol{\mathcal{S}} \times_1 \mathbf{P}$, the vv-WFA $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\Omega})$ defined by $\boldsymbol{\alpha}^\top = \mathbf{P}_{\lambda,:}$, $\boldsymbol{\Omega} = \boldsymbol{\mathcal{S}}_{:,:,\lambda}$, and $\mathbf{A}^\sigma = \mathbf{P}^\dagger \boldsymbol{\mathcal{H}}^\sigma_{(1)} (\boldsymbol{\mathcal{S}}_{(1)})^\dagger$ is a minimal vv-WFA computing $f$.

Similarly to the scalar-valued case, one can check that the previous corollary also holds for any finite sub-tensors $\boldsymbol{\mathcal{H}}_{\mathcal{P},\mathcal{S}}$, $\{\boldsymbol{\mathcal{H}}^\sigma_{\mathcal{P},\mathcal{S}}\}_{\sigma \in \Sigma}$ of $\boldsymbol{\mathcal{H}}$ indexed by prefixes and suffixes in $\mathcal{P}, \mathcal{S} \subset \Sigma^*$, whenever $\mathcal{P}$ and $\mathcal{S}$ are such that $\lambda \in \mathcal{P} \cap \mathcal{S}$ and $\mathrm{rank}(\boldsymbol{\mathcal{H}}_{(1)}) = \mathrm{rank}((\boldsymbol{\mathcal{H}}_{\mathcal{P},\mathcal{S}})_{(1)})$; we call such a basis $(\mathcal{P}, \mathcal{S})$ complete. The spectral learning algorithm for vv-WFAs then consists in estimating these Hankel tensors from training data and using Corollary 4 to recover a vv-WFA approximating the target function. Of course, a noisy estimate $\hat{\boldsymbol{\mathcal{H}}}$ of the Hankel tensor will not be of low rank, and the factorization $\hat{\boldsymbol{\mathcal{H}}} = \boldsymbol{\mathcal{S}} \times_1 \mathbf{P}$ should only be performed approximately in order to counter the presence of noise. In practice, a low rank approximation of $\hat{\boldsymbol{\mathcal{H}}}_{(1)}$ is obtained using truncated SVD.

Multitask learning of WFAs. Let us now go back to the multitask learning problem and let $f_1, \dots, f_m : \Sigma^* \to \mathbb{R}$ be the functions we wish to infer in the form of WFAs. The spectral learning algorithm for vv-WFAs naturally suggests a way to tackle this multitask problem: by learning $\tilde{f} = [f_1, \dots, f_m]$ in the form of a vv-WFA, rather than independently learning a WFA for each task $f_i$, we implicitly enforce the discovery of a joint forward feature map shared among all tasks. We will now see how a further step can be added to this learning scheme to make it more robust to noise.

The motivation for this additional step comes from the observation that even though a $d$-dimensional vv-WFA $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\Omega})$ may be minimal, the corresponding scalar-valued WFAs $A_i = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\Omega}_{:,i})$ for $i \in [d]$ may not be. Suppose for example that $A_1$ is not minimal. This implies that some part of its state space does not contribute to the function $f_1$ but comes from asking for a state representation rich enough to predict the other tasks as well. Moreover, when one learns a vv-WFA from noisy estimates of the Hankel tensors, the rank $R$ approximation $\hat{\boldsymbol{\mathcal{H}}}_{(1)} \simeq \mathbf{P}\boldsymbol{\mathcal{S}}_{(1)}$ essentially annihilates the noise contained in the space orthogonal to the top $R$ singular vectors of $\hat{\boldsymbol{\mathcal{H}}}_{(1)}$, but when the WFA $A_1$ has rank $R_1 < R$ we intuitively see that there is still a subspace of dimension $R - R_1$ containing only irrelevant features. In order to circumvent this issue, we would like to project the (scalar-valued) WFAs $A_i$ down to their true dimensions, intuitively enforcing each predictor to use as few features as possible for each task, and thus annihilating the noise lying in the corresponding irrelevant subspaces. To achieve this we make use of the following proposition, which gives explicitly the projections needed to obtain minimal scalar-valued WFAs from a given vv-WFA (the proof is given in the supplementary material).

Proposition 1. Let $f : \Sigma^* \to \mathbb{R}^d$ be a function computed by a minimal vv-WFA $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\Omega})$ with $n$ states and let $\mathcal{P}, \mathcal{S} \subset \Sigma^*$ be a complete basis for $f$. For any $i \in [d]$, let $f_i : \Sigma^* \to \mathbb{R}$ be defined by $f_i(x) = f(x)_i$ for all $x \in \Sigma^*$ and let $n_i$ denote the rank of $f_i$.
Let $\mathbf{P} \in \mathbb{R}^{\mathcal{P} \times n}$ be defined by $\mathbf{P}_{x,:} = \boldsymbol{\alpha}^\top \mathbf{A}^x$ for all $x \in \mathcal{P}$ and, for $i \in [d]$, let $\mathbf{H}_i \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$ be the Hankel matrix of $f_i$ and let $\mathbf{H}_i = \mathbf{U}_i \mathbf{D}_i \mathbf{V}_i^\top$ be its thin SVD (i.e. $\mathbf{D}_i \in \mathbb{R}^{n_i \times n_i}$). Then, for any $i \in [d]$, the WFA $A_i = (\boldsymbol{\alpha}_i, \{\mathbf{A}^\sigma_i\}_{\sigma \in \Sigma}, \boldsymbol{\omega}_i)$ defined by $\boldsymbol{\alpha}_i^\top = \boldsymbol{\alpha}^\top \mathbf{P}^\dagger \mathbf{U}_i$, $\boldsymbol{\omega}_i = \mathbf{U}_i^\top \mathbf{P} \boldsymbol{\Omega}_{:,i}$ and $\mathbf{A}^\sigma_i = \mathbf{U}_i^\top \mathbf{P} \mathbf{A}^\sigma \mathbf{P}^\dagger \mathbf{U}_i$ for each $\sigma \in \Sigma$, is a minimal WFA computing $f_i$.

Given noisy estimates $\hat{\boldsymbol{\mathcal{H}}}, \{\hat{\boldsymbol{\mathcal{H}}}^\sigma\}_{\sigma \in \Sigma}$ of the Hankel tensors of a function $f$, an estimate $R$ of the rank of $f$ and estimates $R_i$ of the ranks of the $f_i$'s, the first step of the learning algorithm consists in applying Corollary 4 to the factorization $\hat{\boldsymbol{\mathcal{H}}}_{(1)} \simeq \mathbf{U}(\mathbf{D}\mathbf{V}^\top)$ obtained by truncated SVD to get a vv-WFA $A$ approximating $f$. Then, Proposition 1 can be used to project down each WFA $A_i$ by estimating $\mathbf{U}_i$ with the top $R_i$ left singular vectors of $\hat{\boldsymbol{\mathcal{H}}}_{:,i,:}$. The overall procedure for our multitask spectral learning algorithm (MT-SL) is summarized in Algorithm 1, where lines 1-3 correspond to the vv-WFA estimation and lines 4-7 correspond to projecting down the corresponding scalar-valued WFAs. To further motivate the projection step, consider the case where the $m$ tasks are completely unrelated and each of them requires $n$ states. Single-task learning would lead to models with $O(|\Sigma| m n^2)$ parameters overall, while the multitask approach would return a larger model of size $O(|\Sigma| (mn)^2)$; the projection step eliminates this redundancy.

Algorithm 1 MT-SL: spectral learning of vector-valued WFAs for multitask learning

Input: Empirical Hankel tensors $\hat{\boldsymbol{\mathcal{H}}}, \{\hat{\boldsymbol{\mathcal{H}}}^\sigma\}_{\sigma \in \Sigma}$ of size $|\mathcal{P}| \times m \times |\mathcal{S}|$ for the target function $\tilde{f} = [f_1, \dots, f_m]$ (where $\mathcal{P}, \mathcal{S} \subset \Sigma^*$ both contain $\lambda$), a common rank $R$, and task-specific ranks $R_i$ for $i \in [m]$.
Output: WFAs $A_i$ approximating $f_i$ for each $i \in [m]$.
1: Compute the rank $R$ truncated SVD $\hat{\boldsymbol{\mathcal{H}}}_{(1)} \simeq \mathbf{U}\mathbf{D}\mathbf{V}^\top$.
2: Let $A = (\boldsymbol{\alpha}, \{\mathbf{A}^\sigma\}_{\sigma \in \Sigma}, \boldsymbol{\Omega})$ be the vv-WFA defined by $\boldsymbol{\alpha}^\top = \mathbf{U}_{\lambda,:}$, $\boldsymbol{\Omega} = \mathbf{U}^\top \hat{\boldsymbol{\mathcal{H}}}_{:,:,\lambda}$ and $\mathbf{A}^\sigma = \mathbf{U}^\top \hat{\boldsymbol{\mathcal{H}}}^\sigma_{(1)} (\hat{\boldsymbol{\mathcal{H}}}_{(1)})^\dagger \mathbf{U}$ for each $\sigma \in \Sigma$.
3: for $i = 1$ to $m$ do
4:   Compute the rank $R_i$ truncated SVD $\hat{\boldsymbol{\mathcal{H}}}_{:,i,:} \simeq \mathbf{U}_i \mathbf{D}_i \mathbf{V}_i^\top$.
5:   Let $A_i = (\mathbf{U}_i^\top \mathbf{U} \boldsymbol{\alpha}, \{\mathbf{U}_i^\top \mathbf{U} \mathbf{A}^\sigma \mathbf{U}^\top \mathbf{U}_i\}_{\sigma \in \Sigma}, \mathbf{U}_i^\top \mathbf{U} \boldsymbol{\Omega}_{:,i})$.
6: end for
7: return $A_1, \dots, A_m$.

4 Theoretical Analysis

Computational complexity. The computational cost of the classical spectral learning algorithm (SL) is in $O(N + R|\mathcal{P}||\mathcal{S}| + R^2|\mathcal{P}||\Sigma|)$, where the first term corresponds to estimating the Hankel matrices from a sample of size $N$, the second to the rank $R$ truncated SVD, and the third to computing the transition matrices $\mathbf{A}^\sigma$. In comparison, the computational cost of MT-SL is in $O(mN + (mR + \sum_i R_i)|\mathcal{P}||\mathcal{S}| + (mR^2 + \sum_i R_i^2)|\mathcal{P}||\Sigma|)$, showing that the increase in complexity is essentially linear in the number of tasks $m$.

Robustness in subspace estimation. In order to give some theoretical insight into the potential benefits of MT-SL, let us consider the simple case where the tasks are maximally related with common rank $R = R_1 = \dots = R_m$. Let $\hat{\mathbf{H}}_1, \dots, \hat{\mathbf{H}}_m \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$ be the empirical Hankel matrices for the $m$ tasks and let $\mathbf{E}_i = \hat{\mathbf{H}}_i - \mathbf{H}_i$ be the error terms, where $\mathbf{H}_i$ is the true Hankel matrix of the $i$th task. Then the flattening $\hat{\mathbf{H}} = \hat{\boldsymbol{\mathcal{H}}}_{(1)} \in \mathbb{R}^{|\mathcal{P}| \times m|\mathcal{S}|}$ (resp. $\mathbf{H} = \boldsymbol{\mathcal{H}}_{(1)}$) can be obtained by stacking the matrices $\hat{\mathbf{H}}_i$ (resp. $\mathbf{H}_i$) along the horizontal axis. Consider the problem of learning the first task. One key step of both SL and MT-SL resides in estimating the left singular subspace of $\mathbf{H}_1$ and $\mathbf{H}$ respectively from their noisy estimates. When the tasks are maximally related, this space $U$ is the same for $\mathbf{H}$ and for $\mathbf{H}_1, \dots, \mathbf{H}_m$, and we intuitively see that the benefits of MT-SL will stem from the fact that the SVD of $\hat{\mathbf{H}}$ should lead to a more accurate estimate of $U$ than the one relying solely on $\hat{\mathbf{H}}_1$.
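This intuition can be checked with a quick simulation. The sketch below (with Gaussian noise, a random rank-$R$ signal, and arbitrary toy dimensions, all assumptions of the illustration rather than the paper's setup) compares the left-singular-subspace estimate obtained from a single noisy block with the one obtained from the horizontal stacking of $m$ blocks sharing the same left singular subspace.

```python
import numpy as np

def proj_top_left(M, R):
    """Orthogonal projector onto the span of the top-R left singular vectors of M."""
    U = np.linalg.svd(M, full_matrices=False)[0][:, :R]
    return U @ U.T

rng = np.random.default_rng(2)
d1, d2, R, m, noise = 100, 100, 5, 10, 0.05

# m "Hankel blocks" sharing the same left singular subspace (maximally related tasks)
U_true = np.linalg.qr(rng.standard_normal((d1, R)))[0]
blocks = [U_true @ rng.standard_normal((R, d2)) for _ in range(m)]
noisy = [B + noise * rng.standard_normal((d1, d2)) for B in blocks]

Pi_true = U_true @ U_true.T
err_single = np.linalg.norm(Pi_true - proj_top_left(noisy[0], R))        # SVD of one block
err_stacked = np.linalg.norm(Pi_true - proj_top_left(np.hstack(noisy), R))  # SVD of the stacking
print(err_single, err_stacked)   # the stacked estimate is typically more accurate
```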
It is also intuitive to see that, since the Hankel matrices $\hat{\mathbf{H}}_i$ are stacked horizontally, the estimation of the right singular subspace might not benefit from performing the SVD on $\hat{\mathbf{H}}$. However, classical results on singular subspace estimation (see e.g. [29, 20]) provide uniform bounds for both left and right singular subspaces (i.e. bounds on the maximum of the estimation errors for the left and right spaces). To circumvent this issue, we use a recent result on rate-optimal asymmetric perturbation bounds for left and right singular spaces [9] to obtain the following theorem, relating the ratio between the dimensions of a matrix to the quality of the subspace estimation provided by SVD (the proof can be found in the supplementary material).

Theorem 5. Let $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ be of rank $R$ and let $\hat{\mathbf{M}} = \mathbf{M} + \mathbf{E}$, where $\mathbf{E}$ is a random noise term such that $\mathrm{vec}(\mathbf{E}) / \|\mathbf{E}\|_F$ follows a uniform distribution on the unit sphere in $\mathbb{R}^{d_1 d_2}$. Let $\Pi_U, \Pi_{\hat{U}} \in \mathbb{R}^{d_1 \times d_1}$ be the matrices of the orthogonal projections onto the spaces spanned by the top $R$ left singular vectors of $\mathbf{M}$ and $\hat{\mathbf{M}}$ respectively. Let $\delta > 0$, let $\alpha = s_R(\mathbf{M})$ be the smallest non-zero singular value of $\mathbf{M}$, and suppose that $\|\mathbf{E}\|_F \le \alpha/2$. Then, with probability at least $1 - \delta$,
$$\big\|\Pi_U - \Pi_{\hat{U}}\big\|_F \;\le\; \frac{4\big(\sqrt{(d_1 - R)R} + \sqrt{2\log(1/\delta)}\big)}{\sqrt{d_1 d_2}} \cdot \frac{\|\mathbf{E}\|_F}{\alpha} \;+\; \frac{\|\mathbf{E}\|_F^2}{\alpha^2}.$$

A few remarks on this theorem are in order. First, the Frobenius norm of the difference between the projection matrices measures the distance between the two subspaces (it is in fact proportional to the classical sin-theta distance between subspaces). Second, the assumption $\|\mathbf{E}\|_F \le \alpha/2$ corresponds to the magnitude of the noise being small compared to the magnitude of $\mathbf{M}$ (in particular it implies $\|\mathbf{E}\|_F / \alpha < 1$); this is a reasonable and common assumption in subspace identification problems, see e.g. [30]. Lastly, as $d_2$ grows the first term in the upper bound becomes irrelevant and the error is dominated by the quadratic term, which decreases with $\|\mathbf{E}\|_F$ faster than in classical results. Intuitively, this tells us that there is a first regime where growing $d_2$ (i.e. adding more tasks) is beneficial, up to the point where the quadratic term dominates (and where the bound becomes essentially independent of $d_2$).

Going back to the ability of MT-SL to leverage information from related tasks, let $\mathbf{E} \in \mathbb{R}^{|\mathcal{P}| \times m|\mathcal{S}|}$ be the matrix obtained by stacking the noise matrices $\mathbf{E}_i$ along the horizontal axis. If we assume that the entries of the error terms $\mathbf{E}_i$ are i.i.d. from e.g. a normal distribution, we can apply the previous result to the left singular subspaces of $\hat{\boldsymbol{\mathcal{H}}}_{(1)}$ and $\boldsymbol{\mathcal{H}}_{(1)}$. One can check that in this case we have $\|\mathbf{E}\|_F^2 = \sum_{i=1}^m \|\mathbf{E}_i\|_F^2$ and $\alpha^2 = s_R(\mathbf{H})^2 \ge \sum_{i=1}^m s_R(\mathbf{H}_i)^2$ (since $R = R_1 = \dots = R_m$ when tasks are maximally related). Thus, if the norms of the noise terms $\mathbf{E}_i$ are roughly the same, and so are the smallest non-zero singular values of the matrices $\mathbf{H}_i$, we get $\|\mathbf{E}\|_F / \alpha \in O(\|\mathbf{E}_1\|_F / s_R(\mathbf{H}_1))$. Hence, given enough tasks, the estimation error of the left singular subspace of $\mathbf{H}_1$ in the multitask setting (i.e. by performing SVD on $\hat{\boldsymbol{\mathcal{H}}}_{(1)}$) is intuitively in $O(\|\mathbf{E}_1\|_F^2 / s_R(\mathbf{H}_1)^2)$, while it is only in $O(\|\mathbf{E}_1\|_F / s_R(\mathbf{H}_1))$ when relying solely on $\hat{\mathbf{H}}_1$, which shows the potential benefit of MT-SL. Indeed, as the amount of training data increases, the error in the estimated matrices decreases; thus $T = \|\mathbf{E}_1\|_F / s_R(\mathbf{H}_1)$ goes to 0, and an error of order $O(T^2)$ decays faster than one of order $O(T)$.

5 Experiments

We evaluate the performance of the proposed multitask learning method (MT-SL) on both synthetic and real world data.
We use two performance metrics: perplexity per character on a test set $T$, defined by $\mathrm{perp}(h) = 2^{-\frac{1}{M}\sum_{x \in T} \log_2 h(x)}$, where $M$ is the number of symbols in the test set and $h$ is the hypothesis, and word error rate (WER), which measures the proportion of mis-predicted symbols averaged over all prefixes in the test set (when the most likely next symbol is predicted). Both experiments are in a stochastic setting, i.e. the functions to be learned are probability distributions, and they explore the regime where the learner has access to a small training sample drawn from the target task while larger training samples are available for related tasks. We compare MT-SL with the classical spectral learning method (SL) for WFAs (note that SL has been extensively compared to EM and n-gram models in the literature, see e.g. [4] and [5] and references therein). For both methods, the prefix set $\mathcal{P}$ (resp. suffix set $\mathcal{S}$) is chosen by taking the 1,000 most frequent prefixes (resp. suffixes) in the training data of the target task, and the values of the ranks are chosen using a validation set.

5.1 Synthetic Data

We first assess the validity of MT-SL on synthetic data. We randomly generated stochastic WFAs using the process used for the PAutomaC competition [27], with symbol sparsity 0.4 and transition sparsity 0.15, for an alphabet $\Sigma$ of size 10. We generated related WFAs sharing a joint feature space of dimension $d_S = 10$, each having a task-specific feature space of dimension $d_T$, i.e. for $m$ tasks $f_1, \dots, f_m$, each WFA computing $f_i$ has rank $d_S + d_T$ and the vv-WFA computing $\tilde{f} = [f_1, \dots, f_m]$ has rank $d_S + m d_T$. More precisely, we first generate a probabilistic automaton (PA) $A_S = (\boldsymbol{\alpha}_S, \{\mathbf{A}^\sigma_S\}_{\sigma \in \Sigma}, \boldsymbol{\omega}_S)$ with $d_S$ states. Then, for each task $i = 1, \dots, m$, we generate a second PA $A_T = (\boldsymbol{\alpha}_T, \{\mathbf{A}^\sigma_T\}_{\sigma \in \Sigma}, \boldsymbol{\omega}_T)$ with $d_T$ states and a random vector $\tilde{\boldsymbol{\omega}} \in [0, 1]^{d_S + d_T}$; both PAs are generated using the process described in [27]. The task $f_i$ is then obtained as the distribution computed by the stochastic WFA $(\boldsymbol{\alpha}_S \oplus \boldsymbol{\alpha}_T, \{\mathbf{A}^\sigma_S \oplus \mathbf{A}^\sigma_T\}_{\sigma \in \Sigma}, \boldsymbol{\omega})$ with $\boldsymbol{\omega} = \tilde{\boldsymbol{\omega}} / Z$, where the constant $Z$ is chosen such that $\sum_{x \in \Sigma^*} f_i(x) = 1$.

[Figure 1: Comparison (on synthetic data) between the spectral learning algorithm (SL) and our multitask algorithm (MT-SL) for different numbers of tasks and different degrees of relatedness between the tasks: $d_S$ is the dimension of the space shared by all tasks and $d_T$ that of the task-specific space (see text for details). Panels show word error rate for $d_S = 10$ and $d_T \in \{0, 5, 10\}$; curves compare the true model, SL, and MT-SL with 2, 4 and 8 tasks.]

We generated 3 sets of WFAs for task-specific dimensions $d_T = 0, 5, 10$. The learner had access to training samples of size 5,000 drawn from each related task $f_2, \dots, f_m$ and to a training sample of size ranging from 50 to 5,000 drawn from the target task $f_1$. Results on a test set of size 1,000, averaged over 10 runs, are reported in Figure 1. For both evaluation measures, when the task-specific dimension is small compared to the dimension of the joint feature space, i.e. $d_T = 0, 5$, MT-SL clearly outperforms SL, which only relies on the target task data. Moreover, increasing the number of related tasks tends to improve the performance of MT-SL. However, when $d_S = d_T = 10$, MT-SL and SL perform similarly in terms of perplexity and WER, showing that the multitask approach offers no benefit when the tasks are too loosely related.
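For reference, the two evaluation metrics used throughout this section can be computed as sketched below, assuming a hypothesis exposing string probabilities $h(x)$ and next-symbol probabilities. These interfaces, and the treatment of the end-of-string event as an extra symbol, are assumptions of the sketch, i.e. one reasonable reading of the definitions above rather than details from the paper.

```python
import numpy as np

def perplexity_per_char(h, test_set):
    """perp(h) = 2 ** ( -(1/M) * sum_{x in T} log2 h(x) ), with M the number of symbols in T."""
    M = sum(len(x) for x in test_set)
    return 2.0 ** (-sum(np.log2(h(x)) for x in test_set) / M)

def word_error_rate(next_symbol_probs, test_set, eos="$"):
    """Fraction of mis-predicted next symbols over all prefixes of the test strings."""
    errors, total = 0, 0
    for x in test_set:
        targets = list(x) + [eos]                # predict every symbol, then the end-of-string event
        for t, target in enumerate(targets):
            probs = next_symbol_probs(x[:t])     # assumed interface: dict symbol -> probability
            errors += (max(probs, key=probs.get) != target)
            total += 1
    return errors / total
```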
Additional experimental results for the case of totally unrelated tasks ($d_S = 0$, $d_T = 10$), as well as comparisons with MT-SL without the projection step (i.e. without lines 4-7 of Algorithm 1), are presented in the supplementary material.

5.2 Real Data

We evaluate MT-SL on 33 languages from the Universal Dependencies (UNIDEP) 1.4 treebank [24], using the 17-tag universal part-of-speech (PoS) tagset. This dataset contains sentences from various languages where each word is annotated with Google universal PoS tags [25], and it can thus be seen as a collection of samples drawn from 33 distributions over strings on an alphabet of size 17. For each language, the available data is split into a training, a validation and a test set (80%, 10%, 10%). For each language and for various training sample sizes, we compare independently learning the target task with SL against using MT-SL to exploit training data from related tasks. We tested two ways of selecting the related tasks: (1) all other languages are used, and (2) for each language we selected the 4 closest languages w.r.t. the distance between the subspaces spanned by the top 50 left singular vectors of their Hankel matrices (the common basis $(\mathcal{P}, \mathcal{S})$ for these Hankel matrices is chosen by taking the union of the 100 most frequent prefixes and suffixes in each training sample). We compare MT-SL against SL (using only the training data for the target task) and against a naive baseline where the data from all tasks are bagged together and used as a training set for SL (SL-bagging). We also include the results obtained using MT-SL without the projection step (MT-SL-noproj). We report the average relative improvements of MT-SL, SL-bagging and MT-SL-noproj w.r.t. SL over all languages in Table 1; e.g. for perplexity we report $100 \cdot (p_{\mathrm{sl}} - p_{\mathrm{mt}}) / p_{\mathrm{sl}}$, where $p_{\mathrm{sl}}$ (resp. $p_{\mathrm{mt}}$) is the perplexity obtained by SL (resp. MT-SL) on the test set. We see that the multitask approach leads to improved results for both metrics, that the benefits tend to be greater for small training sizes, and that restricting the number of auxiliary tasks is overall beneficial.

Table 1: Average relative improvement with respect to single task spectral learning (SL) of the multitask approach (with and without the projection step: MT-SL and MT-SL-noproj) and of the bagging baseline (SL-bagging) on the UNIDEP dataset.

(a) Perplexity average relative improvement (in %).

Related tasks: all other languages

| Training size | 100 | 500 | 1000 | 5000 | all available data |
|---|---|---|---|---|---|
| MT-SL | 7.0744 (7.76) | 3.6666 (5.22) | 3.2879 (5.17) | 3.4187 (5.57) | 3.1574 (5.48) |
| MT-SL-noproj | 2.9884 (9.82) | 2.2469 (7.49) | 0.8509 (7.41) | 1.1658 (6.59) | 0.6958 (6.38) |
| SL-bagging | 19.00 (29.1) | 13.32 (22.4) | 10.65 (19.7) | 5.371 (14.6) | 2.630 (13.0) |

Related tasks: 4 closest languages

| Training size | 100 | 500 | 1000 | 5000 | all available data |
|---|---|---|---|---|---|
| MT-SL | 6.0069 (6.76) | 4.3670 (5.83) | 4.4049 (5.50) | 2.9689 (5.87) | 2.8229 (5.90) |
| MT-SL-noproj | 4.5732 (8.78) | 2.9421 (7.83) | 2.4549 (7.15) | 2.2166 (6.82) | 2.1451 (6.52) |
| SL-bagging | 18.41 (28.4) | 12.73 (22.0) | 10.34 (20.1) | 3.086 (12.7) | 0.1926 (10.2) |

(b) WER average relative improvement (in %).
Related tasks: all other languages

| Training size | 100 | 500 | 1000 | 5000 | all available data |
|---|---|---|---|---|---|
| MT-SL | 1.4919 (2.37) | 1.3786 (2.94) | 1.2281 (2.62) | 1.4964 (2.70) | 1.4932 (2.77) |
| MT-SL-noproj | 5.763 (6.82) | 9.454 (8.95) | 9.197 (7.25) | 9.201 (6.02) | 9.600 (5.55) |
| SL-bagging | 3.067 (10.8) | 6.998 (11.6) | 7.788 (9.88) | 8.791 (9.54) | 8.611 (9.74) |

Related tasks: 4 closest languages

| Training size | 100 | 500 | 1000 | 5000 | all available data |
|---|---|---|---|---|---|
| MT-SL | 2.0883 (3.26) | 1.5175 (2.87) | 1.2961 (2.57) | 1.3080 (2.55) | 1.2160 (2.31) |
| MT-SL-noproj | 4.139 (5.10) | 5.841 (6.29) | 5.399 (6.26) | 5.526 (4.93) | 5.556 (4.90) |
| SL-bagging | 0.3372 (7.80) | 3.045 (8.12) | 3.822 (7.33) | 4.350 (6.90) | 3.588 (7.06) |

To give a concrete example, on the Basque task with a training set of size 500, the WER was reduced from 76% for SL to 70% using all other languages as related tasks, and to 65% using the 4 closest tasks (Finnish, Polish, Czech and Indonesian). Overall, both SL-bagging and MT-SL-noproj obtain worse performance than MT-SL (though MT-SL-noproj still outperforms SL in terms of perplexity, while SL-bagging performs almost always worse than SL). Detailed results on all languages, along with the list of closest languages used for method (2), are reported in the supplementary material.

6 Conclusion

We introduced the novel model of vector-valued WFA, which allowed us to define a notion of relatedness between recognizable functions and to design a multitask spectral learning algorithm for WFAs (MT-SL). The benefits of MT-SL have been theoretically motivated and showcased in experiments on both synthetic and real data. In future work, we plan to apply MT-SL in the context of reinforcement learning and to identify other areas of machine learning where vv-WFAs could prove useful. It would also be interesting to investigate a weighted approach, such as the one presented in [19] for classical spectral learning; this could prove useful to handle the case where the amount of available training data differs greatly between tasks.

Acknowledgments

G. Rabusseau acknowledges support of an IVADO postdoctoral fellowship. B. Balle completed this work while at Lancaster University. We thank NSERC and CIFAR for their financial support.

References

[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, pages 41-48, 2007.
[2] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical inference as a principal component analysis problem. In ICML, pages 33-40, 2009.
[3] Borja Balle. Learning Finite-State Machines: Algorithmic and Statistical Aspects. PhD thesis, Universitat Politècnica de Catalunya, 2013.
[4] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral learning of weighted automata. Machine Learning, 96(1-2):33-63, 2014.
[5] Borja Balle, William L. Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In ICML, pages 1386-1394, 2014.
[6] Jonathan Baxter et al. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
[7] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines, pages 567-580. Springer, 2003.
[8] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. International Journal of Robotics Research, 30(7):954-966, 2011.
[9] T. Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. arXiv preprint arXiv:1605.00353, 2016.
[10] Jack W. Carlyle and Azaria Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26-40, 1971.
[11] Rich Caruana. Multitask learning. In Learning to Learn, pages 95-133. Springer, 1998.
[12] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle H. Ungar. Experiments with spectral learning of latent-variable PCFGs. In NAACL-HLT, pages 148-157, 2013.
[13] François Denis and Yann Esposito. On rational stochastic languages. Fundamenta Informaticae, 86(1-2):41-77, 2008.
[14] Michel Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53(9):197-222, 1974.
[15] Emily Fox, Michael I. Jordan, Erik B. Sudderth, and Alan S. Willsky. Sharing features among dynamical systems with beta processes. In NIPS, pages 549-557, 2009.
[16] Mark A. Girolami and Ata Kabán. Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles. In NIPS, volume 16, pages 9-16, 2003.
[17] Daniel J. Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[18] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
[19] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-rank spectral learning with weighted loss functions. In AISTATS, 2015.
[20] Ren-Cang Li. Relative perturbation theory: II. Eigenspace and singular subspace variations. SIAM Journal on Matrix Analysis and Applications, 20(2):471-492, 1998.
[21] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In IJCAI, pages 2873-2879, 2016.
[22] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.
[23] Kai Ni, Lawrence Carin, and David Dunson. Multi-task learning for sequential data via iHMMs and the nested Dirichlet process. In ICML, pages 689-696, 2007.
[24] Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. Universal Dependencies 1.4, 2016. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.
[25] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
[26] Michael Thon and Herbert Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework. Journal of Machine Learning Research, 16:103-147, 2015.
[27] Sicco Verwer, Rémi Eyraud, and Colin de la Higuera. Results of the PAutomaC probabilistic automaton learning competition. In ICGI, pages 243-248, 2012.
[28] Boyu Wang, Joelle Pineau, and Borja Balle. Multitask generalized eigenvalue program. In AAAI, pages 2115-2121, 2016.
[29] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99-111, 1972.
[30] Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In NIPS, pages 1649-1656, 2006.