Learning Shuffle Ideals Under Restricted Distributions

Dongqu Chen
Department of Computer Science, Yale University
dongqu.chen@yale.edu

The class of shuffle ideals is a fundamental sub-family of regular languages. The shuffle ideal generated by a string set U is the collection of all strings containing some string u ∈ U as a (not necessarily contiguous) subsequence. In spite of its apparent simplicity, the problem of learning a shuffle ideal from given data is known to be computationally intractable. In this paper, we study the PAC learnability of shuffle ideals and present positive results on this learning problem under element-wise independent and identical distributions and Markovian distributions in the statistical query model. A constrained generalization to learning shuffle ideals under product distributions is also provided. In the empirical direction, we propose a heuristic algorithm for learning shuffle ideals from given labeled strings under general unrestricted distributions. Experiments demonstrate the advantage of our algorithm in both efficiency and accuracy.

1 Introduction

The learnability of regular languages is a classic topic in computational learning theory. Applications of this learning problem include natural language processing (speech recognition, morphological analysis), computational linguistics, robotics and control systems, computational biology (phylogeny, structural pattern recognition), data mining, time series and music ([7, 14-18, 20, 21]). Exploring the learnability of the family of formal languages is significant to both theoretical and applied realms. Valiant's PAC learning model introduces a clean and elegant framework for mathematical analysis of machine learning and is one of the most widely studied theoretical learning models ([22]). In the PAC learning model, unfortunately, the class of regular languages, or equivalently the concept class of deterministic finite automata (DFA), is known to be inherently unpredictable ([1, 9, 19]). In a modified version of Valiant's model which allows the learner to make membership queries, Angluin [2] has shown that the concept class of regular languages is PAC learnable.

Figure 1: The DFA accepting precisely the shuffle ideal of U = (a|b|d)a(b|c) over Σ = {a, b, c, d}.

Throughout this paper we study the PAC learnability of a subclass of regular languages, the class of (extended) shuffle ideals. The shuffle ideal generated by an augmented string U is the collection of all strings containing some u ∈ U as a (not necessarily contiguous) subsequence, where an augmented string is a finite concatenation of symbol sets (see Figure 1 for an illustration). The special class of shuffle ideals generated by single strings is called the principal shuffle ideals. Unfortunately, even such a simple class is not PAC learnable unless RP = NP ([3]). However, in most application scenarios the strings are drawn from some particular distribution we are interested in. Angluin et al. [3] prove that under the uniform string distribution, principal shuffle ideals are PAC learnable. Nevertheless, the requirement of complete knowledge of the distribution, the dependence on the symmetry of the uniform distribution and the restriction to principal shuffle ideals limit the generality of their algorithm. Our main contribution in this paper is to present positive results
on learning the class of shuffle ideals under element-wise independent and identical distributions and Markovian distributions. Extensions of our main results include a constrained generalization to learning shuffle ideals under product distributions and a heuristic method for learning principal shuffle ideals under general unrestricted distributions.

After introducing the preliminaries in Section 2, we present our main result in Section 3: the extended class of shuffle ideals is PAC learnable from element-wise i.i.d. strings, that is, strings whose symbols are identically distributed and independent of each other. A constrained generalization to learning shuffle ideals under product distributions is also provided. In Section 4, we further show the PAC learnability of principal shuffle ideals when the example strings drawn from Σ^{≤n} are generated by a Markov chain, under some lower bound assumptions on the transition matrix. In Section 5, we propose a greedy algorithm for learning principal shuffle ideals under general unrestricted distributions. Experiments demonstrate the advantage in both efficiency and accuracy of our heuristic algorithm.

2 Preliminaries

We consider strings over a fixed finite alphabet Σ. The empty string is λ. Let Σ* be the Kleene star of Σ and 2^Σ be the collection of all subsets of Σ. As strings are concatenations of symbols, we similarly define augmented strings as concatenations of unions of symbols.

Definition 1 (Alphabet, simple string and augmented string) Let Σ be a non-empty finite set of symbols, called the alphabet. A simple string over Σ is any finite sequence of symbols from Σ, and Σ* is the collection of all simple strings. An augmented string over Σ is any finite concatenation of symbol sets from 2^Σ, and (2^Σ)* is the collection of all augmented strings.

Denote by s the cardinality of Σ. Because an augmented string only contains simple strings of the same length, the length of an augmented string U, denoted by |U|, is the length of any u ∈ U. We use exponential notation for repeated concatenation of a string with itself; that is, v^k is the concatenation of k copies of string v. Starting from index 1, we denote by v_i the i-th symbol in string v and use the notation v[i, j] = v_i ... v_j for 1 ≤ i ≤ j ≤ |v|.

Define the binary relation ⊑ between augmented (or simple) strings and simple strings as follows. For a simple string w, w ⊑ v holds if and only if there is a witness i = (i_1 < i_2 < ... < i_{|w|}) such that v_{i_j} = w_j for all integers 1 ≤ j ≤ |w|. For an augmented string W, W ⊑ v if and only if there exists some w ∈ W such that w ⊑ v. When there are several witnesses for W ⊑ v, we may order them coordinate-wise, referring to the unique minimal element as the leftmost embedding. We will write I_{W⊑v} to denote the position of the last symbol of W in its leftmost embedding in v (if the latter exists; otherwise, I_{W⊑v} = ∞).

Definition 2 (Extended/Principal Shuffle Ideal) The (extended) shuffle ideal of an augmented string U ∈ (2^Σ)^L is the regular language defined as X(U) = {v ∈ Σ* | ∃u ∈ U, u ⊑ v} = Σ* U_1 Σ* U_2 Σ* ... Σ* U_L Σ*. A shuffle ideal is principal if it is generated by a simple string.

A shuffle ideal is an ideal in the order-theoretic sense, a notion originally defined for lattices. Denote by Ш the class of principal shuffle ideals and by X the class of extended shuffle ideals. Unless otherwise stated, in this paper "shuffle ideal" refers to the extended ideal. An example is given in Figure 1. Determining whether a string belongs to X(U) is straightforward.

Lemma 1 Evaluating the relation U ⊑ x, and meanwhile determining I_{U⊑x}, is feasible in time O(|x|).
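Lemma 1 is witnessed by a single left-to-right greedy scan: repeatedly match the next symbol set of U at the earliest possible position of x. The Python sketch below illustrates this check; the representation of an augmented string as a list of symbol sets and the function name are our own illustrative choices, not notation from the paper.

```python
def leftmost_embedding(U, x):
    """Greedy evaluation of U ⊑ x for an augmented string U.

    U is a list of symbol sets (one set per position of the augmented string),
    x is a plain string. Returns the 1-based position I_{U⊑x} of the last
    symbol of U in its leftmost embedding in x, or None if U is not a
    (not necessarily contiguous) subsequence of x. Runs in O(|x|) time.
    """
    if not U:
        return 0                   # the empty augmented string is trivially embedded
    j = 0                          # how many positions of U are matched so far
    for i, symbol in enumerate(x, start=1):
        if symbol in U[j]:         # greedily consume the next symbol set of U
            j += 1
            if j == len(U):
                return i           # last position of the leftmost embedding
    return None                    # U is not embedded in x


# Example: U = (a|b|d) a (b|c) over Σ = {a, b, c, d}, as in Figure 1.
U = [{"a", "b", "d"}, {"a"}, {"b", "c"}]
print(leftmost_embedding(U, "cdaacb"))   # -> 5, so "cdaacb" is in X(U)
print(leftmost_embedding(U, "ccba"))     # -> None, so "ccba" is not in X(U)
```

Matching each symbol set at its earliest possible position is what makes the returned witness the leftmost embedding.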
In a computational learning model, an algorithm is usually given access to an oracle providing information about the sample. In Valiant's work [22], the example oracle EX(c, D) was defined, where c is the target concept and D is a distribution over the instance space. On each call, EX(c, D) draws an input x independently at random from the instance space I under the distribution D, and returns the labeled example ⟨x, c(x)⟩.

Definition 3 (PAC Learnability: [22]) Let C be a concept class over the instance space I. We say C is probably approximately correctly (PAC) learnable if there exists an algorithm A with the following property: for every concept c ∈ C, for every distribution D on I, and for all 0 < ϵ < 1/2 and 0 < δ < 1/2, if A is given access to EX(c, D) on I and inputs ϵ and δ, then with probability at least 1 − δ, A outputs a hypothesis h ∈ H satisfying Pr_{x∼D}[c(x) ≠ h(x)] ≤ ϵ. If A runs in time polynomial in 1/ϵ, 1/δ and the representation size of c, we say that C is efficiently PAC learnable.

We refer to ϵ as the error parameter and δ as the confidence parameter. If the error parameter is set to 0, the learning is exact ([6]). Kearns [11] extended Valiant's model and introduced the statistical query oracle STAT(c, D). Kearns' oracle takes as input a statistical query of the form (χ, τ), where χ is any mapping of a labeled example to {0, 1} and τ ∈ [0, 1] is called the noise tolerance. STAT(c, D) returns an estimate of the expectation E[χ], that is, the probability that χ = 1 when the labeled example is drawn according to D. A statistical query may carry a condition, in which case E[χ] is a conditional probability. The estimate is accurate within additive error τ.

Definition 4 (Legitimacy and Feasibility: [11]) A statistical query χ is legitimate and feasible if and only if, with respect to 1/ϵ, 1/τ and the representation size of c:
1. Query χ maps a labeled example ⟨x, c(x)⟩ to {0, 1};
2. Query χ can be evaluated in polynomial time;
3. The condition of χ, if any, can be evaluated in polynomial time;
4. The probability of the condition of χ, if any, is at least polynomially large.

Throughout this paper, the learnability of shuffle ideals is studied in the statistical query model. Kearns [11] proves that the oracle STAT(c, D) is weaker than the oracle EX(c, D): if a concept class is PAC learnable from STAT(c, D), then it is PAC learnable from EX(c, D), but not necessarily vice versa.

3 Learning shuffle ideals from element-wise i.i.d. strings

Although learning the class of shuffle ideals has been proved hard in general, in most scenarios the string distribution is restricted or even known. A very usual situation in practice is that we have some prior knowledge of the unknown distribution. One common example is the family of string distributions in which each symbol of a string is generated independently and identically from an unknown distribution. We call such strings element-wise i.i.d. because we view a string as a vector of symbols. This case is general enough to cover some popular distributions in applications, such as the uniform distribution and multinomial distributions. In this section, we present as our main result a statistical query algorithm for learning the concept class of extended shuffle ideals from element-wise i.i.d. strings, and provide theoretical guarantees of its computational efficiency and accuracy in the statistical query model. The instance space is Σ^n. Denote by U the augmented pattern string that generates the target shuffle ideal and by L = |U| the length of U.
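The algorithms below interact with the learning problem only through such statistical queries. As a reminder of how a query (χ, τ) would be answered in practice, here is a minimal sketch that simulates STAT(c, D) by empirical averaging over labeled examples from EX(c, D); the simulation idea is standard and due to Kearns [11], while the function names and interface are illustrative assumptions. A conditional query would additionally discard examples that fail the condition before averaging.

```python
import math

def answer_statistical_query(chi, draw_example, tau, delta=0.01):
    """Estimate E[chi(x, y)] within additive error tau, as STAT(c, D) would.

    chi          -- maps a labeled example (x, y) to {0, 1}
    draw_example -- plays the role of EX(c, D): returns one labeled pair (x, y)
    tau          -- the noise tolerance of the simulated query
    delta        -- failure probability of the simulation

    By Hoeffding's inequality, ceil(ln(2/delta) / (2 * tau**2)) independent
    samples make the empirical mean tau-accurate with probability >= 1 - delta.
    """
    m = math.ceil(math.log(2.0 / delta) / (2.0 * tau ** 2))
    return sum(chi(*draw_example()) for _ in range(m)) / m
```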
3.1 Statistical query algorithm

Before presenting the algorithm, we define the function θ_{V,a}(·) and the query χ_{V,a}(·, ·) for any augmented string V ∈ (2^Σ)^{≤n} and any symbol a ∈ Σ as follows:

θ_{V,a}(x) = x_{I_{V⊑x}+1} if V ⊑ x[1, n−1], and θ_{V,a}(x) = a otherwise;

χ_{V,a}(x, y) = (y + 1)/2, given θ_{V,a}(x) = a,

where y = c(x) is the label of the example string x. More precisely, y = +1 if x ∈ X(U) and y = −1 otherwise.

Our learning algorithm uses statistical queries to recover the string U ∈ (2^Σ)^L one element at a time. It starts with the empty string V = λ. Having recovered V = U[1, ℓ] where 0 ≤ ℓ < L, we infer U_{ℓ+1} as follows. For each a ∈ Σ, the statistical query oracle is called with the query χ_{V,a} at the error tolerance τ claimed in Theorem 1. Our key technical observation is that the value of E[χ_{V,a}] effectively selects U_{ℓ+1}. The query results of the χ_{V,a} form two separate clusters such that the maximum difference (variance) inside one cluster is smaller than the minimum difference (gap) between the two clusters, making them distinguishable. The set of symbols in the cluster with the larger query results is proved to be U_{ℓ+1}. Notice that this statistical query only works for 0 ≤ ℓ < L. To complete the procedure, the algorithm addresses the remaining case ℓ = L with the query Pr[y = +1 | V ⊑ x] and halts if the query answer is close to 1.
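The following Python sketch summarizes this procedure. It assumes the oracle is exposed as a callable stat_query(value, condition) returning the conditional expectation within the stated tolerance, and it reuses the leftmost_embedding routine from the sketch after Lemma 1; the halting threshold and the gap-based clustering rule are illustrative choices consistent with the description above, not pseudocode from the paper.

```python
def theta(V, a, x):
    """theta_{V,a}(x): the symbol following the leftmost embedding of V in
    x[1, n-1], or the default symbol a if V is not embedded there."""
    pos = leftmost_embedding(V, x[:-1])       # 1-based end position or None
    return x[pos] if pos is not None else a   # x[pos] is x_{pos+1} in 1-based terms

def learn_shuffle_ideal(sigma, n, stat_query, tau):
    """Recover the augmented pattern string U one symbol set at a time.

    sigma      -- the alphabet (assumed to contain at least two symbols)
    stat_query -- plays the role of STAT(X(U), D); returns E[value | condition]
                  within additive error tau
    Returns the recovered augmented string as a list of symbol sets.
    """
    positive = lambda x, y: (y + 1) / 2       # 1 on positive examples, else 0
    V = []                                    # V = U[1, l] recovered so far
    for _ in range(n):                        # |U| <= n, so at most n rounds
        # Case l = L: Pr[y = +1 | V embedded in x] is close to 1, so halt.
        if stat_query(positive,
                      condition=lambda x, y: leftmost_embedding(V, x) is not None
                      ) >= 1 - 2 * tau:
            break
        # One conditional query E[chi_{V,a}] per symbol a.
        answers = {a: stat_query(positive,
                                 condition=lambda x, y, a=a: theta(V, a, x) == a)
                   for a in sigma}
        # The answers form two clusters separated by a gap larger than the
        # spread within either cluster; split them at the largest gap.
        ordered = sorted(sigma, key=answers.get)
        gaps = [answers[ordered[i + 1]] - answers[ordered[i]]
                for i in range(len(ordered) - 1)]
        cut = max(range(len(gaps)), key=gaps.__getitem__) + 1
        V.append(set(ordered[cut:]))          # the high cluster is U_{l+1}
    return V
```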
3.2 PAC learnability of ideal X

We show that the algorithm described above learns the class of shuffle ideals from element-wise i.i.d. strings in the statistical query learning model.

Theorem 1 Under element-wise independent and identical distributions over instance space I = Σ^n, concept class X is approximately identifiable with O(sn) conditional statistical queries from STAT(X, D) at tolerance

τ = ϵ² / (40sn² + 4ϵ),

or with O(sn) statistical queries from STAT(X, D) at tolerance

τ = (1 − ϵ/(20sn² + 2ϵ)) · ϵ³ / (16sn(10sn² + ϵ)).

We provide the main idea of the proofs in this section and defer the details and algebra to Appendix A. The proof starts from the legitimacy and feasibility of the algorithm. Since χ_{V,a} computes a binary mapping from labeled examples to {0, 1}, legitimacy is trivial. But χ_{V,a} is not feasible for symbols in Σ of small occurrence probability. We avoid the problematic cases by reducing the original learning problem to the same problem with the polynomial lower bound assumption Pr[x_i = a] ≥ ϵ/(2sn) − ϵ²/(20sn² + 2ϵ) for any a ∈ Σ, and thereby achieve feasibility.

The correctness of the algorithm is based on the intuition that the query result E[χ_{V,a+}] of a symbol a+ ∈ U_{ℓ+1} should be greater than that of a symbol a− ∉ U_{ℓ+1}, and that the difference is large enough to tolerate the noise from the oracle. To prove this, we first consider the exact learning case. Define an infinite augmented string U′ = U[1, ℓ] U[ℓ+2, L] (U_{ℓ+1})^∞ and let x′ be the extension of x obtained by padding it on the right with an infinite string generated from the same distribution as x. Let Q(j, i) be the probability that the largest g such that U′[1, g] ⊑ x′[1, i] is j, or formally

Q(j, i) = Pr[U′[1, j] ⊑ x′[1, i] ∧ U′[1, j + 1] ⋢ x′[1, i]].

By taking the difference between E[χ_{V,a+}] and E[χ_{V,a−}] in terms of Q(j, i), we get the query tolerance for exact learning.

Lemma 2 Under element-wise independent and identical distributions over instance space I = Σ^n, concept class X is exactly identifiable with O(sn) conditional statistical queries from STAT(X, D) at tolerance τ = Q(L − 1, n − 1)/5.

Lemma 2 indicates that bounding the quantity Q(L − 1, n − 1) is the key to the tolerance for PAC learning. Unfortunately, the distribution {Q(j, i)} doesn't seem to have any strong property we know of that provides a polynomial lower bound. Instead we introduce a new quantity

R(j, i) = Pr[U′[1, j] ⊑ x′[1, i] ∧ U′[1, j] ⋢ x′[1, i − 1]],

the probability that the smallest g such that U′[1, j] ⊑ x′[1, g] is i. An important property of the distribution {R(j, i)} is its strong unimodality, defined below.

Definition 5 (Unimodality: [8]) A distribution {P(i)} with all support on the lattice of integers is unimodal if and only if there exists at least one integer K such that P(i) ≥ P(i − 1) for all i ≤ K and P(i + 1) ≤ P(i) for all i ≥ K. We say K is a mode of the distribution {P(i)}.

Throughout this paper, when referring to the mode of a distribution, we mean the one with the largest index if the distribution has multiple modes with equal probabilities.

Definition 6 (Strong Unimodality: [10]) A distribution {H(i)} is strongly unimodal if and only if the convolution of {H(i)} with any unimodal distribution {P(i)} is unimodal.

Since a distribution with all its mass at zero is unimodal, a strongly unimodal distribution is also unimodal. In this paper, we only consider distributions with all support on the lattice of integers, so the convolution of {H(i)} and {P(i)} is

(H ∗ P)(i) = Σ_{j=−∞}^{∞} H(j) P(i − j) = Σ_{j=−∞}^{∞} H(i − j) P(j).

We prove the strong unimodality of {R(j, i)} with respect to i by showing, via induction, that it is the convolution of two log-concave distributions. We make an initial statistical query to estimate Pr[y = +1] in order to handle the two marginal cases Pr[y = +1] ≤ ϵ/2 and Pr[y = +1] ≥ 1 − ϵ/2. After that, an additional query Pr[y = +1 | V ⊑ x] is made to tell whether ℓ = L. If the algorithm doesn't halt, then ℓ < L and both Pr[y = +1] and Pr[y = −1] are at least ϵ/2 − 2τ. By upper bounding Pr[y = +1] and Pr[y = −1] using linear sums of R(j, i), the strong unimodality of {R(j, i)} gives a lower bound for R(L, n), which further implies one for Q(L − 1, n − 1) and completes the proof.

3.3 A generalization to instance space Σ^{≤n}

We have proved that the extended class of shuffle ideals is PAC learnable from element-wise i.i.d. fixed-length strings. Nevertheless, in many real-world applications such as natural language processing and computational linguistics, it is more natural to have strings of varying lengths. Let n be the maximum length of the sample strings, so the instance space for learning is Σ^{≤n}. Here we show how to generalize the statistical query algorithm of Section 3.1 to the more general instance space Σ^{≤n}.

Let A_i be the algorithm of Section 3.1 for learning shuffle ideals from element-wise i.i.d. strings of fixed length i. Because the instance space is Σ^{≤n} = ∪_{i≤n} Σ^i, we divide the sample S into n subsets {S_i} where S_i = {x | |x| = i}. An initial statistical query is then made to estimate the probability Pr[|x| = i] for each i ≤ n at tolerance ϵ/(8n). We discard all subsets S_i with query answer at most 3ϵ/(8n) from the learning procedure, because for these we know Pr[|x| = i] ≤ ϵ/(2n). Since there are at most (n − 1) such S_i of low occurrence probability, the total probability that an instance comes from one of these negligible sets is at most ϵ/2. Otherwise, the query answer exceeds 3ϵ/(8n), so Pr[|x| = i] ≥ ϵ/(4n), and we apply algorithm A_i on each such S_i with error parameter ϵ/2. Because the probability of the condition is polynomially large, the algorithm is feasible. Finally, the total error over the whole instance space is bounded by ϵ, and concept class X is PAC learnable from element-wise i.i.d. strings over instance space Σ^{≤n}.
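Corollary 1 below states the resulting guarantee. The reduction just described amounts to a thin wrapper around the fixed-length learner; the sketch below illustrates it, where stat_query(value) is assumed to return the (unconditional) expectation within tolerance eps/(8*n) and learn_fixed_length(i, eps) stands for running the algorithm A_i of Section 3.1 on length-i strings with error parameter eps (its own, length-conditioned queries are omitted here).

```python
def learn_over_variable_length(n, stat_query, eps, learn_fixed_length):
    """Reduction of Section 3.3 from instance space Sigma^{<=n} to Sigma^i.

    For each length i <= n, Pr[|x| = i] is estimated at tolerance eps/(8*n).
    Lengths whose estimate is at most 3*eps/(8*n) are discarded (each carries
    at most eps/(2*n) probability, hence at most eps/2 in total); on the
    remaining lengths the fixed-length learner is run with error eps/2.
    Returns a dict mapping each kept length to its hypothesis.
    """
    hypotheses = {}
    for i in range(1, n + 1):
        p_i = stat_query(lambda x, y, i=i: 1.0 if len(x) == i else 0.0)
        if p_i <= 3 * eps / (8 * n):
            continue                      # negligible length, safely skipped
        hypotheses[i] = learn_fixed_length(i, eps / 2)
    return hypotheses
```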
Corollary 1 Under element-wise independent and identical distributions over instance space I = Σ^{≤n}, concept class X is approximately identifiable with O(sn²) conditional statistical queries from STAT(X, D) at tolerance

τ = ϵ² / (160sn² + 8ϵ),

or with O(sn²) statistical queries from STAT(X, D) at tolerance

τ = (1 − ϵ/(40sn² + 2ϵ)) · ϵ⁴ / (512sn²(20sn² + ϵ)).

3.4 A constrained generalization to product distributions

A direct generalization of element-wise independent and identical distributions is the class of product distributions. A random string, or a random vector of symbols, under a product distribution has element-wise independence between its elements; that is, Pr[X = x] = ∏_{i=1}^{|x|} Pr[X_i = x_i]. Although strings under product distributions share many independence properties with element-wise i.i.d. strings, the algorithm of Section 3.1 is not directly applicable to this case, as the distribution {R(j, i)} defined above is not unimodal with respect to i in general. However, the intuition that, given I_{V⊑x} = h, the strings with x_{h+1} ∈ U_{ℓ+1} have a higher probability of being positive than the strings with x_{h+1} ∉ U_{ℓ+1} is still true under product distributions. Thus we generalize the query χ_{V,a} and define, for any V ∈ (2^Σ)^{≤n}, a ∈ Σ and h ∈ [0, n − 1],

χ_{V,a,h}(x, y) = (y + 1)/2, given I_{V⊑x} = h and x_{h+1} = a,

where y = c(x) is the label of the example string x. To ensure the legitimacy and feasibility of the algorithm, we have to attach the lower bound assumption Pr[x_i = a] ≥ t > 0 for 1 ≤ i ≤ n and a ∈ Σ. Appendix C provides a constrained algorithm based on this intuition. Let P(+|a, h) denote E[χ_{V,a,h}]. If the difference P(+|a+, h) − P(+|a−, h) is large enough for some h with non-negligible Pr[I_{V⊑x} = h], then we are able to learn the next element of U. Otherwise, the difference is very small and we will show that there is an interval starting from index (h + 1) which we can skip with little risk. The algorithm is able to classify any string whose classification process skips O(1) intervals. Details of this constrained generalization are deferred to Appendix C.

4 Learning principal shuffle ideals from Markovian strings

Markovian strings are widely studied in natural language processing and biological sequence modeling. Formally, a random string x is Markovian if the distribution of x_{i+1} depends only on the value of x_i: Pr[x_{i+1} | x_1 ... x_i] = Pr[x_{i+1} | x_i] for any i ≥ 1. If we denote by π_0 the distribution of x_1 and define the s × s stochastic matrix M by M(a_1, a_2) = Pr[x_{i+1} = a_1 | x_i = a_2], then a random string can be viewed as a Markov chain with initial distribution π_0 and transition matrix M. We choose Σ^{≤n} as the instance space in this section and assume independence between the string length and the symbols in the string. We assume Pr[|x| = k] ≥ t for all 1 ≤ k ≤ n and min{M(·, ·), π_0(·)} ≥ c for some positive t and c. We will prove the PAC learnability of class Ш under this lower bound assumption. Denote by u the target pattern string and let L = |u|.

4.1 Statistical query algorithm

Starting with the empty string v = λ, the pattern string u is recovered one symbol at a time. Having recovered v = u[1, ℓ], we infer u_{ℓ+1} by Ψ_{v,a} = Σ_{k=h+1}^{n} E[χ_{v,a,k}(x, y)], where

χ_{v,a,k}(x, y) = (y + 1)/2, given I_{v⊑x} = h, x_{h+1} = a and |x| = k,

0 ≤ ℓ < L, and h is chosen from [0, n − 1] such that the probability Pr[I_{v⊑x} = h] is polynomially large. The statistical queries χ_{v,a,k} are made at the tolerance τ claimed in Theorem 2, and the symbol with the largest query result Ψ_{v,a} is proved to be u_{ℓ+1}. Again, the case ℓ = L is addressed by the query Pr[y = +1 | v ⊑ x]: the learning procedure is complete once the query result is close to 1.
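A schematic sketch of this procedure is given below. As before, the oracle is assumed to be exposed as stat_query(value, condition) with the claimed tolerance, and choose_h(v) stands for the (unspecified here) subroutine that returns an index h ∈ [0, n − 1] for which Pr[I_{v⊑x} = h] is polynomially large; these interfaces and the halting threshold are illustrative assumptions rather than the paper's pseudocode.

```python
def recover_markovian_pattern(sigma, n, stat_query, tau, choose_h):
    """Recover a principal pattern string u from Markovian strings (Section 4.1)."""
    positive = lambda x, y: (y + 1) / 2          # 1 on positive examples

    def embedding_end(v, x):
        """I_{v ⊑ x}: 1-based end of the leftmost embedding of v in x, or None."""
        pos = 0
        for ch in v:
            pos = x.find(ch, pos) + 1            # 1-based position of the match
            if pos == 0:
                return None                      # v is not embedded in x
        return pos

    v = ""
    for _ in range(n):
        # Case l = L: Pr[y = +1 | v embedded in x] is close to 1, so halt.
        if stat_query(positive,
                      condition=lambda x, y: embedding_end(v, x) is not None
                      ) >= 1 - 2 * tau:
            break
        h = choose_h(v)
        # Psi_{v,a} = sum over string lengths k > h of E[chi_{v,a,k}].
        psi = {a: sum(stat_query(positive,
                                 condition=lambda x, y, a=a, k=k: len(x) == k
                                 and embedding_end(v, x) == h and x[h] == a)
                      for k in range(h + 1, n + 1))
               for a in sigma}
        v += max(psi, key=psi.get)               # symbol with the largest Psi
    return v
```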
4.2 PAC learnability of principal ideal Ш

With the query Ψ_{v,a}, we are able to recover the pattern string u approximately from STAT(Ш(u), D) at a proper tolerance, as stated in Theorem 2.

Theorem 2 Under Markovian string distributions over instance space I = Σ^{≤n}, given Pr[|x| = k] ≥ t > 0 for 1 ≤ k ≤ n and min{M(·, ·), π_0(·)} ≥ c > 0, concept class Ш is approximately identifiable with O(sn²) conditional statistical queries from STAT(Ш, D) at tolerance

τ = ϵ / (3n² + 2n + 2),

or with O(sn²) statistical queries from STAT(Ш, D) at tolerance

τ = ctnϵ² / (3n² + 2n + 2)².

Please refer to Appendix B for a complete proof of Theorem 2. Due to the probability lower bound assumptions, the legitimacy and feasibility are obvious. To calculate the tolerance for PAC learning, we first consider the exact learning tolerance. Let x′ be an infinite string generated by the Markov chain defined above. For any 0 ≤ ℓ ≤ L − j, we define the quantity R_ℓ(j, i) by

R_ℓ(j, i) = Pr[u[ℓ+1, ℓ+j] ⊑ x′[m+1, m+i] ∧ u[ℓ+1, ℓ+j] ⋢ x′[m+1, m+i−1] | x′_m = u_ℓ].

Intuitively, R_ℓ(j, i) is the probability that the smallest g such that u[ℓ+1, ℓ+j] ⊑ x′[m+1, m+g] is i, given x′_m = u_ℓ. We have the following conclusion on the exact learning tolerance.

Lemma 3 Under Markovian string distributions over instance space I = Σ^{≤n}, given Pr[|x| = k] ≥ t > 0 for 1 ≤ k ≤ n and min{M(·, ·), π_0(·)} ≥ c > 0, the concept class Ш is exactly identifiable with O(sn²) conditional statistical queries from STAT(Ш, D) at a tolerance of the form τ = min_{0 ≤ ℓ ≤ L−1} τ_ℓ, where each τ_ℓ is determined by the quantities R_ℓ(j, i).

Moreover, the assumption min{M(·, ·), π_0(·)} ≥ c > 0 can be relaxed to a lower bound only along the pattern string: given Pr[|x| = k] ≥ t > 0 for 1 ≤ k ≤ n, π_0(u_1) ≥ c and M(u_{ℓ+1}, u_ℓ) ≥ c > 0 for 1 ≤ ℓ ≤ L − 1, concept class Ш is approximately identifiable with O(sn²) conditional statistical queries from STAT(Ш, D) at tolerance

τ = min{ ϵ/(3n² + 2n + 2), c/3 },

or with O(sn²) statistical queries from STAT(Ш, D) at tolerance

τ = min{ ctnϵ²/(3n² + 2n + 2)², tnϵc²/(3(3n² + 2n + 2)) }.

5 Learning shuffle ideals under general distributions

Although the string distribution is restricted or even known in most application scenarios, one might be interested in learning shuffle ideals under general unrestricted and unknown distributions without any prior knowledge. Unfortunately, under standard complexity assumptions, the answer is negative. Angluin et al. [3] have shown that a polynomial time PAC learning algorithm for principal shuffle ideals would imply the existence of polynomial time algorithms to break the RSA cryptosystem, factor Blum integers, and test quadratic residuosity.

Theorem 3 ([3]) For any alphabet of size at least 2, given two disjoint sets of strings S, T ⊆ Σ^{≤n}, the problem of determining whether there exists a string u such that u ⊑ x for each x ∈ S and u ⋢ x for each x ∈ T is NP-complete.

As class Ш is a subclass of class X, we know that learning ideal X is only harder. Is the problem easier over instance space Σ^n? The answer is again no.

Lemma 4 Under general unrestricted string distributions, a concept class is PAC learnable over instance space Σ^{≤n} if and only if it is PAC learnable over instance space Σ^n.

The proof of Lemma 4 is presented in Appendix D and uses the same idea as our generalization in Section 3.3. Note that Lemma 4 holds under general string distributions; it is not necessarily true when we have assumptions on the marginal distribution of the string length. Despite the infeasibility of PAC learning a shuffle ideal in theory, it is worth exploring the possibilities of performing the classification task without theoretical guarantees, since most applications care more about empirical performance than about theoretical results.
For this purpose we propose a heuristic greedy algorithm for learning principal shuffle ideals based on the following reward strategy. Upon having recovered v = û[1, ℓ], for a symbol a ∈ Σ and a string x of length n, we say a consumes k elements in x if min{I_{va⊑x}, n + 1} − I_{v⊑x} = k. The reward strategy depends on the ratio r+/r−: the algorithm receives r− reward for each element it consumes in a negative example and r+ penalty for each symbol it consumes in a positive string. A symbol is chosen as û_{ℓ+1} if it brings the most reward. The algorithm halts once û exhausts some positive example and makes a false negative error, which means we have gone too far. Finally the ideal Ш(û[1, ℓ − 1]) is returned as the hypothesis.

The performance of this greedy algorithm depends a great deal on the selection of the parameter r+/r−. A clever choice is r+/r− = #(−)/#(+), where #(+) is the number of positive examples x such that û ⊑ x and #(−) is the number of negative examples x such that û ⊑ x. A more recommended but more complex strategy for determining the parameter r+/r− in practice is cross-validation.
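The sketch below illustrates this greedy procedure; the tie-breaking, the per-round re-estimation of r+/r− from the still-embedded examples, and the exact halting bookkeeping are our own illustrative choices within the strategy described above.

```python
def greedy_shuffle_ideal(positives, negatives, sigma, ratio=None):
    """Greedy heuristic learner for a principal shuffle ideal (Section 5 sketch).

    positives, negatives -- lists of example strings; sigma -- the alphabet.
    ratio is r+/r-; if None, it is re-estimated each round as #(-)/#(+).
    Returns the recovered pattern string u-hat.
    """
    def consumed(pos, x, a):
        """Elements of x consumed by symbol a when the current prefix's leftmost
        embedding ends at 1-based position pos; len(x)+1-pos means a is absent."""
        nxt = x.find(a, pos)                     # pos is also the 0-based start
        return (len(x) + 1 if nxt < 0 else nxt + 1) - pos

    u = ""
    pos_p = [0] * len(positives)                 # embedding end of u-hat per example
    pos_n = [0] * len(negatives)
    rounds = max((len(x) for x in positives), default=0) + 1
    for _ in range(rounds):                      # a false negative must occur by then
        n_live = sum(pos_n[i] <= len(x) for i, x in enumerate(negatives))
        r = ratio if ratio is not None else max(n_live, 1) / max(len(positives), 1)

        def reward(a):                           # r- per negative element consumed,
            gain = sum(consumed(p, x, a) for p, x in zip(pos_n, negatives))
            cost = sum(consumed(p, x, a) for p, x in zip(pos_p, positives))
            return gain - r * cost               # minus r+ per positive element

        a = max(sigma, key=reward)
        if any(p + consumed(p, x, a) > len(x) for p, x in zip(pos_p, positives)):
            return u                             # false negative: drop a and stop
        u += a
        pos_p = [p + consumed(p, x, a) for p, x in zip(pos_p, positives)]
        pos_n = [p + consumed(p, x, a) for p, x in zip(pos_n, negatives)]
    return u
```

A new string x is then classified as positive if and only if the returned pattern embeds into it, e.g. via the leftmost-embedding check of Section 2.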
A better studied approach to learning regular languages, especially the piecewise-testable ones, in recent works is kernel machines ([12, 13]). An obvious advantage of kernel machines over our greedy method is their broad applicability to general classification problems. Nevertheless, the time complexity of the kernel machine is O(N³ + n²N²) on a training sample set of size N ([5]), while our greedy method only takes O(snN) time due to its great simplicity. Because N is usually huge for the demand of accuracy, kernel machines suffer from low efficiency and long running times in practice. To compare the empirical performance of the greedy method and kernel machines, we conducted a series of experiments on a real-world dataset [4] with the string length n as a variable. The experimental results demonstrate the empirical advantage in both efficiency and accuracy of the greedy algorithm over the kernel method, in spite of its simplicity. As this is a theoretical paper, we defer the details of the experiments to Appendix D, including the experiment setup and figures of detailed experimental results.

6 Discussion

We have shown positive results for learning shuffle ideals in the statistical query model under element-wise independent and identical distributions and Markovian distributions, as well as a constrained generalization to product distributions. It remains open to explore the possibilities of learning shuffle ideals under less restricted distributions with weaker assumptions. Also, much more work needs to be done on approximately learning shuffle ideals in applications with pragmatic approaches. In the negative direction, even a family of regular languages as simple as the shuffle ideals is not efficiently properly PAC learnable under general unrestricted distributions unless RP = NP. Thus, the search for a nontrivial properly PAC learnable family of regular languages continues. Another theoretical question that remains is how hard the problem of learning shuffle ideals is, or whether PAC learning a shuffle ideal is as hard as PAC learning a deterministic finite automaton.

Acknowledgments

We give our sincere gratitude to Professor Dana Angluin of Yale University for valuable discussions and comments on the learning problem and the proofs. Our thanks are also due to Professor Joseph Chang of Yale University for suggesting supportive references on strong unimodality of probability distributions, and to the anonymous reviewers for their helpful feedback.

References

[1] D. Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39(3):337-350, 1978.
[2] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87-106, Nov. 1987.
[3] D. Angluin, J. Aspnes, S. Eisenstat, and A. Kontorovich. On the learnability of shuffle ideals. Journal of Machine Learning Research, 14:1513-1531, 2013.
[4] K. Bache and M. Lichman. NSF research award abstracts 1990-2003 data set. UCI Machine Learning Repository, 2013.
[5] L. Bottou and C.-J. Lin. Support vector machine solvers. Large Scale Kernel Machines, pages 301-320, 2007.
[6] N. H. Bshouty. Exact learning of formulas in parallel. Machine Learning, 26(1):25-41, Jan. 1997.
[7] C. de la Higuera. A bibliographical study of grammatical inference. Pattern Recognition, 38(9):1332-1348, Sept. 2005.
[8] B. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley Series in Statistics, 1949.
[9] E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302-320, 1978.
[10] I. Ibragimov. On the composition of unimodal distributions. Theory of Probability and Its Applications, 1(2):255-260, 1956.
[11] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983-1006, Nov. 1998.
[12] L. A. Kontorovich, C. Cortes, and M. Mohri. Kernel methods for learning languages. Theoretical Computer Science, 405(3):223-236, Oct. 2008.
[13] L. A. Kontorovich and B. Nadler. Universal kernel-based learning with applications to regular languages. Journal of Machine Learning Research, 10:1095-1129, June 2009.
[14] K. Koskenniemi. Two-level model for morphological analysis. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence - Volume 2, pages 683-685, 1983.
[15] M. Mohri. On some applications of finite-state automata theory to natural language processing. Journal of Natural Language Engineering, 2(1):61-80, Mar. 1996.
[16] M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269-311, June 1997.
[17] M. Mohri, P. J. Moreno, and E. Weinstein. Efficient and robust music identification with weighted finite-state transducers. IEEE Transactions on Audio, Speech, and Language Processing, 18(1):197-207, Jan. 2010.
[18] M. Mohri, F. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69-88, 2002.
[19] L. Pitt and M. K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM, 40(1):95-142, Jan. 1993.
[20] O. Rambow, S. Bangalore, T. Butt, A. Nasr, and R. Sproat. Creating a finite-state parser with application semantics. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 2, pages 1-5, 2002.
[21] R. Sproat, W. Gale, C. Shih, and N. Chang. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377-404, Sept. 1996.
[22] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, Nov. 1984.