# Probabilistic Rule Realization and Selection

Haizi Yu (Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, haiziyu7@illinois.edu)
Tianxi Li (Department of Statistics, University of Michigan, Ann Arbor, MI 48109, tianxili@umich.edu)
Lav R. Varshney (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, varshney@illinois.edu)

Equal contribution. Supported in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM Cognitive Horizons Network.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## Abstract

Abstraction and realization are bilateral processes that are key in deriving intelligence and creativity. In many domains, the two processes are approached through rules: high-level principles that reveal invariances within similar yet diverse examples. Under a probabilistic setting for discrete input spaces, we focus on the rule realization problem, which generates input sample distributions that follow the given rules. More ambitiously, we go beyond a mechanical realization that takes whatever is given, and instead ask for proactively selecting reasonable rules to realize. This goal is demanding in practice, since the initial rule set may not always be consistent and thus intelligent compromises are needed. We formulate both rule realization and selection as two strongly connected components within a single and symmetric bi-convex problem, and derive an efficient algorithm that works at large scale. Taking music compositional rules as the main example throughout the paper, we demonstrate our model's efficiency in not only music realization (composition) but also music interpretation and understanding (analysis).

## 1 Introduction

Abstraction is a conceptual process by which high-level principles are derived from specific examples; realization, the reverse process, applies the principles to generalize [1, 2]. The two, once combined, form the art and science of developing knowledge and intelligence [3, 4]. Neural networks have recently become popular in modeling the two processes, with the belief that the neurons, as distributed data representations, are best organized hierarchically in a layered architecture [5, 6]. Probably the most relevant such examples are auto-encoders, where the cascaded encoder and decoder respectively model abstraction and realization. From a different angle that aims for interpretability, this paper first defines a high-level data representation as a partition of the raw input space, and then formalizes abstraction and realization as bi-directional probability inferences between the raw inputs and their high-level representations.

While abstraction and realization are ubiquitous across knowledge domains, this paper embodies the two as theory and composition in music, and refers to high-level representations of music as compositional rules. Historically, theorists [7, 8] devised rules and guidelines to describe compositional regularities, resulting in music theory that serves as the formal language to speak of music style and composers' decisions. Automatic music theorists [9–11] have also been recently developed to extract probabilistic rules in an interpretable way. Both human theorists and auto-theorists enable teaching of music composition via rules such as avoiding parallel octaves and resolving tendency tones.
So, writing music, to a certain extent (e.g. realizing a part-writing exercise), becomes the process of generating legitimate music realizations that satisfy the given rules. This paper focuses on the realization process in music, assuming rules are given by a preceding abstraction step. There are two main challenges. First, the rule realization problem arises when one asks for efficient and diverse music generation satisfying the given rules. Depending on the rule representation (hard or probabilistic), there are search-based systems that realize hard-coded rules to produce music pieces [12, 13], as well as statistical models that realize probabilistic rules to produce distributions of music pieces [9, 14]. Both types of realizations typically suffer from the enormity of the sample space, a curse of input dimensionality. Second, the rule selection problem (which is subtler) arises because not all rules are equally important, nor are they always consistent. In some cases, a perfect and all-inclusive realization is not possible, which requires relaxation or sacrifice of some rules. In other cases, composers intentionally break certain rules to establish unique styles. So the freedom and creativity in selecting the right rules for realization poses the challenge.

The main contribution of the paper is to propose and implement a unified framework that makes reasonable rule selections and realizes them in an efficient way, tackling the two challenges in one shot. As one part of the framework, we introduce a two-step dimensionality reduction technique (a group de-overlap step followed by a screening step) to efficiently solve music rule realization. As the other part, we introduce a group-level generalization of the elastic net penalty [15] to weight the rules for a reasonable selection. The unified framework is formulated as a single bi-convex optimization problem (w.r.t. a probability variable and a weight variable) that coherently couples the two parts in a symmetric way. The symmetry is beneficial in both computation and interpretation. We run experiments on artificial rule sets to illustrate the operational characteristics of our model, and further test it on a real rule set that is exported from an automatic music theorist [11], demonstrating the model's selectivity in music rule realization at large scale.

Although music is the main case study in the paper, we formulate the problem in generality so the proposed framework is domain-agnostic and applicable anywhere there are rules (i.e. abstractions) to be understood. Detailed discussion at the end of the paper demonstrates that the framework applies directly to general real-world problems beyond music. In the discussion, we also emphasize how our algorithm is non-trivial, not just a simple combinatorial massaging of standard models. Therefore, the techniques introduced in this paper offer broader algorithmic takeaways and are worth further study in the future.

## 2 The Formalism: Abstraction, Realization, and Rule

**Abstraction and Realization** We restrict our attention to raw input spaces that are discrete and finite: $\mathcal{X} = \{x_1, \ldots, x_n\}$, and assume the raw data is drawn from a probability distribution $p_{\mathcal{X}}$, where the subscript refers to the sample space (not a random variable). We denote a high-level representation space (of $\mathcal{X}$) by a partition $\mathcal{A}$ (of $\mathcal{X}$) and its probability distribution by $p_{\mathcal{A}}$. Partitioning the raw input space gives one way of abstracting low-level details by grouping raw data into clusters and ignoring within-cluster variations.
Following this line of thought, we define an abstraction as the process $(\mathcal{X}, p_{\mathcal{X}}) \to (\mathcal{A}, p_{\mathcal{A}})$ for some high-level representation $\mathcal{A}$, where $p_{\mathcal{A}}$ is inferred from $p_{\mathcal{X}}$ by summing up the probability masses within each partition cluster. Conversely, we define a realization as the process $(\mathcal{A}, p_{\mathcal{A}}) \to (\mathcal{X}, p_{\mathcal{X}})$, where $p_{\mathcal{X}}$ is any probability distribution that infers $p_{\mathcal{A}}$.

**Probabilistic Compositional Rule** To put the formalism in the context of music, we first follow the convention [9] to approach a music piece as a sequence of sonorities (a generic term for chord) and view each moment in a composition as determining a sonority that fits the existing music context. If we let $\Omega$ be a finite collection of pitches specifying the discrete range of an instrument, e.g. the collection of the 88 keys on a piano, then a $k$-part sonority ($k$ simultaneously sounding pitches) is a point in $\Omega^k$. So $\mathcal{X} = \Omega^k$ is the raw input space containing all possible sonorities. Although discrete and finite, the raw input size is typically large, e.g. $|\mathcal{X}| = 88^4$ considering piano range and 4-part chorales. Therefore, theorists have invented various music parameters, such as quality and inversion, to abstract specific sonorities. In this paper, we inherit the approach in [11] to formalize a high-level representation of $\mathcal{X}$ by a feature-induced partition $\mathcal{A}$, and call the output of the corresponding abstraction $(\mathcal{A}, p_{\mathcal{A}})$ a probabilistic compositional rule.

**Probabilistic Rule System** The interrelation between abstraction and realization $(\mathcal{X}, p_{\mathcal{X}}) \leftrightarrow (\mathcal{A}, p_{\mathcal{A}})$ can be formalized by a linear equation $Ap = b$, where $A \in \{0, 1\}^{m \times n}$ represents a partition ($A_{ij} = 1$ if and only if $x_j$ is assigned to the $i$th cluster in the partition), and $p = p_{\mathcal{X}}$, $b = p_{\mathcal{A}}$ are probability distributions of the raw input space and the high-level representation space, respectively. In the sequel, we represent a rule by the pair $(A, b)$, so realizing this rule becomes solving the linear equation $Ap = b$. More interestingly, given a set of rules $(A^{(1)}, b^{(1)}), \ldots, (A^{(K)}, b^{(K)})$, the realization of all of them involves finding a $p$ such that $A^{(r)}p = b^{(r)}$ for all $r = 1, \ldots, K$. In this case, we form a probabilistic rule system by stacking all rules into one single linear system:

$$A = \begin{bmatrix} A^{(1)} \\ \vdots \\ A^{(K)} \end{bmatrix} \in \{0, 1\}^{m \times n}, \qquad b = \begin{bmatrix} b^{(1)} \\ \vdots \\ b^{(K)} \end{bmatrix} \in [0, 1]^{m}. \tag{1}$$

We call $A^{(r)}_{i,:}\, p = b^{(r)}_i$ a rule component, and $m_r = \dim(b^{(r)})$ the size (number of components) of a rule.

## 3 Unified Framework for Rule Realization and Selection

In this section, we detail a unified framework for simultaneous rule realization and selection. Recall that rules themselves can be inconsistent, e.g. rules learned from different music contexts can conflict. So given an inconsistent rule system, we can only achieve $Ap \approx b$. To best realize the possibly inconsistent rule system, we solve for $p \in \Delta^n$ by minimizing the error $\|Ap - b\|_2^2 = \sum_r \|A^{(r)}p - b^{(r)}\|_2^2$, the sum of the Brier scores from every individual rule. This objective does not differentiate rules (or their components) in the rule system, which typically yields a solution that satisfies all rules approximately and achieves a small error on average. This performance, though optimal in the averaged sense, is somewhat disappointing since most often no rule is satisfied exactly (error-free). Contrarily, a human composer would typically make a clear separation: follow some rules exactly and disregard others, even at the cost of a larger realization error. The decision made on rule selection usually manifests the style of a musician and is a higher-level intelligence that we aim for.
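To make this representation concrete, the following is a minimal numpy sketch of how a single rule $(A, b)$ arises from a partition (the abstraction sums probability masses within clusters, so $b = Ap_{\mathcal{X}}$) and how several rules stack into the system of Eq. (1). The function names, the toy partitions, and the uniform teacher distribution are illustrative assumptions, not artifacts from the paper.

```python
import numpy as np

def make_rule(labels, p_X):
    """Build one probabilistic rule (A, b) from a partition of the raw input space.
    labels[j] gives the cluster of x_j; b = A @ p_X sums the probability mass in
    each cluster, i.e. the abstraction (X, p_X) -> (A, p_A)."""
    clusters = np.unique(labels)
    A = np.vstack([(labels == c).astype(float) for c in clusters])
    b = A @ p_X
    return A, b

def stack_rules(rules):
    """Stack rules (A^(1), b^(1)), ..., (A^(K), b^(K)) into one system Ap = b (Eq. (1)),
    and record which rule each stacked component (row) belongs to."""
    A = np.vstack([Ar for Ar, _ in rules])
    b = np.concatenate([br for _, br in rules])
    sizes = [br.shape[0] for _, br in rules]                  # rule sizes m_r
    groups = np.repeat(np.arange(len(rules)), sizes)          # groups[i] = rule index of row i
    return A, b, groups

# Toy example: n = 6 raw inputs, a uniform "teacher" distribution, two partitions.
n = 6
p_teacher = np.full(n, 1.0 / n)
rule1 = make_rule(np.array([0, 0, 1, 1, 2, 2]), p_teacher)    # 3-cluster partition
rule2 = make_rule(np.array([0, 1, 0, 1, 0, 1]), p_teacher)    # 2-cluster partition
A, b, groups = stack_rules([rule1, rule2])
print(A.shape, b, groups)    # (5, 6), per-cluster masses, [0 0 0 1 1]
```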
In pursuit of this selectivity, we introduce a fine-grained set of weights $w \in \Delta^m$ to distinguish not only individual rules but also their components. The weights are estimates of relative importance, and are further leveraged for rule selection. This yields a weighted error, which is used herein to measure realization quality:

$$E(p, w; A, b) = (Ap - b)^\top \operatorname{diag}(w)\, (Ap - b). \tag{2}$$

If we revisit the two challenges mentioned in Sec. 1, we see that under the current setting, the first challenge concerns the curse of dimensionality for $p$, while the second concerns the selectivity for $w$. We introduce two penalty terms, one each for $p$ and $w$, to tackle the two challenges, and propose the following bi-convex optimization problem as the unified framework:

$$\begin{aligned}
\underset{p,\, w}{\text{minimize}} \quad & E(p, w; A, b) + \lambda_p P_p(p) + \lambda_w P_w(w) \\
\text{subject to} \quad & p \in \Delta^n, \; w \in \Delta^m.
\end{aligned} \tag{3}$$

Despite contrasting purposes, both penalty terms, $P_p(p)$ and $P_w(w)$, adopt the same high-level strategy of exploiting group structures in $p$ and $w$. Regarding the curse of dimensionality, we exploit the group structure of $p$ by grouping $p_j$ and $p_{j'}$ together if the $j$th and $j'$th columns of $A$ are identical, partitioning $p$'s coordinates into $K'$ groups $g'_1, \ldots, g'_{K'}$, where $K'$ is the number of distinct columns of $A$. This grouping strategy uses the fact that in a simplex-constrained linear system, we cannot determine the individual $p_j$'s within each group but only their sum. We later show (Sec. 4.1) that the resulting group structure of $p$ is essential in dimensionality reduction (when $K' \ll n$) and has a deeper interpretation regarding abstraction levels. Regarding the rule-level selectivity, we exploit the group structure of $w$ by grouping weights together if they are associated with the same rule, partitioning $w$'s coordinates into $K$ groups $g_1, \ldots, g_K$, where $K$ is the number of given rules. Based on the group structures of $p$ and $w$, we introduce their corresponding group penalties as follows:

$$P_p(p) = \|p_{g'_1}\|_1^2 + \cdots + \|p_{g'_{K'}}\|_1^2, \tag{4}$$

$$\tilde{P}_w(w) = \sqrt{m_1}\, \|w_{g_1}\|_2 + \cdots + \sqrt{m_K}\, \|w_{g_K}\|_2. \tag{5}$$

One can see the symmetry here: group penalty (4) on $p$ is a squared, unweighted $L_{2,1}$-norm, which is designed to secure a unique solution that favors more randomness in $p$ for the sake of diversity in sonority generation [9]; group penalty (5) on $w$ is a weighted $L_{1,2}$-norm (group lasso), which enables rule selection. However, there is a pitfall of the group lasso penalty when deployed in Problem (3): the problem has multiple global optima that are indefinite about the number of rules to pick (e.g. selecting one rule and selecting ten consistent rules are both optimal). To give more control over the number of selections, we finalize the penalty on $w$ as the group elastic net that blends a group lasso penalty and a ridge penalty:

$$P_w(w) = \alpha \tilde{P}_w(w) + (1 - \alpha) \|w\|_2^2, \quad 0 \le \alpha \le 1, \tag{6}$$

where $\alpha$ balances the trade-off between rule elimination (fewer rules) and selection (more rules).

**Model Interpretation** Problem (3) is a bi-convex problem: fixing $p$ it is convex in $w$; fixing $w$ it is convex in $p$. The symmetry between the two optimization variables further gives us the reciprocal interpretations of the rule realization and selection problem: given $p$, the music realization, we can analyze its style by computing $w$; given $w$, the music style, we can realize it by computing $p$ and further sample from it to obtain music that matches the style.
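For reference, here is a small numpy sketch that evaluates the pieces of Problem (3): the weighted error (2), the group penalties (4)-(6), and the combined objective. The helper names and the integer-label encoding of the groups are illustrative assumptions.

```python
import numpy as np

def weighted_error(p, w, A, b):
    """Weighted realization error E(p, w; A, b) of Eq. (2)."""
    r = A @ p - b
    return float(r @ (w * r))

def penalty_p(p, p_groups):
    """Squared, unweighted L_{2,1} penalty of Eq. (4): sum_k ||p_{g'_k}||_1^2.
    p_groups[j] is the de-overlap group of coordinate j, e.g.
    _, p_groups = np.unique(A, axis=1, return_inverse=True)."""
    return sum(np.abs(p[p_groups == k]).sum() ** 2 for k in np.unique(p_groups))

def penalty_w(w, w_groups, alpha):
    """Group elastic net of Eq. (6): alpha * weighted group lasso (Eq. (5)) plus
    (1 - alpha) * ridge, where w_groups[i] is the rule index of component i."""
    group_lasso = sum(np.sqrt(np.sum(w_groups == r)) * np.linalg.norm(w[w_groups == r])
                      for r in np.unique(w_groups))
    return alpha * group_lasso + (1.0 - alpha) * float(w @ w)

def objective(p, w, A, b, p_groups, w_groups, lam_p, lam_w, alpha):
    """Full bi-convex objective of Problem (3); the simplex constraints are handled by the solvers."""
    return (weighted_error(p, w, A, b)
            + lam_p * penalty_p(p, p_groups)
            + lam_w * penalty_w(w, w_groups, alpha))
```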
The roles of the hyperparameters $\lambda_p$ and $(\lambda_w, \alpha)$ are quite different. In setting $\lambda_p$ sufficiently small, we secure a unique solution for the rule realization part. For the rule selection part, on the other hand, what is more interesting is that adjusting $\lambda_w$ and $\alpha$ allows us to guide the overall composition towards different directions, e.g. conservative (fewer rules, obeyed strictly) versus liberal (more rules, obeyed loosely).

**Model Properties** We state two properties of the bi-convex problem (3) as the following theorems, whose proofs can be found in the supplementary material. Both theorems involve the notion of a group selective weight. We say $w \in \Delta^m$ is group selective if for every rule in the rule set, $w$ either drops it or selects it entirely, i.e. either $w_{g_r} = 0$ or $w_{g_r} > 0$ element-wise, for any $r = 1, \ldots, K$. For a group selective $w$, we further define $\operatorname{supp}_g(w)$ to be the selected rules, i.e. $\operatorname{supp}_g(w) = \{r \mid w_{g_r} > 0 \text{ element-wise}\} \subseteq \{1, \ldots, K\}$.

**Theorem 1.** Fix any $\lambda_p > 0$, $\alpha \in [0, 1]$. Let $(p^*(\lambda_w), w^*(\lambda_w))$ be a solution path to problem (3). (1) $w^*(\lambda_w)$ is group selective if $\lambda_w > 1/\alpha$. (2) $\|w^*_{g_r}(\lambda_w)\|_2 \to \sqrt{m_r}/m$ as $\lambda_w \to \infty$, for $r = 1, \ldots, K$.

**Theorem 2.** For $\lambda_p = 0$ and any $\lambda_w > 0$, $\alpha \in [0, 1]$, let $(p^*, w^*)$ be a solution to problem (3). We define $\mathcal{C} \subseteq 2^{\{1, \ldots, K\}}$ such that any $C \in \mathcal{C}$ is a consistent (error-free) subset of the given rule set. If $\operatorname{supp}_g(w^*) \in \mathcal{C}$, then $\sum_{r \in \operatorname{supp}_g(w^*)} m_r = \max\left\{ \textstyle\sum_{r \in C} m_r \mid C \in \mathcal{C} \right\}$.

Thm. 1 implies a useful range of the $\lambda_w$-solution path: if $\lambda_w$ is too large, $w^*$ will converge to a known value that always selects all the rules; if $\lambda_w$ is too small, $w^*$ can lose the guarantee of being group selective. This further suggests the termination criteria used later in the experiments. Thm. 2 considers rule selection in the consistent case, where the solution selects the largest number of rule components among all other consistent rule selections. Despite the condition $\lambda_p = 0$, in practice this theorem suggests one way of using the model for a small $\lambda_p$: if the primary interest is to select consistent rules, the model is guaranteed to pick as many rule components as possible (Sec. 5.1). Yet, a more interesting application is to slightly compromise consistency to achieve better selection (Sec. 5.2).

## 4 Alternating Solvers for Probability and Weight

It is natural to solve the bi-convex problem (3) by iteratively alternating the update of one optimization variable while fixing the other, yielding two alternating solvers.

### 4.1 The p-Solver: for Rule Realization

If we fix $w$, the optimization problem (3) boils down to:

$$\underset{p}{\text{minimize}} \;\; E(p, w; A, b) + \lambda_p P_p(p) \quad \text{subject to} \;\; p \in \Delta^n. \tag{7}$$

Making a change of variable $q_k = \mathbf{1}^\top p_{g'_k} = \|p_{g'_k}\|_1$ for $k = 1, \ldots, K'$ and letting $q = (q_1, \ldots, q_{K'})$, problem (7) is transformed to its reduced form:

$$\underset{q}{\text{minimize}} \;\; E(q, w; A', b) + \lambda_p \|q\|_2^2 \quad \text{subject to} \;\; q \in \Delta^{K'}, \tag{8}$$

where $A'$ is obtained from $A$ by removing its column duplicates. Problem (8) is a convex problem with a strictly convex objective, so it has a unique solution $q^*$. However, the solution to the original problem (7) may not be unique: any $p$ satisfying $q^*_k = \mathbf{1}^\top p_{g'_k}$ is a solution to (7). To favor a more random $p$ (as discussed in Sec. 3), we can uniquely determine $p^*$ by uniformly distributing the probability mass $q^*_k$ within the group $g'_k$: $p^*_{g'_k} = (q^*_k / \dim(p_{g'_k}))\, \mathbf{1}$, for $k = 1, \ldots, K'$.
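A prototype of the p-solver can follow (8) directly: deduplicate the columns of $A$ (the group de-overlap discussed next), solve the reduced simplex-constrained quadratic program with a generic solver, and spread each group mass uniformly. The sketch below is a minimal version that uses SciPy's SLSQP in place of a specialized routine and omits the screening step of Sec. 4.1; the function name and defaults are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def p_solver(A, b, w, lam_p):
    """Realize the (weighted) rule system: solve the reduced problem (8) over q,
    then spread each q_k uniformly within its de-overlap group to recover p."""
    # group de-overlap: coordinates j, j' of p are tied iff columns j, j' of A match
    A_dedup, group_of = np.unique(A, axis=1, return_inverse=True)   # A', group id per column
    K_prime = A_dedup.shape[1]

    def obj(q):
        r = A_dedup @ q - b
        return float(r @ (w * r)) + lam_p * float(q @ q)

    def grad(q):
        r = A_dedup @ q - b
        return 2.0 * A_dedup.T @ (w * r) + 2.0 * lam_p * q

    q0 = np.full(K_prime, 1.0 / K_prime)
    res = minimize(obj, q0, jac=grad, method="SLSQP",
                   bounds=[(0.0, 1.0)] * K_prime,
                   constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}])
    q_star = res.x

    # realization: distribute q_k uniformly inside group g'_k (favoring randomness)
    p = np.empty(A.shape[1])
    for k in range(K_prime):
        idx = np.where(group_of == k)[0]
        p[idx] = q_star[k] / len(idx)
    return p, q_star

# example: uniform weights over all rule components
# p_star, q_star = p_solver(A, b, w=np.full(A.shape[0], 1.0 / A.shape[0]), lam_p=1e-3)
```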
**Dimensionality Reduction: Group De-Overlap** Problem (7) is of dimension $n$, while its reduced form (8) is of dimension $K'$ ($\le n$), from which we can attain dimensionality reduction. In cases where $K' \ll n$, we have a huge speed-up for the p-solver; in other cases, there is still no harm in always running the p-solver from the reduced problem (8). Recall that we have achieved this type of dimensionality reduction by exploiting the group structure of $p$ purely from a computational perspective (Sec. 3). However, the resulting group structure has a deeper interpretation regarding abstraction levels, which is closely related to the concept of de-overlapping a family of groups, group de-overlap in short.

(Group De-Overlap) Let $\mathcal{G} = \{G_1, \ldots, G_m\}$ be a family of groups (a group is a non-empty set), and let $G = \cup_{i=1}^{m} G_i$. We introduce a group assignment function $g: G \to \{0, 1\}^m$, such that for any $x \in G$, $g(x)_i = \mathbf{1}\{x \in G_i\}$, and further introduce an equivalence relation $\sim$ on $G$: $x \sim x'$ if $g(x) = g(x')$. We then define the de-overlap of $\mathcal{G}$, another family of groups, by the quotient space

$$\operatorname{DeO}(\mathcal{G}) = \{G'_1, \ldots, G'_{m'}\} := G / \sim. \tag{9}$$

The idea of group de-overlap is simple (Fig. 1), and $\operatorname{DeO}(\mathcal{G})$ indeed comprises non-overlapping groups, since it is a partition of $G$ that equals the set of equivalence classes under $\sim$.

[Figure 1: An example of group de-overlap, mapping three overlapping groups $\mathcal{G} = \{G_1, G_2, G_3\}$ to the non-overlapping groups of $\operatorname{DeO}(\mathcal{G})$, each indexed by a distinct group assignment vector such as $(1,0,0), (0,1,0), \ldots, (1,1,1)$.]

Now given a set of rules $(A^{(1)}, b^{(1)}), \ldots, (A^{(K)}, b^{(K)})$, we denote their corresponding high-level representation spaces by $\mathcal{A}^{(1)}, \ldots, \mathcal{A}^{(K)}$, each of which is a partition of the raw input space $\mathcal{X}$ (Sec. 2). Let $\mathcal{G} = \cup_{k=1}^{K} \mathcal{A}^{(k)}$; then $\operatorname{DeO}(\mathcal{G})$ is a new partition, hence a new high-level representation space of $G = \mathcal{X}$, and is the finest (possibly tied) among all partitions $\mathcal{A}^{(1)}, \ldots, \mathcal{A}^{(K)}$. Therefore, $\operatorname{DeO}(\mathcal{G})$, as a summary of the rule system, delimits a lower bound on the level of abstraction produced by the given set of rules/abstractions. What coincides with $\operatorname{DeO}(\mathcal{G})$ is the group structure of $p$ (recall: $p_j$ and $p_{j'}$ are grouped together if the $j$th and $j'$th columns of $A$ are identical), since for any $x_j \in \mathcal{X}$, the $j$th column of $A$ is precisely the group assignment vector $g(x_j)$. Therefore, the decomposed solve step from $q^*$ to $p^*$ reflects the following realization chain:

$$\left\{ (\mathcal{A}^{(1)}, p_{\mathcal{A}^{(1)}}), \ldots, (\mathcal{A}^{(K)}, p_{\mathcal{A}^{(K)}}) \right\} \to (\operatorname{DeO}(\mathcal{G}), q^*) \to (\mathcal{X}, p_{\mathcal{X}}), \tag{10}$$

where the intermediate step not only computationally achieves dimensionality reduction, but also conceptually summarizes the given set of abstractions, which is further realized in the raw input space. Note that the σ-algebra of the probability space associated with (8) is precisely generated by $\operatorname{DeO}(\mathcal{G})$. When rules are inserted into a rule system sequentially (e.g. the growing rule set from an automatic music theorist), the successive solves of (8) are conducted along a σ-algebra path that forms a filtration: nested σ-algebras that lead to finer and finer delineations of the raw input space. In a pedagogical setting, the filtration reflects the iterative refinements of music composition from high-level principles that are taught step by step.

**Dimensionality Reduction: Screening** We propose an additional technique for further dimensionality reduction when solving the reduced problem (8). The idea is to perform screening, which quickly identifies the zero components in $q^*$ and removes them from the optimization problem. Leveraging DPC screening for the non-negative lasso [16], we introduce a screening strategy for solving a general simplex-constrained linear least-squares problem (one can check that problem (8) is indeed of this form):

$$\text{minimize} \;\; \|X\beta - y\|_2^2, \quad \text{subject to} \;\; \beta \ge 0, \; \|\beta\|_1 = 1. \tag{11}$$

We start with the following non-negative lasso problem, which is closely related to problem (11):

$$\text{minimize} \;\; \phi_\lambda(\beta) := \|X\beta - y\|_2^2 + \lambda \|\beta\|_1, \quad \text{subject to} \;\; \beta \ge 0, \tag{12}$$

and denote its solution by $\beta^*(\lambda)$.
One can show that if $\|\beta^*(\lambda^*)\|_1 = 1$, then $\beta^*(\lambda^*)$ is a solution to problem (11). Our screening strategy for problem (11) runs the DPC screening algorithm on the non-negative lasso problem (12), which applies a repeated screening rule (called EDPP) to solve a solution path specified by a $\lambda$-sequence: $\lambda_{\max} = \lambda_0 > \lambda_1 > \cdots$. The $\ell_1$-norms along the solution path are non-decreasing: $0 = \|\beta^*(\lambda_0)\|_1 \le \|\beta^*(\lambda_1)\|_1 \le \cdots$. We terminate the solution path at $\lambda_t$ if $\|\beta^*(\lambda_t)\|_1 \ge 1$ and $\|\beta^*(\lambda_{t-1})\|_1 < 1$. Our goal is to use $\beta^*(\lambda_t)$ to predict the zero components in $\beta^*(\lambda^*)$, a solution to problem (11). More specifically, we assume that the zero components in $\beta^*(\lambda_t)$ are also zero in $\beta^*(\lambda^*)$, hence we can remove those components from $\beta$ (and the corresponding columns of $X$) in problem (11) and reduce its dimensionality. While in practice this assumption is usually true provided that we have a delicate solution path, the monotonicity of $\beta^*(\lambda)$'s support along the solution path does not hold in general [17]. Nevertheless, the assumption does hold when $\|\beta^*(\lambda_t)\|_1 \to 1$, since the solution path is continuous and piecewise linear [18]. Therefore, we carefully design a solution path in the hope of a $\beta^*(\lambda_t)$ whose $\ell_1$-norm is close to 1 (e.g. let $\lambda_i = \gamma \lambda_{i-1}$ with a large $\gamma \in (0, 1)$, while more sophisticated designs are possible, such as a bisection search). To remedy the (rare) situations where $\beta^*(\lambda_t)$ predicts some incorrect zero components in $\beta^*(\lambda^*)$, one can always leverage the KKT conditions of problem (11) as a final check to correct those mis-predicted components [19]. Finally, note that the screening strategy may fail when the $\ell_1$-norms along the solution path converge to a value less than 1. In these cases we can never find a desired $\lambda_t$ with $\|\beta^*(\lambda_t)\|_1 \ge 1$. In theory, such failure can be avoided by a modified lasso problem, which in practice does not improve efficiency much (see the supplementary material).

### 4.2 The w-Solver: for Rule Selection

If we fix $p$, the optimization problem (3) boils down to:

$$\underset{w}{\text{minimize}} \;\; E(p, w; A, b) + \lambda_w P_w(w) \quad \text{subject to} \;\; w \in \Delta^m. \tag{13}$$

We solve problem (13) via ADMM [20]:

$$w^{(k+1)} = \arg\min_w \; e^\top w + \lambda_w P_w(w) + \frac{\rho}{2} \|w - z^{(k)} + u^{(k)}\|_2^2, \tag{14}$$

$$z^{(k+1)} = \arg\min_z \; I_{\Delta^m}(z) + \frac{\rho}{2} \|w^{(k+1)} - z + u^{(k)}\|_2^2, \tag{15}$$

$$u^{(k+1)} = u^{(k)} + w^{(k+1)} - z^{(k+1)}. \tag{16}$$

In the $w$-update (14), we introduce the error vector $e = (Ap - b)^2$ (element-wise square), and obtain a closed-form solution by a soft-thresholding procedure [21]: for $r = 1, \ldots, K$,

$$w^{(k+1)}_{g_r} = \left( 1 - \frac{\lambda_w \alpha \sqrt{m_r}}{(\rho + 2\lambda_w(1 - \alpha))\, \|\tilde{e}^{(k)}_{g_r}\|_2} \right)_{+} \tilde{e}^{(k)}_{g_r}, \quad \text{where} \;\; \tilde{e}^{(k)} = \frac{\rho(z^{(k)} - u^{(k)}) - e}{\rho + 2\lambda_w(1 - \alpha)}. \tag{17}$$

In the $z$-update (15), we introduce the indicator function $I_{\Delta^m}(z) = 0$ if $z \in \Delta^m$ and $\infty$ otherwise, and recognize the update as a (Euclidean) projection onto the probability simplex:

$$z^{(k+1)} = \Pi_{\Delta^m}(w^{(k+1)} + u^{(k)}), \tag{18}$$

which can be solved efficiently by a non-iterative method [22]. Given that ADMM enjoys a linear convergence rate in general [23] and the problem's dimension $m \ll n$, one execution of the w-solver is cheaper than that of the p-solver. Indeed, the result from the w-solver can speed up the subsequent execution of the p-solver, since we can leverage the zero components in $w$ to remove the corresponding rows of $A$, yielding additional savings in the group de-overlap of the p-solver.
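To make the w-solver concrete, here is a minimal numpy sketch of the ADMM iteration (14)-(16), with the group soft-thresholding step (17) and a sort-based simplex projection for (18) in the spirit of [22]. The function names, the fixed penalty parameter `rho`, the iteration count, and the uniform initialization are illustrative assumptions; the paper does not prescribe these choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex via sorting (cf. [22])."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[k]) / (k + 1.0)
    return np.maximum(v + theta, 0.0)

def w_solver(p, A, b, groups, lam_w, alpha, rho=1.0, n_iter=200):
    """ADMM for problem (13): w-update (17), z-update (18), dual update (16).
    groups[i] is the index of the rule that component i belongs to."""
    m = A.shape[0]
    e = (A @ p - b) ** 2                        # element-wise squared residuals
    denom = rho + 2.0 * lam_w * (1.0 - alpha)
    z = np.full(m, 1.0 / m)                     # start from uniform weights
    u = np.zeros(m)
    for _ in range(n_iter):
        e_tilde = (rho * (z - u) - e) / denom
        w = np.zeros(m)
        for r in np.unique(groups):             # group-wise soft-thresholding, Eq. (17)
            idx = groups == r
            norm_r = np.linalg.norm(e_tilde[idx])
            if norm_r > 0:
                shrink = 1.0 - lam_w * alpha * np.sqrt(idx.sum()) / (denom * norm_r)
                w[idx] = max(shrink, 0.0) * e_tilde[idx]
        z = project_simplex(w + u)              # Eq. (18)
        u = u + w - z                           # Eq. (16)
    return z                                    # z lies exactly on the probability simplex

# example: w_star = w_solver(p, A, b, groups, lam_w=1.0, alpha=0.8)
```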
## 5 Experiments

### 5.1 Artificial Rule Set

We generate two artificial rule sets, Case A1 and Case A2, both of which are derived from the same raw input space $\mathcal{X} = \{x_1, \ldots, x_n\}$ for $n = 600$, and comprise $K = 5$ rules. The rules in Case A1 are of size 80, 50, 60, 60, 60, respectively; the rules in Case A2 are of size 70, 50, 65, 65, 65, respectively. For both cases, rules 1&2 and rules 3&4 are the only two consistent sub rule sets of size 2. The main difference between the two cases is that in Case A1, rules 1&2 have a combined size of 130, which is larger than that of rules 3&4, whereas in Case A2 it is the opposite.

Under different settings of the hyperparameters $\lambda_w$ and $\alpha$, our model selects different rule combinations exhibiting unique personal styles. Tuning the blending factor $\alpha \in [0, 1]$ is relatively easy, since it is bounded and has a nice interpretation. Intuitively, if $\alpha \to 0$, the effect of the group lasso vanishes, yielding a solution $w^*$ that is not selective; if $\alpha \to 1$, the group elastic net penalty reduces to the group lasso, exposing the pitfall mentioned in Sec. 3. Experiments show that if we fix a small $\alpha$, the model picks either all five rules or none; if we fix a large $\alpha$, the group norms associated with each rule are highly unstable as $\lambda_w$ varies. Fortunately, in practice $\alpha$ has a wide middle range (typically between 0.4 and 0.9), within which all corresponding $\lambda_w$-solution paths look similar and perform stable rule selection. Therefore, for all experiments herein, we fix $\alpha = 0.8$ and study the behavior of the corresponding $\lambda_w$-solution path.

We show the $\lambda_w$-solution paths in Fig. 2. Along each path, we plot the group norms (top, one curve per rule) and the weighted errors (bottom). The former, formulated as $\|w^*_{g_r}(\lambda_w)\|_2$, describes the options for rule selection; the latter, formulated as $E(p^*(\lambda_w), w^*(\lambda_w); A, b)$, describes the quality of rule realization.

[Figure 2: The $\lambda_w$-solution paths obtained from the two artificial rule sets, (a) Case A1 and (b) Case A2, both with $\alpha = 0.8$. Each path is depicted by the trajectories of the group norms (top) and the trajectory of the weighted errors (bottom).]

To produce the trajectories, we start with a moderate $\lambda_w$ (e.g. $\lambda_w = 1$), and gradually increase and decrease its value to bi-directionally grow the curves. We terminate the descending direction when $w^*(\lambda_w)$ is not group selective and terminate the ascending direction when the group norms converge. Both terminations are indicated by Thm. 1, and work well in practice. As $\lambda_w$ grows, the model transitions its compositional behavior from a conservative style (sacrifice a number of rules for accuracy) towards a more liberal one (sacrifice accuracy for more rules). If we further focus on the $\lambda_w$'s that give us zero weighted error, Fig. 2a reveals rules 1&2, and Fig. 2b reveals rules 3&4, i.e. the largest consistent subset of the given rule set in both cases (Thm. 2).

Finally, we mention the efficiency of our algorithm. Averaged over several runs on multiple artificial rule sets of the same size, the run-time of our solver is 27.2 ± 5.5 seconds, while that of a generic solver (CVX) is 41.4 ± 3.8 seconds. We attribute the savings to the dimensionality reduction techniques introduced in Sec. 4.1, which will be more significant at large scale.

### 5.2 Real Compositional Rule Set

As a real-world application, we test our unified framework on rule sets from an automatic music theorist [11].
The auto-theorist teaches people to write 4-part chorales by providing personalized rules at every stage of composition. In this experiment, we exported a set of 16 compositional rules which aims to guide a student in writing the next sonority that follows well with the existing music content. Each voice in a chorale is drawn from $\Omega = \{R, G1, \ldots, C6\}$, which includes the rest (R) and 54 pitches (G1 to C6) from the human vocal range. The resulting raw input space $\mathcal{X} = \Omega^4$ consists of $n = 55^4 \approx 10^7$ sonorities, whose distribution lives in a very high-dimensional simplex. This curse of dimensionality typically prevents most generic solvers from obtaining an acceptable solution within a reasonable amount of time.

We show the $\lambda_w$-solution path associated with this rule set in Fig. 3. Again, the general trend shows the same pattern here: the model turns towards a more liberal style (more rules but less accurate) as $\lambda_w$ increases.

[Figure 3: The $\lambda_w$-solution path obtained from a real compositional rule set (group-norm trajectories for rules 3, 6, 9, 10, and others), summarized by the selections in the table below.]

| $\log_2(\lambda_w)$ | selected rule set | # of rules | # of rule components |
|---|---|---|---|
| [-12, -6] | {10} | 1 | 1540 |
| [-5, -2] | {3, 6, 10} | 3 | 1699 |
| [-1, 0] | {3, 6, 9, 10} | 4 | 2154 |
| 1 | {3, 6, 8, 9, 10, 11, 13} | 7 | 2166 |
| 2 | {1, 3, 7, 9, 10, 11, 13} | 7 | 2312 |
| 3 | all | 16 | 2417 |

Along the solution path, we also observe that the consistent range (i.e. the error-free zone) is wider than that in the artificial cases. This is intuitive, since a real rule set should be largely consistent with minor contradictions; otherwise it would confuse the student and lose its pedagogical purpose. A more interesting phenomenon occurs when the model is about to leave the error-free zone. When $\log_2(\lambda_w)$ goes from 1 to 2, the combined size of the selected rules increases from 2166 to 2312, but the realization error increases only a little. Would sacrificing this tiny error be a smarter decision to make? The difference between the selected rules at these two moments shows that rules 1 and 7 were added into the selection at $\log_2(\lambda_w) = 2$, replacing rules 6 and 8. Rule 1 is about the bass line, while rule 6 is about the tenor voice. It is known in music theory that the outer voices (soprano and bass) are more characteristic and also more identifiable than the inner voices (alto and tenor), which typically stay more or less stationary as background voices. So it is understandable that although larger variety in the bass increases the opportunity for inconsistency (in this case not too much), it is a more important rule to keep. Rule 7 is about the interval between soprano and tenor, while rule 8 describes a small feature between the upper two voices that does not yet have a meaning in music theory. So unlike rule 7, which brings up the important concept of voicing (i.e. classifying a sonority into open/closed/neutral position), rule 8 could simply be a miscellaneous artifact. To conclude, in this particular example, we would argue that the rule selection that happens at $\log_2(\lambda_w) = 2$ is a better decision, in which case the model makes a good compromise on exact consistency.

To compare a selective rule realization with its non-selective counterpart [11], we plot the errors $\|A^{(r)}p - b^{(r)}\|_2$ for each rule $r = 1, \ldots, 16$ as histograms in Fig. 4. The non-selective realization takes all rules into consideration with equal importance, which turns out to be a degenerate case along our model's solution path for $\log_2(\lambda_w) \to \infty$. This realization yields a well-balanced solution but no rules are satisfied exactly.
In contrast, a selective realization (e.g. $\log_2(\lambda_w) = 1$) gives near-zero errors on the selected rules, producing more human-like compositional decisions.

[Figure 4: Comparison between a selective rule realization ($\log_2(\lambda_w) = 1$) and its non-selective counterpart: per-rule error histograms over rules 1-16, (a) selective and (b) non-selective. The boldfaced x-tick labels designate the indices of the selected rules.]

## 6 Discussion

**Generality of the Framework** The formalism of abstraction and realization in Sec. 2, as well as the unified framework for simultaneous rule realization and selection in Sec. 3, is general and domain-agnostic, not specific to music. The problem formulation as a bi-convex problem (3) admits numerous real-world applications that can be cast as (quasi-)linear systems, possibly equipped with some group structure. For instance, many problems in the physical sciences involve estimating unknowns $x$ from their observations $y$ via a linear (or linearized) equation $y = Ax$ [24], where a grouping of the $y_i$'s (say, from a single sensor or sensor type) itself summarizes $x$ as a rule/abstraction. In general, the observations are noisy and inconsistent due to errors from the measuring devices or even the failure of a sensor. It is then necessary to assign a different reliability weight to every individual sensor reading, and to ask for a selective algorithm that realizes the readings while respecting the group structure. So in cases where some devices fail and give inconsistent readings, we can run the proposed algorithm to filter them out.

**Linearity versus Expressiveness** The linearity with respect to $p$ in the rule system $Ap = b$ results directly from adopting the probability-space representation. However, this does not imply that the underlying domain (e.g. music) is as simple as linear. In fact, the abstraction process can be highly nonlinear, involving hierarchical partitioning of the input space [11]. So, instead of running the risk of losing expressiveness, the linear equation $Ap = b$ hides the model complexity in the $A$ matrix. On the other hand, the linearity with respect to $w$ in the bi-convex objective (3) is a design choice. We start with a simple linear model to represent relative importance for the sake of interpretability, which may sacrifice the model's expressiveness like other classic linear models. To push the boundary of this trade-off in the future, we will pursue more expressive models without compromising (practically important) interpretability.

**Differences from (Group) Lasso** Component-wise, both subproblems (7) and (13) of the unified framework look similar to regular feature selection settings such as the lasso [25] and the group lasso [26]. However, not only does the strong coupling between the two subproblems exhibit new properties (Thm. 1 and 2), but the differences in the formulation also present unique algorithmic challenges. First, the weighted error term (2) in the objective is in stark contrast with the regular regression formulation, where the (group) lasso is paired with least-squares or other similar loss functions. Whereas dropping features in a regression model typically increases training loss (under-fitting), dropping rules, on the contrary, helps drive the error to zero, since a smaller rule set is more likely to achieve consensus. Hence, the tendency to drop rules in a regular (group) lasso works against the pursuit of a largest consistent rule set as desired. This stresses the necessity of a more carefully designed penalty like our proposed group elastic net.
Second, the additional simplex constraint weakens the grouping property of the group lasso: failures in group selection (i.e. there exists a rule that is not entirely selected) are observed for small $\lambda_w$'s. The simplex constraint, effectively an $\ell_1$ constraint, also incurs an "$\ell_1$ cancellation", which nullifies a simple lasso (also an $\ell_1$ penalty) on a simple parameterization of the rules (one weight per rule). These differences pose new model behaviors and deserve further study.

**Local Convergence** We solve the bi-convex problem (3) via alternating minimization, in which the algorithm decreases the non-negative objective in every iteration and thus assures its convergence. Nevertheless, neither a global optimum nor convergence in the solution can be guaranteed. The former leaves the local convergence susceptible to different initializations, demanding further improvements through techniques such as random starts and noisy updates. The latter leaves the possibility for the optimization variables to enter a limit cycle. However, we consider this an advantage, especially in music, where one prefers multiple realizations and interpretations that are equally optimal.

**More Microscopic Views** The weighting scheme in this paper presents the rule selection problem in a most general setting, where a different weight is assigned to every rule component. Hence, we can study the relative importance not only between rules via the group norms $\|w_{g_r}\|_2$, but also within every single rule. The former compares compositional rules at a macroscopic level, e.g. restricting to a diatonic scale is more important than avoiding parallel octaves; the latter does so at a microscopic level, e.g. changing the probability mass within a diatonic scale creates variety in modes: think about C major versus A minor. We can further study the rule system microscopically by sharing weights of the same component but from different rules, yielding an overlapping group elastic net.

## References

[1] K. Lewin, Field Theory in Social Science. Harpers, 1951.
[2] J. Skorstad, D. Gentner, and D. Medin, "Abstraction processes during concept learning: A structural view," in Proc. 10th Annu. Conf. Cognitive Sci. Soc., 1988, pp. 419–425.
[3] K. Haase, "Discovery systems: From AM to CYRANO," MIT AI Lab Working Paper 293, 1987.
[4] A. M. Barry, Visual Intelligence: Perception, Image, and Manipulation in Visual Communication. SUNY Press, 1997.
[5] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[6] Y. Bengio, "Deep learning of representations: Looking forward," in Proc. Int. Conf. Stat. Lang. and Speech Process., 2013, pp. 1–37.
[7] J. J. Fux, Gradus ad Parnassum. Johann Peter van Ghelen, 1725.
[8] H. Schenker, Kontrapunkt. Universal-Edition A.G., 1922.
[9] H. Yu, L. R. Varshney, G. E. Garnett, and R. Kumar, "MUS-ROVER: A self-learning system for musical compositional rules," in Proc. 4th Int. Workshop Music. Metacreation (MUME 2016), 2016.
[10] H. Yu, L. R. Varshney, G. E. Garnett, and R. Kumar, "Learning interpretable musical compositional rules and traces," in Proc. 2016 ICML Workshop Hum. Interpret. Mach. Learn. (WHI 2016), 2016.
[11] H. Yu and L. R. Varshney, "Towards deep interpretability (MUS-ROVER II): Learning hierarchical representations of tonal music," in Proc. 5th Int. Conf. Learn. Represent. (ICLR 2017), 2017.
[12] D. Cope, "An expert system for computer-assisted composition," Comput. Music J., vol. 11, no. 4, pp. 30–46, 1987.
[13] K. Ebcioğlu, "An expert system for harmonizing four-part chorales," Comput. Music J., vol. 12, no. 3, pp. 43–51, 1988.
[14] J. R. Pierce and M. E. Shannon, "Composing music by a stochastic process," Bell Telephone Laboratories, Technical Memorandum MM-49-150-29, Nov. 1949.
[15] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Stat. Soc. Ser. B. Methodol., vol. 67, no. 2, pp. 301–320, 2005.
[16] J. Wang and J. Ye, "Two-layer feature reduction for sparse-group lasso via decomposition of convex sets," in Proc. 28th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2132–2140.
[17] T. Hastie, J. Taylor, R. Tibshirani, and G. Walther, "Forward stagewise regression and the monotone lasso," Electron. J. Stat., vol. 1, pp. 1–29, 2007.
[18] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Stat., vol. 32, no. 2, pp. 407–499, 2004.
[19] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani, "Strong rules for discarding predictors in lasso-type problems," J. R. Stat. Soc. Ser. B. Methodol., vol. 74, no. 2, pp. 245–266, 2012.
[20] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[21] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. R. Stat. Soc. Ser. B. Methodol., vol. 68, no. 1, pp. 49–67, 2006.
[22] W. Wang and M. A. Carreira-Perpiñán, "Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application," arXiv:1309.1541 [cs.LG], 2013.
[23] M. Hong and Z.-Q. Luo, "On the linear convergence of the alternating direction method of multipliers," Math. Program., pp. 1–35, 2012.
[24] D. D. Jackson, "Interpretation of inaccurate, insufficient and inconsistent data," Geophys. J. Int., vol. 28, no. 2, pp. 97–109, 1972.
[25] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B. Methodol., pp. 267–288, 1996.
[26] J. Friedman, T. Hastie, and R. Tibshirani, "A note on the group lasso and a sparse group lasso," 2010.