# Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression

Journal of Machine Learning Research 18 (2018) 1-33. Submitted 7/17; Revised 12/17; Published 4/18.

Quan Zhang (quan.zhang@mccombs.utexas.edu) and Mingyuan Zhou (mingyuan.zhou@mccombs.utexas.edu), Department of Information, Risk, and Operations Management, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA.

Editor: David M. Blei

Abstract: To model categorical response variables given their covariates, we propose a permuted and augmented stick-breaking (paSB) construction that one-to-one maps the observed categories to randomly permuted latent sticks. This new construction transforms multinomial regression into regression analysis of stick-specific binary random variables that are mutually independent given their covariate-dependent stick success probabilities, which are parameterized by the regression coefficients of their corresponding categories. The paSB construction allows transforming an arbitrary cross-entropy-loss binary classifier into a Bayesian multinomial one. Specifically, we parameterize the negative logarithms of the stick failure probabilities with a family of covariate-dependent softplus functions to construct nonparametric Bayesian multinomial softplus regression, and transform the Bayesian support vector machine (SVM) into a Bayesian multinomial SVM. These Bayesian multinomial regression models are not only capable of providing probability estimates, quantifying uncertainty, increasing robustness, and producing nonlinear classification decision boundaries, but are also amenable to posterior simulation. Example results demonstrate their attractive properties and performance.

Keywords: Discrete choice models, logistic regression, nonlinear classification, softplus regression, support vector machines

## 1. Introduction

Inferring the functional relationship between a categorical response variable and its covariates is a fundamental problem in the physical and social sciences. To address this problem, it is common to use either multinomial logistic regression (MLR) (McFadden, 1973; Greene, 2003; Train, 2009) or multinomial probit regression (Albert and Chib, 1993; McCulloch and Rossi, 1994; McCulloch et al., 2000; Imai and van Dyk, 2005), both of which can be expressed as a latent-utility-maximization model that lets an individual make a decision by comparing its random utilities across all categories at once. In this paper, we address the problem via a new stick-breaking construction of the multinomial distribution, which defines a one-to-one random mapping between the category and stick indices. Rather than assuming an individual compares its random utilities across all categories at once, we assume an individual makes a sequence of stick-specific binary random decisions. The choice of the individual is the category mapped to the stick that is the first to choose 1, or the category mapped to stick $S$ if all of the first $S-1$ sticks choose 0. This framework transforms regression analysis of categorical variables into the problem of inferring the one-to-one mapping between the category and stick indices, and performing regression analysis of binary stick-specific random variables.
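To fix ideas, the following is a minimal sketch (ours, not from the paper) of the decision mechanism just described; the function and variable names are hypothetical.

```python
def choice_from_stick_decisions(stick_decisions, stick_to_category):
    """Return the chosen category given binary decisions of sticks 1..S-1.

    stick_decisions[j] is the 0/1 decision of stick j+1 (S-1 decisions);
    stick_to_category[j] is the category mapped to stick j+1.
    The choice is the category mapped to the first stick that decides 1,
    or the category mapped to stick S if the first S-1 sticks all decide 0.
    """
    for j, b in enumerate(stick_decisions):
        if b == 1:
            return stick_to_category[j]
    return stick_to_category[-1]   # stick S is reached

# toy usage with S = 4: sticks 1..3 decide (0, 1, .), so stick 2's category wins
print(choice_from_stick_decisions([0, 1, 0], stick_to_category=[3, 1, 4, 2]))  # -> 1
```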
Both MLR and the proposed stick-breaking models link a categorical response variable to its covariate-dependent probability parameters. While MLR is invariant to the permutation of category labels, the proposed stick-breaking models, given a fixed category-stick mapping, purposely destroy that invariance. We are motivated to introduce this new framework for discrete choice modeling mainly to facilitate efficient Bayesian inference via data augmentation, to introduce nonlinear decision boundaries, and to relax a well-recognized restrictive model assumption of MLR, as described below.

An important motivation is to extend the efficient Bayesian inference available for binary regression to the multinomial case. In the proposed stick-breaking models, the binary stick-specific random variables of an individual are conditionally independent given their stick-specific covariate-dependent probabilities. Under this setting, one can solve a multinomial regression by solving conditionally independent binary ones. The only requirement is that the underlying binary regression model uses the cross-entropy loss. In other words, we require each stick-specific binary random variable to be linked via the Bernoulli distribution to its corresponding stick-specific covariate-dependent probability parameter. Another important motivation is to improve the model capacity of MLR, which is a linear classifier in the sense that if the total number of categories is $S$, then MLR uses the intersection of $S-1$ linear hyperplanes to separate one class from the others. By choosing nonlinear binary regression models, we are able to enhance the capacities of the proposed stick-breaking models. We are also motivated to relax the independence of irrelevant alternatives (IIA) assumption, an inherent property of MLR that requires the probability ratio of any two choices to be independent of the presence or characteristics of any other alternatives (McFadden, 1973; Greene, 2003; Train, 2009). By contrast, the proposed stick-breaking models make the probability ratio of two choices depend on other alternatives, as long as the two sticks that the choices are mapped to are not next to each other.

In light of these considerations, we first extend the softplus regressions recently proposed in Zhou (2016), a family of cross-entropy-loss binary classifiers that can introduce nonlinear decision boundaries and recover logistic regression as a special case, to construct Bayesian multinomial softplus regressions (MSRs). We then consider a multinomial generalization of the widely used support vector machine (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999; Cristianini and Shawe-Taylor, 2000), a max-margin binary classifier that uses the hinge loss. While there has been significant effort in extending binary SVMs to multinomial ones (Crammer and Singer, 2002; Lee et al., 2004; Liu and Yuan, 2011), the resulting extensions typically only provide predictions of deterministic class labels. By contrast, we extend the Bayesian binary SVMs of Sollich (2002) and Mallick et al. (2005) under the proposed framework to construct Bayesian multinomial SVMs (MSVMs), which naturally provide predictive class probabilities.
We will show that the proposed Bayesian MSRs and MSVMs, which all generalize the stick-breaking construction to perform Bayesian multinomial regression, are not only capable of placing nonlinear decision boundaries between different categories, but are also amenable to posterior simulation via data augmentation. Another attractive feature shared by all these proposed Bayesian algorithms is that they can not only predict class probabilities but also quantify model uncertainty. In addition, we will show that robit regression, a robust cross-entropy-loss binary classifier proposed in Liu (2004), can be extended into a robust Bayesian multinomial classifier under the proposed stick-breaking construction.

The remainder of the paper is organized as follows. In Section 2 we briefly review MLR and discuss the restrictions of its stick-breaking construction. In Section 3 we propose the permuted and augmented stick breaking (paSB) to construct Bayesian multi-class classifiers, present the inference, and show how the IIA assumption is relaxed. Under the paSB framework, we show how to transform softplus regressions and support vector machines into Bayesian multinomial regression models in Sections 4 and 5, respectively. We provide experimental results in Section 6 and conclude the paper in Section 7.

## 2. Multinomial Logistic Regression and Stick Breaking

In this section we first briefly review multinomial logistic regression (MLR). We then use the stick-breaking construction to show how to generate a categorical random variable as a sequence of dependent binary variables, and further discuss a naive approach to transform binary logistic regression under stick breaking into multinomial regression. In the following discussion, we use $i \in \{1, \ldots, N\}$ to index the individual/observation, $s \in \{1, \ldots, S\}$ to index the choice/category, and the prime symbol to denote the transpose operation.

### 2.1 Multinomial Logistic Regression

MLR, which parameterizes the probability of each category given the covariates as
$$P(y_i = s \mid x_i, \{\beta_s\}_{1,S}) = p_{is}, \qquad p_{is} = \frac{e^{x_i'\beta_s}}{\sum_{j=1}^{S} e^{x_i'\beta_j}}, \qquad (1)$$
is widely used, where $x_i \in \mathbb{R}^{P+1}$ consists of $x_{i1} = 1$ and $P$ covariates, and $\beta_s \in \mathbb{R}^{P+1}$ consists of the regression coefficients for the $s$th category (McCullagh and Nelder, 1989; Albert and Chib, 1993; Holmes and Held, 2006). Without loss of generality, one may choose category $S$ as the reference category by setting all the elements of $\beta_S$ to 0, making $e^{x_i'\beta_S} = 1$ almost surely (a.s.). For MLR, if data point $i$ is assigned to the category with the largest $p_{is}$, then one may consider that category $s$ resides within a convex polytope (Grünbaum, 2013), defined by the set of solutions to the $S-1$ inequalities $x'(\beta_j - \beta_s) \le 0$, where $j \in \{1, \ldots, s-1, s+1, \ldots, S\}$.

Despite its popularity, MLR is a linear classifier in the sense that it uses the intersection of $S-1$ linear hyperplanes to separate one class from the others. As a classical discrete choice model in econometrics, it makes the independence of irrelevant alternatives (IIA) assumption, implying that the unobserved factors for choice making are both uncorrelated and have the same variance across all alternatives (McFadden, 1973; Train, 2009). Moreover, while its log-likelihood is convex and there are efficient iterative algorithms to find the maximum likelihood or maximum a posteriori solutions of $\beta_s$, the absence of conjugate priors on $\beta_s$ makes it difficult to derive efficient Bayesian inference.
For Bayesian inference, Polson et al. (2013) have introduced the Pólya-Gamma data augmentation for logit models and combined it with the data augmentation technique of Holmes and Held (2006) for the multinomial likelihood to develop a Gibbs sampling algorithm for MLR. This algorithm, however, has to update $\beta_s$ one at a time while conditioning on all $\beta_j$ for $j \neq s$. Thus it may not only lead to slow convergence and mixing, especially when the number of categories $S$ is large, but also prevent us from parallelizing the sampling of $\{\beta_s\}_{1,S}$ within each MCMC iteration.

### 2.2 Stick Breaking

Suppose $y_i$ is a random variable drawn from a categorical distribution with a finite vector of probability parameters $(p_{i1}, \ldots, p_{iS})$, where $S < \infty$, $p_{is} \ge 0$, and $\sum_{s=1}^{S} p_{is} = 1$. Instead of directly using $y_i \sim \sum_{s=1}^{S} p_{is}\delta_s$, one may consider generating $y_i$ using the multinomial stick-breaking construction that sequentially draws binary random variables $b_{is}$ given $\{b_{ij}\}_{j<s}$, setting $y_i$ to the index of the first stick with $b_{is} = 1$, or to $S$ if the first $S-1$ sticks all equal 0. A naive approach to multinomial regression is to let the stick success probabilities be those of binary logistic regressions, $\pi_{is} = 1/(1 + e^{-x_i'\beta_s})$. Under this logistic stick-breaking construction, however, the category probabilities are subject to geometric constraints; for example, $p_{i1} = (1 + e^{-x_i'\beta_1})^{-1}$ can be larger than 50% only if $x_i'\beta_1 > 0$, while $p_{i2} = (1 + e^{x_i'\beta_1})^{-1}(1 + e^{-x_i'\beta_2})^{-1}$ can be larger than 50% only if both $x_i'\beta_1 < 0$ and $x_i'\beta_2 > 0$. We will use an example to illustrate this type of geometric constraint in Section 6.1.

Under the logistic stick-breaking construction, not only could the performance be sensitive to how the $S$ different categories are ordered, but the imposed geometric constraints could also be overly restrictive even if the categories are appropriately ordered. Below we address the first issue by introducing a permuted and augmented stick-breaking representation for a multinomial model, and the second issue by adding the ability to model nonlinearity.

## 3. Permuted and Augmented Stick Breaking

To turn the seemingly undesirable sensitivity of the stick-breaking construction to label permutation into a favorable model property when label asymmetry is desired, and to mitigate performance degradation when label symmetry is desired, we introduce a permuted and augmented stick-breaking (paSB) construction for a multinomial distribution, making it straightforward to extend an arbitrary binary classifier with cross-entropy loss into a Bayesian multinomial one. The paSB construction infers a one-to-one mapping between the labels of the $S$ categories and the indices of the $S$ latent sticks, transforming the problem from modeling a multinomial random variable into modeling $S$ conditionally independent binary ones. It not only allows for parallel computation within each MCMC iteration, but also improves the mixing of MCMC in comparison to the algorithm used in Polson et al. (2013), which updates one regression-coefficient vector conditioning on all the others, as will be shown in Section 6.5. Note that the number of distinct one-to-one label-stick mappings is $S!$, which quickly becomes too large to exhaustively search for the best mapping as $S$ increases. Our experiments will show that the proposed MCMC algorithm can quickly escape from a purposely poorly initialized mapping and subsequently switch between many different mappings that all lead to similar performance, suggesting an effective search space that is considerably smaller than $S!$.

### 3.1 Category-Stick Mapping and Data Augmentation

The proposed paSB construction randomly maps a category to one and only one of the $S$ latent sticks and makes the augmented Bernoulli random variables $\{b_{is}\}_{1,S}$ conditionally independent of each other given $\{\pi_{is}\}_{1,S}$. Denote $z = (z_1, \ldots, z_S)$ as a permutation of $(1, \ldots, S)$, where $z_s \in \{1, \ldots, S\}$ is the index of the stick that category $s$ is mapped to.
Given the label-stick mapping $z$, let us denote $p_{is}(z)$ as the multinomial probability of category $s$, and $\pi_{iz_s}(x_i, \beta_s)$ as the covariate-dependent stick probability associated with the covariates of observation $i$ and the stick that category $s$ is mapped to. For notational convenience, we will write $\pi_{iz_s}(x_i, \beta_s)$ as $\pi_{iz_s}$ and $\pi_{ij}(x_i, \beta_{s:z_s=j})$ as $\pi_{ij}$. We emphasize that here the $s$th regression-coefficient vector $\beta_s$ is always associated with both category $s$ and the corresponding stick probability $\pi_{iz_s}$, a construction that will facilitate the inference of the label-stick mapping $z$. The following theorem shows how to generate a categorical random variable of $S$ categories with a set of $S$ conditionally independent Bernoulli random variables. This is key to transforming the problem from solving a multinomial regression into solving $S$ binary regressions independently.

**Theorem 1** Suppose $y_i \sim \sum_{s=1}^{S} p_{is}(z)\,\delta_s$, where $[p_{i1}(z), \ldots, p_{iS}(z)]$ is a multinomial probability vector whose elements are constructed as
$$p_{is}(z) = (\pi_{iz_s})^{\mathbf{1}(z_s \neq S)} \prod_{j < z_s} (1 - \pi_{ij}). \qquad (4)$$
Then $y_i$ can be equivalently generated from $S$ stick-specific binary random variables that are conditionally independent given $\{\pi_{ij}\}_{1,S}$,
$$b_{ij} \sim \mathrm{Bernoulli}(\pi_{ij}), \quad j = 1, \ldots, S. \qquad (5)$$

In comparison to the stick-breaking construction in Section 2.2, paSB permutes the sticks via the mapping $z$ and augments the representation with the binary variables $b_{ij}$ for $j > z_{y_i}$, whose conditional posteriors given $y_i$ and $\pi_{ij}$ remain the same as their priors. These changes are key to appropriately ordering the latent sticks, more flexibly parameterizing $\pi_{iz_s}$ and hence $p_{is}(z)$, and maintaining tractable inference. With paSB, the problem of inferring the functional relationship between the categorical response $y_i$ and the corresponding covariates $x_i$ is now transformed into the problem of modeling $S$ conditionally independent binary regressions,
$$b_{iz_s} \mid x_i, \beta_s \sim \mathrm{Bernoulli}[\pi_{iz_s}(x_i, \beta_s)], \quad i = 1, \ldots, N, \; s = 1, \ldots, S.$$
Note that the only requirement for the binary regression model under paSB is that it uses the Bernoulli likelihood. In other words, it uses the cross-entropy loss (Murphy, 2012),
$$-\sum_{i=1}^{N} \ln P(b_{iz_s} \mid x_i, \beta_s) = -\sum_{i=1}^{N} \Bigl\{ b_{iz_s} \ln \pi_{iz_s}(x_i, \beta_s) + (1 - b_{iz_s}) \ln[1 - \pi_{iz_s}(x_i, \beta_s)] \Bigr\}.$$

A basic choice is paSB logistic regression, which lets $\pi_{iz_s}(x_i, \beta_s) = 1/(1 + e^{-x_i'\beta_s})$ and becomes the same as the logistic stick-breaking construction described in Section 2.2 if $z_s = s$ for all $s \in \{1, \ldots, S\}$. Another choice is paSB-robit regression, which extends the robit regression of Liu (2004), a robust binary classifier using the cross-entropy loss, into a robust Bayesian multinomial classifier. In robit regression, observation $i$ is labeled as 1 if $x_i'\beta + \varepsilon_i > 0$ and as 0 otherwise, where the $\varepsilon_i$ are independently drawn from a t-distribution with $\kappa$ degrees of freedom, denoted as $\varepsilon_i \overset{iid}{\sim} t_\kappa$. Consequently, the conditional class probability function of robit regression is $P(y_i = 1 \mid x_i, \beta) = F_\kappa(x_i'\beta)$, where $F_\kappa$ is the cumulative distribution function of $t_\kappa$. The robustness is attributed to the heavy-tail property of $F_\kappa(x_i'\beta)$, which, if $\kappa < 7$, imposes less penalty than the conditional class probability function of logistic regression does on misclassified observations that are far from the decision boundary. Applying Theorem 1, the category probability of paSB-robit regression with $\kappa$ degrees of freedom is given by (4), with $\pi_{iz_s}(x_i, \beta_s) = F_\kappa(x_i'\beta_s)$. paSB-robit regression provides a simple solution to robust multiclass classification: with $\{b_{ij}\}_{i,j}$ defined in Theorem 1, we run independent binary robit regressions using the Gibbs sampler proposed in Liu (2004). In addition to paSB, we define permuted and augmented reverse stick breaking (parSB) in the following corollary.

**Corollary 2** Suppose $y_i \sim \sum_{s=1}^{S} p_{is}(z)\,\delta_s$ and
$$p_{is}(z) = (1 - \pi_{iz_s})^{\mathbf{1}(z_s \neq S)} \prod_{j < z_s} \pi_{ij}.$$
Then $y_i$ can again be equivalently generated from the $S$ conditionally independent Bernoulli random variables in (5).

Under paSB, given $y_i$, $\{\pi_{ij}\}_{1,S}$, and $z$, the augmented binary variables can be sampled as
$$(b_{ij} \mid y_i, \pi_{ij}, z) \sim \mathbf{1}(j = z_{y_i})\,\delta_1 + \mathbf{1}(j < z_{y_i})\,\delta_0 + \mathbf{1}(j > z_{y_i})\,\mathrm{Bernoulli}(\pi_{ij}), \quad j = 1, \ldots, S-1,$$
and $b_{iS} = \mathbf{1}(z_{y_i} = S)$.
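To make the construction concrete, here is a minimal sketch (function and variable names are ours) of computing the category probabilities of Theorem 1 and Corollary 2 from stick probabilities and a category-stick mapping, together with the logistic and robit choices of stick probability.

```python
import numpy as np
from scipy.stats import t as student_t

def logistic_stick_prob(x, beta):
    """pi = 1 / (1 + exp(-x'beta)), the paSB logistic-regression choice."""
    return 1.0 / (1.0 + np.exp(-x @ beta))

def robit_stick_prob(x, beta, kappa=6):
    """pi = F_kappa(x'beta), the paSB-robit choice (Student-t CDF)."""
    return student_t.cdf(x @ beta, df=kappa)

def category_probs(pi, z, reverse=False):
    """Category probabilities p_s(z) from stick probabilities pi[0..S-1]
    (stick S is the reference; pi[S-1] is unused) and a permutation z,
    where z[s] is the 1-based stick index that category s is mapped to.
    reverse=False gives paSB (Theorem 1); reverse=True gives parSB (Corollary 2)."""
    S = len(pi)
    succ = (1.0 - pi) if reverse else pi          # "success" term of a stick
    fail = pi if reverse else (1.0 - pi)          # "failure" term of a stick
    p = np.empty(S)
    for s in range(S):
        m = z[s]                                  # stick index in 1..S
        head = succ[m - 1] if m != S else 1.0     # exponent 1(z_s != S)
        p[s] = head * np.prod(fail[:m - 1])
    return p

# toy check that the probabilities sum to one under a random permutation
rng = np.random.default_rng(0)
pi = rng.uniform(size=5)
z = rng.permutation(5) + 1
print(category_probs(pi, z).sum(), category_probs(pi, z, reverse=True).sum())  # both ~1.0
```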
This means we let $b_{ij} = 0$ if $j < z_{y_i}$, let $b_{ij} = 1$ if $j = z_{y_i}$, draw $b_{ij}$ from $\mathrm{Bernoulli}(\pi_{ij})$ if $z_{y_i} < j < S$, and let $b_{iS} = 1$ if and only if $z_{y_i} = S$. Note that stick $S$ is used as a reference stick and $\pi_{iS}$ is not used in defining $p_{is}(z)$ in (4). Despite having no impact on computing $\{p_{is}\}_{1,S}$, we infer $\pi_{iS}$ (i.e., sample the regression-coefficient vector $\beta_{s: z_s = S}$) under the likelihood $\prod_{i=1}^{N} \mathrm{Bernoulli}(b_{iS}; \pi_{iS})$ and use it in a Metropolis-Hastings step, as described in (8) below, to decide whether to switch the mappings of two different categories when one of them is mapped to the reference stick $S$. Once we have an MCMC sample of $\{b_{ij}\}_{1,S}$, we essentially solve $S$ binary classification problems independently, the $j$th of which can be expressed as $b_{ij} \mid x_i, \beta_{s:z_s=j} \sim \mathrm{Bernoulli}[\pi_{ij}(x_i, \beta_{s:z_s=j})]$.

Analogously, for parSB, $\{b_{ij}\}_{1,S}$ can be sampled as $(b_{ij} \mid y_i, \pi_{ij}, z) \sim \mathbf{1}(j < z_{y_i})\,\delta_1 + \mathbf{1}(j = z_{y_i})\,\delta_0 + \mathbf{1}(j > z_{y_i})\,\mathrm{Bernoulli}(\pi_{ij})$ for $j = 1, \ldots, S-1$, and $b_{iS} = 1 - \mathbf{1}(z_{y_i} = S)$, which means we let $b_{ij} = 1$ if $j < z_{y_i}$, let $b_{ij} = 0$ if $j = z_{y_i}$, draw $b_{ij}$ from $\mathrm{Bernoulli}(\pi_{ij})$ if $z_{y_i} < j < S$, and let $b_{iS} = 0$ if and only if $z_{y_i} = S$.

Since stick-breaking multinomial classification is not invariant to the permutation of its class labels, it may perform substantially worse than it could if the inherent geometric constraints implied by the current ordering of the labels make it difficult to adapt the decision boundaries to the data. Our solution to this problem is to infer the one-to-one mapping between the category labels and stick indices from the data. We construct a Metropolis-Hastings (MH) step within each Gibbs sampling iteration, with a proposal that switches the two sticks that categories $c$ and $c'$, $1 \le c < c' \le S$, are mapped to, by changing the current category-stick one-to-one mapping from $z = (z_1, \ldots, z_c, \ldots, z_{c'}, \ldots, z_S)$ to $z' = (z'_1, \ldots, z'_S) := (z_1, \ldots, z_{c'}, \ldots, z_c, \ldots, z_S)$. Assuming a uniform prior on $z$ and proposing $(c, c')$ uniformly at random from one of the $\binom{S}{2} = S(S-1)/2$ possibilities, we accept the proposal with probability
$$\min\left\{ \frac{\prod_{i=1}^{N}\prod_{s=1}^{S} [p_{is}(z')]^{\mathbf{1}(y_i = s)}}{\prod_{i=1}^{N}\prod_{s=1}^{S} [p_{is}(z)]^{\mathbf{1}(y_i = s)}},\; 1 \right\}, \qquad (8)$$
where each $p_{is}(\cdot)$ is computed via (4). The paSB construction also relaxes the IIA assumption of MLR: as shown in the proof of Lemma 3 in Appendix A, the probability ratio of two choices is a function of the stick probabilities of all the sticks between (and including) the two sticks they are mapped to, and hence depends on other alternatives whenever those two sticks are not adjacent.

## 4. Multinomial Softplus Regression

Under paSB, we construct multinomial softplus regression (MSR) by using a sum-stack-softplus regression of Zhou (2016) as each stick-specific binary classifier. Each category $s$ is equipped with a gamma process of experts, indexed by $k$, with expert weights $r_{sk} > 0$ and regression-coefficient vectors $\beta^{(t)}_{sk} \in \mathbb{R}^{P+1}$ for layers $t = 2, \ldots, T+1$. Given the covariate vector $x_i$ and category-stick mapping $z$, MSR parameterizes $p_{is}$, the multinomial probability of category $s$, under the paSB construction as
$$p_{is}(z) = \left\{ 1 - \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{sk}} \ln\!\Bigl( 1 + e^{x_i'\beta^{(T)}_{sk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{sk}} ) \bigr) \Bigr) \right]^{-r_{sk}} \right\}^{\mathbf{1}(z_s \neq S)} \prod_{j: z_j < z_s}\, \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{jk}} \ln\!\Bigl( 1 + e^{x_i'\beta^{(T)}_{jk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{jk}} ) \bigr) \Bigr) \right]^{-r_{jk}} \qquad (9)$$
and parameterizes $p_{is}$ under the parSB construction as
$$p_{is}(z) = \left\{ \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{sk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{sk}} ) \bigr) \right]^{-r_{sk}} \right\}^{\mathbf{1}(z_s \neq S)} \prod_{j: z_j < z_s} \left\{ 1 - \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{jk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{jk}} ) \bigr) \right]^{-r_{jk}} \right\}.$$
Equivalently, the stick success probability of category $s$ can be written as
$$\pi_{iz_s} = 1 - \exp\left\{ -\sum_{k=1}^{K} r_{sk} \ln\!\left[ 1 + e^{x_i'\beta^{(T+1)}_{sk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{sk}} ) \bigr) \right] \right\}. \qquad (10)$$

For convenience of implementation, we truncate the number of atoms of the gamma process at $K$ by choosing a discrete base measure for each category, $G_{s0} = \sum_{k=1}^{K} \frac{\gamma_{s0}}{K}\, \delta_{\beta^{(2:T+1)}_{sk}}$, under which we have $r_{sk} \sim \mathrm{Gamma}(\gamma_{s0}/K, 1/c_{s0})$ as the prior distribution for the weight of expert $k$ in category $s$. For each category, we expect only some of its $K$ experts to have non-negligible weights if $K$ is set large enough, and we may use $\sum_{k} \mathbf{1}\bigl( \sum_{i} m^{(1)}_{isk} > 0 \bigr)$, where $m^{(1)}_{isk}$ is defined in (11), to measure the number of active experts inferred from the data.
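As a concrete reference, the following minimal sketch (helper names are ours) evaluates the sum-stack-softplus stick success probability in (9)-(10) for a single observation.

```python
import numpy as np

def stack_term(x, betas):
    """betas = [beta^(2), ..., beta^(T+1)] for one expert (T = len(betas) layers)."""
    if len(betas) == 1:                       # T = 1: plain softplus argument
        return 1.0 + np.exp(x @ betas[0])
    a = np.log1p(np.exp(x @ betas[0]))        # ln(1 + e^{x'beta^(2)})
    for beta in betas[1:-1]:
        a = np.log1p(np.exp(x @ beta) * a)    # ln(1 + e^{x'beta^(t)} a), t = 3..T
    return 1.0 + np.exp(x @ betas[-1]) * a    # outermost layer, t = T + 1

def msr_stick_prob(x, expert_betas, r):
    """Stick success probability pi = 1 - prod_k stack_term_k^{-r_k}, as in (9)-(10);
    expert_betas is a length-K list of per-expert beta lists, r the expert weights."""
    log_fail = -sum(rk * np.log(stack_term(x, betas))
                    for rk, betas in zip(r, expert_betas))
    return 1.0 - np.exp(log_fail)
```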
### 4.2 Geometric Constraints for MSR

Since by definition we have $p_{is}(z) = \pi_{iz_s}\bigl(1 - \sum_{j: z_j < z_s} p_{ij}(z)\bigr)$ for $z_s \neq S$, the multinomial probability of category $s$ is constrained by the probabilities of the categories mapped to the sticks broken before stick $z_s$, regardless of what the stick probabilities of the sticks indexed above $z_s$ are. To motivate the use of the seemingly over-parameterized sum-stack-softplus function in (9), we first consider the simplest case of $K = T = 1$. Without loss of generality, let us assume that the category-stick mapping is fixed at $z = (1, \ldots, S)$.

**Lemma 4** For paSB-MSR with $K = T = 1$ and $z = (1, \ldots, S)$, the set of solutions to $p_{is}(z) > p_0$ in the covariate space is bounded by a convex polytope defined by the intersection of $s$ linear hyperplanes.

Note that binary softplus regression with $K = T = 1$ is closely related to logistic regression, and reduces to logistic regression if $r = 1$ (Zhou, 2016). With Lemma 4, it is clear that even if an optimal category-stick mapping $z$ is provided, paSB-MSR with $K = T = 1$ may still clearly underperform MLR. This is because category $s$ uses a single hyperplane to separate itself from the remaining $S - s$ categories, and hence uses the intersection of at most $s$ hyperplanes to separate itself from the other $S - 1$ categories. By contrast, MLR uses a convex polytope bounded by at most $S - 1$ hyperplanes for each of the $S$ categories.

When $K > 1$ and/or $T > 1$, an exact theoretical analysis is beyond the scope of this paper. Instead, we provide some qualitative analysis by borrowing the related geometric-constraint analysis for softplus regressions in Zhou (2016). Note that Equation (10) indicates that a noisy-or model (Pearl, 2014; Srinivas, 1993), commonly appearing in causal inference, is used at each step of the sequential one-vs-remaining decision process; at each step, the binary outcome of an observation is attributed to the disjunctive interaction of many possible hidden causes. Roughly speaking, to enclose category $s$ and separate it from the remaining $S - s$ categories in the covariate space, paSB-MSR with $K > 1$ and $T = 1$ uses the complement of a convex-polytope-bounded space, paSB-MSR with $K = 1$ and $T > 1$ uses a convex-polytope-like confined space, and paSB-MSR with both $K > 1$ and $T > 1$ uses a union of convex-polytope-like confined spaces. For parSB-MSR with $K + T > 1$, the interpretation is the same except that a convex polytope in paSB is replaced with the complement of a convex polytope, and vice versa.

In contrast to SVMs using the kernel trick, MSRs using the original covariates might be more appealing in research areas, such as biostatistics and sociology, where the interpretation of regression coefficients and the investigation of causal relationships are of interest. In addition, we find that the classification capability of MSRs can be further enhanced with data transformation, as will be discussed in Section 6.4.

## 5. Bayesian Multinomial Support Vector Machine

Support vector machines (SVMs) are max-margin binary classifiers that typically minimize a regularized hinge-loss objective function,
$$\sum_{i=1}^{N} \max(1 - b_i x_i'\beta, 0) + \nu R(\beta),$$
where $b_i \in \{-1, 1\}$ represents the binary label of the $i$th observation, $R(\beta)$ is a regularization function that is often set as the $L_1$ or $L_2$ norm of $\beta$, $\nu$ is a tuning parameter, and $x_i'$ is the $i$th row of the design matrix $X = (x_1, \ldots, x_N)'$. For linear SVMs, $x_i$ is the covariate vector of the $i$th observation, whereas for nonlinear SVMs, one typically sets the $(i,j)$th element of $X$ as the kernel distance between the covariate vector of the $i$th observation and the $j$th support vector. The decision boundary of a binary SVM is $\{x : x'\beta = 0\}$ and an observation is assigned the label $y_i = \mathrm{sign}(x'\beta)$, which means $b_i = 1$ if $x'\beta \ge 0$ and $b_i = -1$ if $x'\beta < 0$.
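For concreteness, a minimal sketch (ours) of the regularized hinge-loss objective and the deterministic SVM label rule just described, assuming an L2 penalty for $R(\beta)$:

```python
import numpy as np

def hinge_objective(beta, X, b, nu):
    """sum_i max(1 - b_i x_i'beta, 0) + nu * ||beta||_2^2, labels b_i in {-1, +1}."""
    margins = 1.0 - b * (X @ beta)
    return np.maximum(margins, 0.0).sum() + nu * np.dot(beta, beta)

def svm_predict(beta, X):
    """Deterministic SVM labels sign(x'beta), returned in {-1, +1}."""
    return np.where(X @ beta >= 0, 1, -1)
```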
### 5.1 Bayesian Binary SVMs

It is shown in Polson and Scott (2011) that the exponential of the negative hinge loss can be expressed as a location-scale mixture of normals,
$$L(b_i \mid x_i, \beta) = \exp\{-2\max(1 - b_i x_i'\beta, 0)\} = \int_0^\infty \frac{1}{\sqrt{2\pi\omega_i}} \exp\left\{ -\frac{(1 + \omega_i - b_i x_i'\beta)^2}{2\omega_i} \right\} d\omega_i.$$
Consequently, $L(b \mid X, \beta) = \prod_i L(b_i \mid x_i, \beta) = \exp\{-2\sum_i \max(1 - b_i x_i'\beta, 0)\}$ can be regarded as a pseudo-likelihood in the sense that it is unnormalized with respect to $b = (b_1, \ldots, b_N)' \in \{-1, 1\}^N$. This location-scale normal mixture representation of the hinge loss allows developing closed-form Gibbs sampling update equations for the regression coefficients $\beta$ via data augmentation, as discussed in detail in Polson and Scott (2011) and further generalized in Henao et al. (2014) to construct nonlinear SVMs amenable to Bayesian inference.

While data augmentation has made it feasible to develop Bayesian inference for SVMs, it has not addressed a common issue: SVMs provide predictions of deterministic class labels but not class probabilities. For this reason, below we discuss how to allow SVMs to predict class probabilities while maintaining tractable Bayesian inference via data augmentation. Following Sollich (2002) and Mallick et al. (2005), by defining the joint distribution of $\beta$ and $\{x_i\}_i$ to be proportional to $\prod_i [L(1 \mid x_i, \beta) + L(-1 \mid x_i, \beta)]$, one may define the conditional distribution of the binary label $b_i \in \{-1, 1\}$ as
$$P(b_i \mid x_i, \beta) = \begin{cases} \dfrac{1}{1 + e^{-2 b_i x_i'\beta}}, & \text{for } |x_i'\beta| \le 1, \\[2mm] \dfrac{1}{1 + e^{-b_i[x_i'\beta + \mathrm{sign}(x_i'\beta)]}}, & \text{for } |x_i'\beta| > 1, \end{cases} \qquad (12)$$
which defines a probabilistic inference model that has the same maximum a posteriori (MAP) solution as that of a binary SVM for a given data set. Note that for MAP inference, the penalty term $\nu R(\beta)$ of the regularized hinge loss can be related to a corresponding prior distribution imposed on $\beta$, such as Gaussian, Laplace, and spike-and-slab priors (Polson and Scott, 2011).

### 5.2 paSB Multinomial Support Vector Machine

Generalizing previous work on constructing Bayesian binary SVMs, we propose a multinomial SVM (MSVM) under the paSB framework that is distinct from previously proposed MSVMs (Crammer and Singer, 2002; Lee et al., 2004; Liu and Yuan, 2011). A Bayesian MSVM that predicts class probabilities has also been proposed in Zhang and Jordan (2006), which, however, does not have a data augmentation scheme to sample the regression coefficients in closed form and consequently relies on a random-walk Metropolis-Hastings procedure that may be difficult to tune. Redefining the label sample space from $b_i \in \{-1, 1\}$ to $b_i \in \{0, 1\}$, we may rewrite (12) as $b_i \mid x_i, \beta \sim \mathrm{Bernoulli}[\pi_{i,\mathrm{svm}}(x_i, \beta)]$, where
$$\pi_{i,\mathrm{svm}}(x_i, \beta) = \begin{cases} \dfrac{1}{1 + e^{-2 x_i'\beta}}, & \text{for } |x_i'\beta| \le 1, \\[2mm] \dfrac{1}{1 + e^{-x_i'\beta - \mathrm{sign}(x_i'\beta)}}, & \text{for } |x_i'\beta| > 1. \end{cases} \qquad (13)$$
The Bernoulli-likelihood-based cross-entropy-loss binary classifier, whose covariate-dependent probabilities are parameterized as in (13), is exactly what we need to extend the binary SVM into a multinomial classifier under the paSB construction introduced in Theorem 1.
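A minimal sketch (ours) of the stick success probability in (13); the function name is hypothetical.

```python
import numpy as np

def svm_stick_prob(x, beta):
    """Bernoulli success probability of the Bayesian SVM classifier in (13),
    with the labels recoded to {0, 1}."""
    m = x @ beta
    if np.abs(m) <= 1.0:
        return 1.0 / (1.0 + np.exp(-2.0 * m))
    return 1.0 / (1.0 + np.exp(-m - np.sign(m)))
```

Note that the two branches agree at $|x_i'\beta| = 1$, so the probability is a continuous function of the margin.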
More specifically, given the category-stick mapping $z$, with the success probability of the stick that category $s$ is mapped to parameterized as $\pi_{iz_s,\mathrm{svm}}(x_i, \beta_s)$ and the binary stick variables drawn as $b_{iz_s} \sim \mathrm{Bernoulli}[\pi_{iz_s,\mathrm{svm}}(x_i, \beta_s)]$, we have the following definition.

**Definition 2 (paSB multinomial SVM)** Under the paSB construction, given the covariate vector $x_i$ and category-stick mapping $z$, the multinomial support vector machine (MSVM) parameterizes $p_{is}$, the multinomial probability of category $s$, as
$$p_{is}(z) = [\pi_{iz_s,\mathrm{svm}}(x_i, \beta_s)]^{\mathbf{1}(z_s \neq S)} \prod_{j: z_j < z_s} [1 - \pi_{iz_j,\mathrm{svm}}(x_i, \beta_j)].$$

Posterior simulation for paSB-MSVM then proceeds by combining the data augmentation of Section 3.1 with the location-scale normal mixture representation of Section 5.1 for each stick-specific binary SVM.

## 6. Example Results

### 6.1 Illustration of Geometric Constraints

We first illustrate the geometric constraints discussed in Sections 2.2 and 4.2 on a two-dimensional version of the iris data, fixing the category-stick mapping at $z = (1, 2, 3)$ and varying the MSR capacity parameters $K$ and $T$ (Figure 1). Setting $T > 1$ allows using a single (if $K = 1$, as in the second row of Figure 1) or a union (if $K > 1$, as in the fourth row) of convex-polytope-like confined spaces to separate one category from the others (by enclosing the positively labeled observations in each stick-specific binary classification task). The results in Figure 1 show that even when an unoptimized category-stick mapping, which is unfavorable to MSR with small $K$ and/or $T$, is enforced, empowering each stick-specific binary regression model with a higher capacity (using larger $K$ and/or $T$) can still allow MSR to achieve excellent separations. It is also simple to show that for the data set in Figure 1, even if one chooses low-capacity stick-specific binary regression models by setting $T = 1$, one can still achieve good performance with MSR if the category-stick mapping is set as $z = (2, 1, 3)$, $z = (3, 1, 2)$, $z = (2, 3, 1)$, or $z = (3, 2, 1)$. That is to say, as long as it is not category 1 (blue points) that is mapped to stick 1, MSR with $T = 1$ is able to provide satisfactory performance.

Figure 1: Log-likelihood plots and predictive probability heat maps for the 2-D iris data with a fixed category-stick mapping $z = (1, 2, 3)$. Blue, black, and gray points are labeled as categories 1, 2, and 3, respectively. The four rows correspond to $(K, T) = (1, 1)$, $(1, 3)$, $(5, 1)$, and $(5, 3)$. The log-likelihood plots are shown in Column 1, and the predictive probability heat maps of categories 1 (blue), 2 (black), and 3 (gray) are shown in Columns 2, 3, and 4, respectively.

### 6.2 Influence of the Category-Stick Mapping and its Inference

The iris data set in Figure 1 provides an instructive example showing not only the importance of increasing the model capacity if a poor category-stick mapping is imposed, but also the importance of optimizing the category-stick mapping if the capacities of the stick-specific binary regression models are limited. To further illustrate the benefits of inferring an appropriate category-stick mapping $z$, we consider the square data set shown in Figure 2.

We show that for MSR, even if both $K$ and $T$ are sufficiently large to allow each stick-specific binary regression model to have a high enough capacity, whether an optimal category-stick mapping is selected may still clearly matter for performance. As shown in the first three rows of Figure 2, with $K = T = 10$, three different $z$'s are considered and $z = (1, 2, 3)$ (shown in the first row) is found to perform the best. As shown in the fourth row, we sample $z$ using (8) within each MCMC iteration and achieve a result that appears as good as fixing $z = (1, 2, 3)$. In fact, we find that the inferred mappings switch between $z = (1, 2, 3)$ and $z = (1, 3, 2)$ during the MCMC iterations, indicating that the Markov chain is mixing well. These results suggest the importance of both learning the mapping $z$ from the data and allowing the stick-specific binary classifiers to have enough capacity to model nonlinear classification decision boundaries.
When sampling $z = (z_1, \ldots, z_S)$, the permutation that the $S$ categories are mapped to, although the number of permutations of $(1, \ldots, S)$, namely $S!$, can become enormous as $S$ increases, the effective search space can be much smaller if many different mappings imply similar likelihoods and if the extremely poor mappings can be easily avoided. Rather than searching for the best mapping, the proposed MH step, which proposes two indices $z_j$ and $z_{j'}$ to switch in each iteration, is a simple but effective strategy to escape from the mappings that lead to poor fits. Note that the probability of a given $z_j$ not being proposed to switch after $t$ MCMC iterations is $[(S-2)/S]^t$. Even if $S$ is as large as 100, this probability is less than $10^{-8}$ at $t = 1000$. Also note that the iteration at which $z_j$ is proposed to switch for the first time follows a geometric distribution with success probability $2/S$. Thus $S/2$ is the expected number of iterations for a given $z_j$ to be proposed to switch once.

Figure 2: Log-likelihood plots and predictive probability heat maps for the square data with $K = T = 10$. The blue, black, and gray points are labeled as categories 1, 2, and 3, respectively. We fix the category-stick mapping as $z = (1, 2, 3)$ for Row 1, $(2, 1, 3)$ for Row 2, and $(3, 1, 2)$ for Row 3, and sample $z$ for Row 4. The log-likelihood plots are shown in Column 1, and the predictive probability heat maps of categories 1 (blue), 2 (black), and 3 (gray) are shown in Columns 2, 3, and 4, respectively.

To demonstrate the efficiency of our permutation scheme, we construct square101, a synthetic two-dimensional data set consisting of 101 categories. We generate 8000 data points that are distributed uniformly at random within the 12 × 12 spatial region occupied by all 101 categories. The decision boundaries of the different classes are displayed in Figure 3(a), where the data points placed within the outside square frame, whose outer and inner dimensions are 12 and 10, respectively, are assigned to category 1, and those placed within the $s$th unit square inside the square frame, where $s \in \{2, \ldots, 101\}$, are assigned to category $s$.

Although it is almost impossible to search for the best category-stick mapping $z$, the one giving rise to the highest likelihood among all $101! \approx 10^{160}$ possible mappings, we show that our permutation scheme is very effective in escaping from poor mappings, leading to a performance that is comparable to the best of those obtained with pre-fixed suboptimal mappings. More specifically, applying the analysis in Section 4.2 to Figure 3(a), we expect an aSB-MSR to perform well under a fixed suboptimal category-stick mapping $z$ in which $z_1 = 1$, which means the outside square frame is mapped to stick 1, and the squares closer to the inner boundary of the square frame are mapped to the sticks broken at earlier stages; the mapping $z = (1, 2, \ldots, 101)$ is one such example. In other words, we first separate the frame from all the other squares, and then sequentially separate the squares from the remaining ones; the closer a square is to the frame, the earlier it is separated. The total number of suboptimal mappings constructed in this manner is as large as $36!\,28!\,20!\,12!\,4! \approx 10^{99.5}$. First, we generate uniformly at random 3600 different suboptimal mappings under this construction, run aSB-MSR with $K = T = 4$, and plot the histogram of the 3600 log-likelihoods in Figure 3(b). Second, we start from 3600 randomly initialized $z$, run paSB-MSR with $K = T = 4$, and also plot the histogram of the 3600 log-likelihoods in Figure 3(b). For each run, we choose 20,000 MCMC iterations and collect the last 1000 MCMC samples.
Each log-likelihood is averaged over those of the corresponding model's collected MCMC samples. As shown in Figure 3(b), the log-likelihood from a paSB-MSR is in general clearly larger than that of an aSB-MSR with a fixed suboptimal $z$, and there is little overlap between their corresponding histograms. Further examining the 3600 $z$'s inferred by paSB at its last MCMC iteration shows that 3482 of them have $z_1 = 1$ and all of them have $z_1 \le 5$. Suppose $z_1 \notin \{1, 2, 3, 4, 5\}$ at the current iteration, which means category 1 is mapped to none of the first five sticks; then the probability of not only selecting stick $z_1$, but also switching it with one of the first five sticks in the MH proposal, is $\frac{1}{101} \cdot \frac{5}{100}$. Thus the probability that category 1 has never been proposed to be mapped to one of the first five sticks after $t$ iterations is $[1 - 5/(101 \cdot 100)]^t$, which becomes as small as 0.005% at $t = 20{,}000$, demonstrating the effectiveness of our permutation scheme in dealing with a large number of categories. Note that we have also tried 3600 runs of aSB-MSR, each provided with a randomly initialized $z$. The log-likelihoods, however, are all far below −4000 and hence are not included for comparison. This phenomenon is not surprising, as the probability for a randomly initialized $z$ to be suboptimal is as tiny as $36!\,28!\,20!\,12!\,4!/101! \approx 10^{-60.5}$.

Figure 3: (a) Illustration of the square101 data and (b) log-likelihood histograms, by aSB-MSR with 3600 random suboptimal category-stick mappings and by paSB-MSR with 3600 randomly initialized category-stick mappings.

Figure 4 empirically demonstrates the effectiveness of permuting $z$ on the satimage data set, using MSRs with $K = 5$, $T = 3$, and $z$ fixed at each of the $6! = 720$ possible one-to-one category-stick mappings. Panels (a) and (b) show the log-likelihood histograms for MSRs constructed under augmented SB (aSB) and augmented reversed SB (arSB), respectively. Both histograms are clearly left skewed, indicating that under both aSB and arSB, only a small proportion of the 720 different category-stick mappings lead to very poor fits. The blue vertical lines at −1203.82 in (a) and −1350.21 in (b) are the log-likelihoods achieved by paSB and parSB, respectively, in both of which the category-stick mapping $z$ is updated by an MH step in each MCMC iteration. Only 20 (97) out of the 720 aSB-MSRs (arSB-MSRs) have a higher likelihood than paSB-MSR (parSB-MSR).

Figure 4: Log-likelihood histograms for MSRs using all 720 possible category-stick mappings, constructed under (a) augmented stick breaking (aSB) and (b) augmented and reversed stick breaking (arSB). The blue lines in (a) and (b) correspond to the log-likelihoods of paSB-MSR and parSB-MSR, respectively.

Since in the stick-breaking construction the binary classifier that separates a category mapped to a smaller-indexed stick from the others utilizes fewer constraints, the classification can be poor if the complexity of the decision boundary goes beyond the nonlinear modeling capacity of the binary classifier. However, even with a low-capacity binary classifier, the performance can be significantly improved if that difficult-to-separate category is mapped to a larger-indexed stick, for which there are fewer categories left to be separated in its one-vs-remaining binary classification problem.
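The two proposal probabilities quoted above can be verified with a couple of lines (a sketch using the formulas as stated):

```python
# Probability a given z_j is never proposed to switch in t iterations: [(S-2)/S]^t.
S = 100
print(((S - 2) / S) ** 1000)                  # ~1.7e-9, indeed below 1e-8

# Probability category 1 is never proposed to map to one of the first five sticks
# of square101 after t iterations: [1 - 5/(S(S-1))]^t with S = 101.
S = 101
print((1 - 5 / (S * (S - 1))) ** 20000)       # ~5e-5, i.e. about 0.005%
```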
Examining the $z$'s associated with the 100 lowest log-likelihoods in Figure 4, we find 51 mappings belonging to the set $\{z : z_5 = 1 \text{ or } z_6 = 1\}$ under aSB, and 77 belonging to $\{z : z_3 = 1 \text{ or } z_6 = 1\}$ under arSB. This suggests that separating category 5 or 6 (category 3 or 6) from all the other categories might be beyond the capacity of a binary softplus regression with $K = 5$ and $T = 3$ under the aSB (arSB) construction. But if the sticks associated with these categories are broken at late stages, we only need to separate them from fewer remaining categories, which can be much easier. We have further examined the other 620 arrangements and found no evident patterns. These observations suggest that the effective search space of the mapping $z$ is considerably smaller than $S!$, and that the proposed MH step is effective in escaping from poor category-stick mappings.

In paSB-MSVM, we use a Gaussian radial basis function kernel, whose kernel width is cross validated from a set of predefined candidates. We find its performance to be sensitive to the setting of the kernel width, which is a common issue for SVMs (Cherkassky and Ma, 2004; Soares et al., 2004; Chang et al., 2005). If an appropriate kernel width can be identified through cross validation, we find that learning the mapping $z$ becomes less important for paSB-MSVM to perform well. However, if the kernel width is not well selected, which can happen if all candidate kernel widths are far from the optimal value, the binary classifier for each category may not have enough capacity for nonlinear classification, and the learning of the category-stick mapping $z$ can then become important.

### 6.3 Turning Off Unneeded Model Capacities

While one can adjust both $K$ and $T$ to control the capacity of binary softplus regression, for MSR the total number of experts $K$ is a truncation level that can be set as large as permitted by the computation budget. This is because the truncated gamma process used by each stick-specific binary softplus regression shrinks the weights of unnecessary experts toward zero. Figure 5 shows, in decreasing order, the inferred weights of the experts belonging to each of the 3 categories of the square data set. These weights are inferred by MSR with $K = T = 10$ and the learning of $z$, as in the fourth row of Figure 2. It is clear from Figure 5 that only a small number of experts are inferred with non-negligible weights in the posterior, and the number of active experts and their weights indicate the complexity of the corresponding classification decision boundaries shown in the fourth row of Figure 2.

Figure 5: Inferred expert weights $r_k$ in descending order for each category of the square data with $K = T = 10$.
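A small sketch of summarizing the shrinkage just described: the paper counts an expert as active when it has at least one nonzero latent count $m^{(1)}_{isk}$ (see (11)); thresholding the posterior expert weights, as done here, is a simpler proxy and an assumption of this sketch.

```python
import numpy as np

def count_active_experts(r, tol=1e-3):
    """Count, per category, the experts whose inferred weights r[s, k] exceed tol.
    r is an (S, K) array of posterior mean expert weights."""
    return (r > tol).sum(axis=1)
```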
We note that while $T$ is a parameter to be set by the user, we find that increasing it increases model capacity without clear signs of overfitting on any of the data sets considered here.

### 6.4 MSR with Data Transformation

Kernel SVMs transform the data to make different categories more linearly separable in the transformed covariate space. While kernel SVMs may provide high nonlinear modeling capacity, their performance can be sensitive to the kernel width, which often needs to be cross validated, and their number of support vectors often increases linearly in the size of the training set. By contrast, MSRs rely on the interactions of linear hyperplanes to construct nonlinear decision boundaries, as discussed in Section 4.2, and hence may have insufficient capacity for highly complex nonlinearity. However, we may simply stack another MSR on a previously trained MSR to quickly enhance its nonlinear modeling capacity. In particular, we may first run an MSR to obtain a finite set of hyperplanes denoted by $\{\hat\beta^{(t+1)}_{jk}\}$. We may then augment the original covariate vector $x_i$ as
$$\tilde{x}_i := \Bigl[ x_i',\; \log\bigl(1 + e^{x_i'\hat\beta^{(2)}_{11}}\bigr),\; \ldots,\; \log\bigl(1 + e^{x_i'\hat\beta^{(t+1)}_{jk}}\bigr),\; \ldots,\; \log\bigl(1 + e^{x_i'\hat\beta^{(T+1)}_{SK}}\bigr) \Bigr]' \qquad (14)$$
and run another MSR with the transformed covariates $\tilde{x}_i$. For illustration, we show the efficacy of this data-transformation strategy on a 2-D swiss roll data set in Figure 6. The first row shows the results of MSR with $K = 5$ and $T = 3$, using the original covariates $x_i$, while the second row shows MSR with $K = 5$ and $T = 1$, using the transformed covariates $\tilde{x}_i$ defined by (14), where the regression-coefficient vectors $\hat\beta^{(t+1)}_{jk}$ are learned using the MSR illustrated in the first row. It is evident that the classification is greatly improved in terms of both training log-likelihood and out-of-sample predictions.

Figure 6: First row: classification of a 2-D swiss roll data set by paSB-MSR with $K = 5$, $T = 3$, using the original covariates. Second row: paSB-MSR with $K = 5$, $T = 1$ trained on the covariates transformed via the paSB-MSR used in the first row. In each row, the left column plots the log-likelihood against the MCMC iteration, and the middle and right columns show the predictive probability heat maps for Category 1 (black points) and Category 2 (blue points), respectively.
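A small sketch of the covariate transformation in (14), assuming the hyperplanes from the first-stage MSR are stacked into a single array (names are ours):

```python
import numpy as np

def transform_covariates(X, beta_hats):
    """Augment each covariate vector with softplus features log(1 + exp(x'beta))
    for every hyperplane learned by a previously trained MSR, as in (14).
    X is (N, P+1); beta_hats is an (H, P+1) array stacking all beta^(t+1)_{jk}."""
    features = np.log1p(np.exp(X @ beta_hats.T))   # (N, H) softplus features
    return np.hstack([X, features])                # transformed design matrix
```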
### 6.5 Results on Benchmark Data Sets

To further evaluate the performance of the proposed paSB multinomial regression models, we consider paSB multinomial logistic regression (paSB-MLR), paSB multinomial robit regression with $\kappa = 6$ degrees of freedom (paSB-robit), paSB multinomial support vector machine (paSB-MSVM), and MSRs. We compare their performance with those of L2-regularized multinomial logistic regression (L2-MLR), the support vector machine (SVM), and the adaptive multi-hyperplane machine (AMM), and consider the following benchmark multi-class classification data sets: iris, wine, glass, vehicle, waveform, segment, dna, and satimage. We also include the synthetic square data shown in Figure 2 for comparison. For SVM we use the LIBSVM package, which trains $S(S-1)/2$ one-vs-one binary classifiers and makes predictions using majority voting (Chang and Lin, 2011). We run LIBSVM in R with the package e1071 (Meyer et al., 2015). We consider MSRs with $(K, T)$ set to $(1, 1)$, $(1, 3)$, $(5, 1)$, and $(5, 3)$, respectively. We also consider MSR with data transformation (DT-MSR), in which we first train an MSR with $K = 5$ and $T = 3$ to transform the covariates and then stack another MSR with $K = 5$ and $T = 1$. We provide detailed descriptions of the data and experimental settings in the Appendix.

With the number of categories in parentheses after the data set names, we summarize in Table 1 the classification error rates of the various models, where those of MSRs are calculated by averaging over paSB and parSB. Table 1 shows that an MSR with $K$ or $T$ sufficiently large generally outperforms paSB-MLR, paSB-robit, L2-MLR, and AMM, and that using another MSR on the transformed covariates can in general further reduce the error rate. This is especially evident when there are categories that are not linearly separable, as indicated by a clearly higher error rate of L2-MLR in contrast to that of SVM. One may notice that paSB-robit, paSB-MLR, and MSR with $K = T = 1$ are similar to L2-MLR in terms of performance, suggesting the effectiveness of the proposed permutation scheme, which helps mitigate the potential adverse effects of having asymmetric class labels. One may also note that paSB-robit outperforms paSB-MLR on glass, vehicle, waveform, dna, and satimage, indicating that there are benefits in using a robust classifier on these data sets. The comparable error rates of paSB-MSVM to SVM, and the better performance of MSRs on most data sets, demonstrate the success of the paSB framework in transforming a binary classifier with cross-entropy loss into a Bayesian multinomial one.

| Data (S) | paSB-MLR | paSB-robit | paSB-MSVM | MSR K=1, T=1 | MSR K=1, T=3 | MSR K=5, T=1 | MSR K=5, T=3 | DT-MSR | L2-MLR | SVM | AMM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| square (3) | 59.52 | 67.46 | 0 | 57.14 | 15.08 | 0 | 0 | 0 | 62.29 | 4.76 | 16.67 |
| iris (3) | 4.00 | 5.33 | 3.33 | 4.67 | 4.00 | 4.00 | 4.00 | 3.33 | 3.33 | 4.00 | 4.67 |
| wine (3) | 4.44 | 5.00 | 2.78 | 2.78 | 2.22 | 2.78 | 2.22 | 2.78 | 3.89 | 2.78 | 3.89 |
| glass (6) | 35.35 | 34.88 | 29.30 | 33.49 | 26.05 | 31.16 | 32.09 | 26.37 | 33.02 | 28.84 | 37.67 |
| vehicle (4) | 23.23 | 21.25 | 17.32 | 22.44 | 17.32 | 17.72 | 15.75 | 14.96 | 22.83 | 18.50 | 21.89 |
| waveform (3) | 17.87 | 16.42 | 15.76 | 19.84 | 16.62 | 15.67 | 15.04 | 15.56 | 15.60 | 15.22 | 18.54 |
| segment (7) | 7.36 | 8.03 | 7.98 | 6.20 | 6.49 | 6.45 | 5.63 | 7.65 | 8.56 | 6.20 | 12.47 |
| dna (3) | 5.06 | 4.05 | 5.31 | 4.13 | 4.47 | 4.55 | 4.22 | 3.88 | 5.98 | 4.97 | 5.43 |
| satimage (6) | 20.65 | 17.25 | 8.90 | 16.65 | 14.45 | 12.85 | 12.00 | 9.85 | 17.80 | 8.50 | 15.31 |

Table 1: Comparison of the classification error rates (%) of paSB-MLR, paSB-robit, paSB-MSVM, MSRs with various $K$ and $T$ (columns 5 to 8), MSR with data transformation (DT-MSR), L2-MLR, SVM, and AMM.

To further check whether a paSB model is attractive when fast out-of-sample prediction is desired, we consider using only the MCMC sample that has the highest training likelihood among the collected ones for all paSB models, and summarize in Table 4 of the Appendix the classification error rates of the various models, with the number of inferred support vectors or active hyperplanes included in parentheses. Following the definition of active experts in Zhou (2016), we define for MSRs the number of active hyperplanes as $T\sum_{s=1}^{S} \tilde{K}_s$, where $\tilde{K}_s$ is the number of active experts for class $s$. The number of active hyperplanes determines the computational complexity for out-of-sample prediction with a single MCMC sample, which is $O(T\sum_{s} \tilde{K}_s)$. Since the error rates of MSRs in Table 4 are calculated by averaging over both paSB and parSB, the corresponding number of active hyperplanes is $T\sum_{s}\bigl( \tilde{K}^{(\mathrm{paSB})}_s + \tilde{K}^{(\mathrm{parSB})}_s \bigr)$. Shown in Figure 8 in the Appendix are boxplots of the number of each category's active experts for MSR with $K = 5$ and $T = 3$. Except for several categories of satimage that require all $K = 5$ experts for parSB-MSR, $K = 5$ is large enough to provide the needed model capacity in all other scenarios. As shown in Table 4, MSRs with sufficiently large $K$ and/or $T$ are comparable to both SVM and paSB-MSVM in terms of the error rates, while clearly outperforming them in terms of the number of (active) hyperplanes/support vectors and hence the computational complexity for out-of-sample predictions.
While MSR with $K = T = 1$, paSB-MLR, and paSB-robit generally perform worse than SVM in terms of the error rates, they use far fewer hyperplanes and hence require significantly less computation for out-of-sample predictions. In summary, MSR, whose upper bound $K$ on the number of active experts and number of layers $T$ per expert can both be adjusted to control its capacity for modeling nonlinearity, achieves a good compromise between accuracy and computational complexity for out-of-sample prediction of multinomial class probabilities, and can be further improved by training an additional MSR on the transformed covariates.

We further measure how well the Gibbs sampler mixes using the effective sample size (ESS), for both paSB-MLR and the Bayesian multinomial logistic regression (Bayes MLR) of Polson et al. (2013). For both algorithms we let $\beta_j \sim N\bigl(0, \mathrm{diag}(\alpha_{j0}^{-1}, \ldots, \alpha_{jV}^{-1})\bigr)$, where $\alpha_{jv} \sim \mathrm{Gamma}(0.001, 1/0.001)$. The ESS (Holmes and Held, 2006) of a parameter or a function of parameters is defined as
$$\mathrm{ESS} = \frac{L}{1 + 2\sum_{h=1}^{\infty} \rho(h)},$$
where $L$ is the number of post-burn-in samples and $\rho(h)$ is the $h$th autocorrelation of the parameter or function of parameters. It describes how quickly an MCMC algorithm generates independent samples. Since the Gibbs sampler of Bayes MLR samples one $\beta_j$ conditioning on all $\beta_{j'}$ for $j' \neq j$, it may induce strong dependencies between different categories and hence slow down the mixing of the Markov chain. By contrast, the $\beta_j$'s are conditionally independent given the augmented variables $b_{ij}$ in paSB-MLR, which may lead to faster mixing. For both Bayes MLR and paSB-MLR, we consider five independent random trials, in each of which we randomly initialize the model parameters, run 10,000 Gibbs sampling iterations, and collect the last 1,000 MCMC samples of $\beta_j$. We use the mcmcse package (Flegal et al., 2016) to estimate the ESS of each $p_{ij}$ in a random trial using the 1,000 collected MCMC samples. For the training set, we calculate the 10% quantile, median, and 90% quantile of the ESSs of all $p_{ij}$ for each random trial, and then report their averages over the five random trials in Table 2. For the testing set, we follow the same steps and report the results in Table 2.

| Data | 10% quantile, training | 10% quantile, testing | Median, training | Median, testing | 90% quantile, training | 90% quantile, testing |
|---|---|---|---|---|---|---|
| square | 411.46 / 194.54 | 421.62 / 256.48 | 913.17 / 948.53 | 858.46 / 952.33 | 924.48 / 973.60 | 927.93 / 976.33 |
| iris | 82.30 / 90.33 | 73.75 / 84.40 | 149.47 / 218.21 | 156.25 / 174.16 | 331.71 / 854.45 | 341.39 / 793.00 |
| wine | 194.41 / 314.41 | 58.41 / 56.30 | 467.73 / 859.67 | 506.66 / 643.90 | 926.55 / 991.46 | 958.12 / 994.77 |
| glass | 70.18 / 162.69 | 67.81 / 138.90 | 137.18 / 335.14 | 122.27 / 329.21 | 359.75 / 686.43 | 347.40 / 615.02 |
| vehicle | 77.71 / 103.30 | 74.05 / 101.64 | 133.66 / 230.83 | 127.22 / 230.48 | 426.44 / 460.07 | 414.48 / 453.87 |
| waveform | 123.77 / 120.01 | 120.99 / 113.94 | 199.00 / 203.38 | 191.77 / 209.61 | 310.84 / 499.05 | 291.96 / 478.59 |
| segment | 114.91 / 104.77 | 95.63 / 91.31 | 281.11 / 294.17 | 270.63 / 355.46 | 742.63 / 844.11 | 752.74 / 814.62 |
| dna | 217.99 / 238.48 | 63.66 / 67.26 | 481.65 / 736.58 | 505.59 / 772.32 | 911.68 / 986.60 | 927.30 / 991.63 |
| satimage | 53.51 / 66.19 | 54.04 / 66.33 | 82.50 / 90.50 | 81.87 / 90.11 | 160.84 / 168.85 | 156.83 / 157.31 |

Table 2: Comparison of the ESS of the conditional class probability between Bayes MLR and paSB-MLR; each cell shows Bayes MLR / paSB-MLR.
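A minimal sketch of how the ESS defined above can be estimated from a single chain of post-burn-in samples; the experiments use the mcmcse R package, and the autocorrelation truncation rule below is an assumption of the sketch.

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = L / (1 + 2 * sum_h rho(h)) for a chain x of length L, truncating the
    autocorrelation sum at the first non-positive estimate (a common convention)."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / L
    if max_lag is None:
        max_lag = L - 1
    acf_sum = 0.0
    for h in range(1, max_lag + 1):
        rho = np.dot(x[:-h], x[h:]) / (L * var)   # lag-h autocorrelation estimate
        if rho <= 0:
            break
        acf_sum += rho
    return L / (1.0 + 2.0 * acf_sum)
```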
While paSB-MLR underperforms Bayes MLR on some of the data sets for the 10% ESS quantile, it consistently outperforms Bayes MLR on all data sets for both the ESS median and the 90% ESS quantile, for both training and testing.

### 6.6 Robustness of paSB-Robit Regression

We use the contaminated vehicle data to demonstrate the robustness of paSB-robit. As discussed by Liu (2004), the heavy-tailed conditional class probability function of robit regression can robustify the decision boundary when there are outliers. We use the vehicle training set as inliers, synthesize outliers that are far from the inliers, combine both as the new training set, and keep the testing set unchanged. We generate different numbers of outliers so that the ratio of outliers to inliers varies from 0, 0.1, 0.2, 0.3, to 0.5, at each of which we randomly simulate 10 different sets of outlier covariates. We provide the details on how we generate outliers in the Appendix. We compare L2-MLR and paSB-robit with $\kappa = 1$ degree of freedom on the contaminated vehicle data. Figure 7 shows the prediction error rate (mean ± standard deviation) on the testing set for different outlier-to-inlier ratios. When there are no outliers, both approaches deliver comparable performance. As the ratio increases, paSB-robit with $\kappa = 1$ more and more clearly outperforms L2-MLR, which justifies the robustness of paSB-robit.

Figure 7: Prediction error rates (%, mean ± standard deviation) for different ratios of outliers to inliers.

## 7. Conclusions

To transform a cross-entropy-loss binary classifier into a Bayesian multinomial regression model and derive efficient Bayesian inference, we develop a permuted and augmented stick-breaking construction. With permutation, we one-to-one map the categories to sticks to escape from poor category-stick mappings that impose restrictive geometric constraints on the decision boundaries, and with augmentation, we link a category outcome to conditionally independent stick-specific covariate-dependent Bernoulli random variables. We illustrate this general framework by extending binary softplus regression, robit regression, and support vector machines into multinomial ones. Experimental results validate our contributions and show that the proposed multinomial softplus regressions achieve a good compromise between interpretability, complexity, and predictability.

## Acknowledgments

The authors would like to thank the editor and two anonymous referees for their insightful and constructive comments and suggestions, and the Texas Advanced Computing Center for computational support.

## Appendix A. Additional Lemma and Proofs

**Proof** [Proof of Theorem 1] The conditional probability of $y_i$ given $\{z_s, \pi_{is}\}_{1,S}$ can be expressed as
$$P(y_i = s \mid \{z_s, \pi_{is}\}_{1,S}) = \sum_{b_{ij}:\, j > z_s} [P(b_{iz_s} = 1)]^{\mathbf{1}(z_s \neq S)} \Bigl[ \prod_{j < z_s} P(b_{ij} = 0) \Bigr] \Bigl[ \prod_{j > z_s} P(b_{ij}) \Bigr] = [P(b_{iz_s} = 1)]^{\mathbf{1}(z_s \neq S)} \Bigl[ \prod_{j < z_s} P(b_{ij} = 0) \Bigr] \Bigl[ \sum_{b_{ij}:\, j > z_s} \prod_{j > z_s} P(b_{ij}) \Bigr],$$
which becomes the same as (4) by applying (5) and $\sum_{b_{ij}:\, j > z_s} \prod_{j > z_s} P(b_{ij}) = 1$.

**Proof** [Proof of Lemma 3] Under the paSB construction, the probability ratio of categories (choices) $s$ and $s+d$ is a function of the stick success probabilities $\pi_{iz_s}, \pi_{iz_{s+1}}, \ldots, \pi_{iz_{s+d}}$. More specifically, assuming without loss of generality that $z_s < z_{s+d}$,
$$\frac{p_{is}(z)}{p_{i(s+d)}(z)} = \frac{(\pi_{iz_s})^{\mathbf{1}(z_s \neq S)}}{(\pi_{iz_{s+d}})^{\mathbf{1}(z_{s+d} \neq S)} \prod_{z_s \le j < z_{s+d}} (1 - \pi_{ij})},$$
which depends on the alternatives mapped to the sticks between $z_s$ and $z_{s+d}$.

**Proof** [Proof of Lemma 4] Since
$$p_{is}(z) = \Bigl[ 1 - \bigl(1 + e^{x_i'\beta_s}\bigr)^{-r_s} \Bigr]^{\mathbf{1}(s \neq S)} \prod_{j < s} \bigl(1 + e^{x_i'\beta_j}\bigr)^{-r_j},$$
and each factor is at most one, the solutions to $p_{is}(z) > p_0$ are bounded by the set of solutions to $\bigl(1 + e^{x_i'\beta_j}\bigr)^{-r_j} > p_0$, $j \in \{1, \ldots, s-1\}$, and $1 - \bigl(1 + e^{x_i'\beta_s}\bigr)^{-r_s} > p_0$, and hence bounded by the convex polytope defined by the set of solutions to the $s$ inequalities
$$x_i'\bigl[(-1)^{\mathbf{1}(j=s)}\beta_j\bigr] < (-1)^{\mathbf{1}(j=s)} \ln\!\left\{ \Bigl[ p_0^{\mathbf{1}(j \neq s)} (1 - p_0)^{\mathbf{1}(j=s)} \Bigr]^{-\frac{1}{r_j}} - 1 \right\}, \quad j \in \{1, \ldots, s\}.$$
**Lemma 5** Without loss of generality, let us assume that the category-stick mapping is fixed at $z = (1, 2, \ldots, S)$. The paSB multinomial logistic model that assigns choice $s \in \{1, \ldots, S\}$ to individual $i$ with probability $p_{is} = (\pi_{is})^{\mathbf{1}(s \neq S)} \prod_{j < s} (1 - \pi_{ij})$, where $\pi_{is} = 1/(1 + e^{-W_{is}})$, can be represented as a latent-utility model in which $y_i = s$ is observed if $U_{is} > \sum_{j > s} U_{ij}$, for $s = 1, \ldots, S$, where the utilities are defined as
$$U_{i1} = U_{i2} + \cdots + U_{iS} + W_{i1} + \varepsilon_{i1}, \qquad U_{is} = \sum_{j > s} U_{ij} + W_{is} + \varepsilon_{is}, \qquad U_{i(S-1)} = W_{i(S-1)} + \varepsilon_{i(S-1)},$$
and the $\varepsilon_{is} \overset{iid}{\sim} \mathrm{Logistic}(0, 1)$ are independent and identically distributed (i.i.d.) random variables following the standard logistic distribution.

**Proof** [Proof of Lemma 5] Note that $P(\varepsilon < x) = 1/(1 + e^{-x})$ if $\varepsilon \sim \mathrm{Logistic}(0, 1)$. First consider the choice of individual $i$ being $y_i = 1$, which happens with probability
$$P(y_i = 1) = P\Bigl( U_{i1} > \sum_{j \neq 1} U_{ij} \Bigr) = P(\varepsilon_{i1} > -W_{i1}) = 1/(1 + e^{-W_{i1}}) = \pi_{i1} = p_{i1}.$$
Then, for $s = 2, \ldots, S-1$,
$$P(y_i = s) = P(y_i = s \mid y_i > s-1)\, P(y_i > s-1) = P\Bigl( U_{is} > \sum_{j > s} U_{ij} \Bigr) \prod_{j \le s-1} P\Bigl( U_{ij} < \sum_{j' > j} U_{ij'} \Bigr) = P(\varepsilon_{is} > -W_{is}) \prod_{j \le s-1} P(\varepsilon_{ij} < -W_{ij}) = \pi_{is} \prod_{j \le s-1} (1 - \pi_{ij}) = p_{is}.$$
Finally, $P(y_i = S) = 1 - \sum_{j < S} P(y_i = j) = \prod_{j < S} (1 - \pi_{ij}) = p_{iS}$.

| | square | iris | wine | glass | vehicle | waveform | segment | dna | satimage |
|---|---|---|---|---|---|---|---|---|---|
| Train size | 294 | 120 | 142 | 171 | 592 | 500 | 231 | 2000 | 4435 |
| Test size | 126 | 30 | 36 | 43 | 254 | 4500 | 2079 | 1186 | 2000 |
| Covariate number | 2 | 4 | 13 | 9 | 18 | 21 | 19 | 180 | 36 |
| Category number | 3 | 3 | 3 | 6 | 4 | 3 | 7 | 3 | 6 |

Table 3: Multi-class classification data sets used in the experiments for model comparison.
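As a final sanity check, a small self-contained simulation (ours, with hypothetical variable names) confirming that sequential stick-specific Bernoulli draws reproduce the stick-breaking category probabilities under the identity mapping $z = (1, \ldots, S)$:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 4
pi = rng.uniform(size=S)                       # stick success probabilities (pi[S-1] unused)

# closed-form category probabilities under the identity mapping z = (1, ..., S)
p = np.array([(pi[s] if s < S - 1 else 1.0) * np.prod(1.0 - pi[:s]) for s in range(S)])

# Monte Carlo: y is the first stick choosing 1, or stick S if the first S-1 all choose 0
draws = np.zeros(S)
for _ in range(200_000):
    y = S - 1
    for s in range(S - 1):
        if rng.random() < pi[s]:
            y = s
            break
    draws[y] += 1
print(np.round(p, 3), np.round(draws / draws.sum(), 3))   # the two vectors should agree closely
```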