# Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression

Journal of Machine Learning Research 18 (2018) 1-33. Submitted 7/17; Revised 12/17; Published 4/18.

Quan Zhang (quan.zhang@mccombs.utexas.edu) and Mingyuan Zhou (mingyuan.zhou@mccombs.utexas.edu), Department of Information, Risk, and Operations Management, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA.

Editor: David M. Blei

Abstract: To model categorical response variables given their covariates, we propose a permuted and augmented stick-breaking (paSB) construction that one-to-one maps the observed categories to randomly permuted latent sticks. This new construction transforms multinomial regression into regression analysis of stick-specific binary random variables that are mutually independent given their covariate-dependent stick success probabilities, which are parameterized by the regression coefficients of their corresponding categories. The paSB construction allows transforming an arbitrary cross-entropy-loss binary classifier into a Bayesian multinomial one. Specifically, we parameterize the negative logarithms of the stick failure probabilities with a family of covariate-dependent softplus functions to construct nonparametric Bayesian multinomial softplus regression, and transform the Bayesian support vector machine (SVM) into a Bayesian multinomial SVM. These Bayesian multinomial regression models are not only capable of providing probability estimates, quantifying uncertainty, increasing robustness, and producing nonlinear classification decision boundaries, but are also amenable to posterior simulation. Example results demonstrate their attractive properties and performance.

Keywords: Discrete choice models, logistic regression, nonlinear classification, softplus regression, support vector machines

## 1. Introduction

Inferring the functional relationship between a categorical response variable and its covariates is a fundamental problem in the physical and social sciences. To address this problem, it is common to use either multinomial logistic regression (MLR) (McFadden, 1973; Greene, 2003; Train, 2009) or multinomial probit regression (Albert and Chib, 1993; McCulloch and Rossi, 1994; McCulloch et al., 2000; Imai and van Dyk, 2005), both of which can be expressed as a latent-utility-maximization model that lets an individual make a decision by comparing its random utilities across all categories at once. In this paper, we address the problem via a new stick-breaking construction of the multinomial distribution, which defines a one-to-one random mapping between the category and stick indices. Rather than assuming an individual compares its random utilities across all categories at once, we assume an individual makes a sequence of stick-specific binary random decisions. The choice of the individual is the category mapped to the stick that is the first to choose 1, or the category mapped to stick $S$ if all of the first $S-1$ sticks choose 0. This framework transforms regression analysis of categorical variables into the problem of inferring the one-to-one mapping between the category and stick indices, and performing regression analysis of binary stick-specific random variables.
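To fix ideas, the following is a minimal sketch (ours, not from the paper) of the decision mechanism just described; the function and variable names are hypothetical.

```python
def choice_from_stick_decisions(stick_decisions, stick_to_category):
    """Return the chosen category given binary decisions of sticks 1..S-1.

    stick_decisions[j] is the 0/1 decision of stick j+1 (S-1 decisions);
    stick_to_category[j] is the category mapped to stick j+1.
    The choice is the category mapped to the first stick that decides 1,
    or the category mapped to stick S if the first S-1 sticks all decide 0.
    """
    for j, b in enumerate(stick_decisions):
        if b == 1:
            return stick_to_category[j]
    return stick_to_category[-1]   # stick S is reached

# toy usage with S = 4: sticks 1..3 decide (0, 1, .), so stick 2's category wins
print(choice_from_stick_decisions([0, 1, 0], stick_to_category=[3, 1, 4, 2]))  # -> 1
```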
Both MLR and the proposed stick-breaking models link a categorical response variable to its covariate-dependent probability parameters. While MLR is invariant to the permutation of category labels, the proposed stick-breaking models, given a fixed category-stick mapping, purposely destroy that invariance. We are motivated to introduce this new framework for discrete choice modeling mainly to facilitate efficient Bayesian inference via data augmentation, to introduce nonlinear decision boundaries, and to relax a well-recognized restrictive model assumption of MLR, as described below.

An important motivation is to extend the efficient Bayesian inference available for binary regression to the multinomial case. In the proposed stick-breaking models, the binary stick-specific random variables of an individual are conditionally independent given their stick-specific covariate-dependent probabilities. Under this setting, one can solve a multinomial regression by solving conditionally independent binary ones. The only requirement is that the underlying binary regression model uses the cross-entropy loss. In other words, we require each stick-specific binary random variable to be linked via the Bernoulli distribution to its corresponding stick-specific covariate-dependent probability parameter. Another important motivation is to improve the model capacity of MLR, which is a linear classifier in the sense that if the total number of categories is $S$, then MLR uses the intersection of $S-1$ linear hyperplanes to separate one class from the others. By choosing nonlinear binary regression models, we are able to enhance the capacities of the proposed stick-breaking models. We are also motivated to relax the independence of irrelevant alternatives (IIA) assumption, an inherent property of MLR that requires the probability ratio of any two choices to be independent of the presence or characteristics of any other alternatives (McFadden, 1973; Greene, 2003; Train, 2009). By contrast, the proposed stick-breaking models make the probability ratio of two choices depend on other alternatives, as long as the two sticks that the choices are mapped to are not next to each other.

In light of these considerations, we first extend the softplus regressions recently proposed in Zhou (2016), a family of cross-entropy-loss binary classifiers that can introduce nonlinear decision boundaries and recover logistic regression as a special case, to construct Bayesian multinomial softplus regressions (MSRs). We then consider a multinomial generalization of the widely used support vector machine (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999; Cristianini and Shawe-Taylor, 2000), a max-margin binary classifier that uses the hinge loss. While there has been significant effort in extending binary SVMs to multinomial ones (Crammer and Singer, 2002; Lee et al., 2004; Liu and Yuan, 2011), the resulting extensions typically only provide predictions of deterministic class labels. By contrast, we extend the Bayesian binary SVMs of Sollich (2002) and Mallick et al. (2005) under the proposed framework to construct Bayesian multinomial SVMs (MSVMs), which naturally provide predictive class probabilities.
We will show that the proposed Bayesian MSRs and MSVMs, which all generalize the stick-breaking construction to perform Bayesian multinomial regression, are not only capable of placing nonlinear decision boundaries between different categories, but are also amenable to posterior simulation via data augmentation. Another attractive feature shared by all these proposed Bayesian algorithms is that they can not only predict class probabilities but also quantify model uncertainty. In addition, we will show that robit regression, a robust cross-entropy-loss binary classifier proposed in Liu (2004), can be extended into a robust Bayesian multinomial classifier under the proposed stick-breaking construction.

The remainder of the paper is organized as follows. In Section 2 we briefly review MLR and discuss the restrictions of its stick-breaking construction. In Section 3 we propose the permuted and augmented stick breaking (paSB) to construct Bayesian multi-class classifiers, present the inference, and show how the IIA assumption is relaxed. Under the paSB framework, we show how to transform softplus regressions and support vector machines into Bayesian multinomial regression models in Sections 4 and 5, respectively. We provide experimental results in Section 6 and conclude the paper in Section 7.

## 2. Multinomial Logistic Regression and Stick Breaking

In this section we first briefly review multinomial logistic regression (MLR). We then use the stick-breaking construction to show how to generate a categorical random variable as a sequence of dependent binary variables, and further discuss a naive approach to transform binary logistic regression under stick breaking into multinomial regression. In the following discussion, we use $i \in \{1, \ldots, N\}$ to index the individual/observation, $s \in \{1, \ldots, S\}$ to index the choice/category, and the prime symbol to denote the transpose operation.

### 2.1 Multinomial Logistic Regression

MLR, which parameterizes the probability of each category given the covariates as
$$P(y_i = s \mid x_i, \{\beta_s\}_{1,S}) = p_{is}, \qquad p_{is} = \frac{e^{x_i'\beta_s}}{\sum_{j=1}^{S} e^{x_i'\beta_j}}, \qquad (1)$$
is widely used, where $x_i \in \mathbb{R}^{P+1}$ consists of $x_{i1} = 1$ and $P$ covariates, and $\beta_s \in \mathbb{R}^{P+1}$ consists of the regression coefficients for the $s$th category (McCullagh and Nelder, 1989; Albert and Chib, 1993; Holmes and Held, 2006). Without loss of generality, one may choose category $S$ as the reference category by setting all the elements of $\beta_S$ to 0, making $e^{x_i'\beta_S} = 1$ almost surely (a.s.). For MLR, if data point $i$ is assigned to the category with the largest $p_{is}$, then one may consider that category $s$ resides within a convex polytope (Grünbaum, 2013), defined by the set of solutions to the $S-1$ inequalities $x'(\beta_j - \beta_s) \le 0$, where $j \in \{1, \ldots, s-1, s+1, \ldots, S\}$.

Despite its popularity, MLR is a linear classifier in the sense that it uses the intersection of $S-1$ linear hyperplanes to separate one class from the others. As a classical discrete choice model in econometrics, it makes the independence of irrelevant alternatives (IIA) assumption, implying that the unobserved factors for choice making are both uncorrelated and have the same variance across all alternatives (McFadden, 1973; Train, 2009). Moreover, while its log-likelihood is convex and there are efficient iterative algorithms to find the maximum likelihood or maximum a posteriori solutions of $\beta_s$, the absence of conjugate priors on $\beta_s$ makes it difficult to derive efficient Bayesian inference.
For Bayesian inference, Polson et al. (2013) have introduced the Pólya-Gamma data augmentation for logit models and combined it with the data augmentation technique of Holmes and Held (2006) for the multinomial likelihood to develop a Gibbs sampling algorithm for MLR. This algorithm, however, has to update $\beta_s$ one at a time while conditioning on all $\beta_j$ for $j \neq s$. Thus it may not only lead to slow convergence and mixing, especially when the number of categories $S$ is large, but also prevent us from parallelizing the sampling of $\{\beta_s\}_{1,S}$ within each MCMC iteration.

### 2.2 Stick Breaking

Suppose $y_i$ is a random variable drawn from a categorical distribution with a finite vector of probability parameters $(p_{i1}, \ldots, p_{iS})$, where $S < \infty$, $p_{is} \ge 0$, and $\sum_{s=1}^{S} p_{is} = 1$. Instead of directly using $y_i \sim \sum_{s=1}^{S} p_{is}\delta_s$, one may consider generating $y_i$ using the multinomial stick-breaking construction that sequentially draws binary random variables $b_{is}$ given $\{b_{ij}\}_{j<s}$, setting $y_i$ to the index of the first stick with $b_{is} = 1$, or to $S$ if the first $S-1$ sticks all equal 0. A naive approach to multinomial regression is to let the stick success probabilities be those of binary logistic regressions, $\pi_{is} = 1/(1 + e^{-x_i'\beta_s})$. Under this logistic stick-breaking construction, however, the category probabilities are subject to geometric constraints; for example, $p_{i1} = (1 + e^{-x_i'\beta_1})^{-1}$ can be larger than 50% only if $x_i'\beta_1 > 0$, while $p_{i2} = (1 + e^{x_i'\beta_1})^{-1}(1 + e^{-x_i'\beta_2})^{-1}$ can be larger than 50% only if both $x_i'\beta_1 < 0$ and $x_i'\beta_2 > 0$. We will use an example to illustrate this type of geometric constraint in Section 6.1.

Under the logistic stick-breaking construction, not only could the performance be sensitive to how the $S$ different categories are ordered, but the imposed geometric constraints could also be overly restrictive even if the categories are appropriately ordered. Below we address the first issue by introducing a permuted and augmented stick-breaking representation for a multinomial model, and the second issue by adding the ability to model nonlinearity.

## 3. Permuted and Augmented Stick Breaking

To turn the seemingly undesirable sensitivity of the stick-breaking construction to label permutation into a favorable model property when label asymmetry is desired, and to mitigate performance degradation when label symmetry is desired, we introduce a permuted and augmented stick-breaking (paSB) construction for a multinomial distribution, making it straightforward to extend an arbitrary binary classifier with cross-entropy loss into a Bayesian multinomial one. The paSB construction infers a one-to-one mapping between the labels of the $S$ categories and the indices of the $S$ latent sticks, transforming the problem from modeling a multinomial random variable into modeling $S$ conditionally independent binary ones. It not only allows for parallel computation within each MCMC iteration, but also improves the mixing of MCMC in comparison to the algorithm used in Polson et al. (2013), which updates one regression-coefficient vector conditioning on all the others, as will be shown in Section 6.5. Note that the number of distinct one-to-one label-stick mappings is $S!$, which quickly becomes too large to exhaustively search for the best mapping as $S$ increases. Our experiments will show that the proposed MCMC algorithm can quickly escape from a purposely poorly initialized mapping and subsequently switch between many different mappings that all lead to similar performance, suggesting an effective search space that is considerably smaller than $S!$.

### 3.1 Category-Stick Mapping and Data Augmentation

The proposed paSB construction randomly maps a category to one and only one of the $S$ latent sticks and makes the augmented Bernoulli random variables $\{b_{is}\}_{1,S}$ conditionally independent of each other given $\{\pi_{is}\}_{1,S}$. Denote $z = (z_1, \ldots, z_S)$ as a permutation of $(1, \ldots, S)$, where $z_s \in \{1, \ldots, S\}$ is the index of the stick that category $s$ is mapped to.
Given the label-stick mapping $z$, let us denote $p_{is}(z)$ as the multinomial probability of category $s$, and $\pi_{iz_s}(x_i, \beta_s)$ as the covariate-dependent stick probability associated with the covariates of observation $i$ and the stick that category $s$ is mapped to. For notational convenience, we will write $\pi_{iz_s}(x_i, \beta_s)$ as $\pi_{iz_s}$ and $\pi_{ij}(x_i, \beta_{s:z_s=j})$ as $\pi_{ij}$. We emphasize that here the $s$th regression-coefficient vector $\beta_s$ is always associated with both category $s$ and the corresponding stick probability $\pi_{iz_s}$, a construction that will facilitate the inference of the label-stick mapping $z$. The following theorem shows how to generate a categorical random variable of $S$ categories with a set of $S$ conditionally independent Bernoulli random variables. This is key to transforming the problem from solving a multinomial regression into solving $S$ binary regressions independently.

**Theorem 1** Suppose $y_i \sim \sum_{s=1}^{S} p_{is}(z)\,\delta_s$, where $[p_{i1}(z), \ldots, p_{iS}(z)]$ is a multinomial probability vector whose elements are constructed as
$$p_{is}(z) = (\pi_{iz_s})^{\mathbf{1}(z_s \neq S)} \prod_{j < z_s} (1 - \pi_{ij}). \qquad (4)$$
Then $y_i$ can be equivalently generated from $S$ stick-specific binary random variables that are conditionally independent given $\{\pi_{ij}\}_{1,S}$,
$$b_{ij} \sim \mathrm{Bernoulli}(\pi_{ij}), \quad j = 1, \ldots, S. \qquad (5)$$

In comparison to the stick-breaking construction in Section 2.2, paSB permutes the sticks via the mapping $z$ and augments the representation with the binary variables $b_{ij}$ for $j > z_{y_i}$, whose conditional posteriors given $y_i$ and $\pi_{ij}$ remain the same as their priors. These changes are key to appropriately ordering the latent sticks, more flexibly parameterizing $\pi_{iz_s}$ and hence $p_{is}(z)$, and maintaining tractable inference. With paSB, the problem of inferring the functional relationship between the categorical response $y_i$ and the corresponding covariates $x_i$ is now transformed into the problem of modeling $S$ conditionally independent binary regressions,
$$b_{iz_s} \mid x_i, \beta_s \sim \mathrm{Bernoulli}[\pi_{iz_s}(x_i, \beta_s)], \quad i = 1, \ldots, N, \; s = 1, \ldots, S.$$
Note that the only requirement for the binary regression model under paSB is that it uses the Bernoulli likelihood. In other words, it uses the cross-entropy loss (Murphy, 2012),
$$-\sum_{i=1}^{N} \ln P(b_{iz_s} \mid x_i, \beta_s) = -\sum_{i=1}^{N} \Bigl\{ b_{iz_s} \ln \pi_{iz_s}(x_i, \beta_s) + (1 - b_{iz_s}) \ln[1 - \pi_{iz_s}(x_i, \beta_s)] \Bigr\}.$$

A basic choice is paSB logistic regression, which lets $\pi_{iz_s}(x_i, \beta_s) = 1/(1 + e^{-x_i'\beta_s})$ and becomes the same as the logistic stick-breaking construction described in Section 2.2 if $z_s = s$ for all $s \in \{1, \ldots, S\}$. Another choice is paSB-robit regression, which extends the robit regression of Liu (2004), a robust binary classifier using the cross-entropy loss, into a robust Bayesian multinomial classifier. In robit regression, observation $i$ is labeled as 1 if $x_i'\beta + \varepsilon_i > 0$ and as 0 otherwise, where the $\varepsilon_i$ are independently drawn from a t-distribution with $\kappa$ degrees of freedom, denoted as $\varepsilon_i \overset{iid}{\sim} t_\kappa$. Consequently, the conditional class probability function of robit regression is $P(y_i = 1 \mid x_i, \beta) = F_\kappa(x_i'\beta)$, where $F_\kappa$ is the cumulative distribution function of $t_\kappa$. The robustness is attributed to the heavy-tail property of $F_\kappa(x_i'\beta)$, which, if $\kappa < 7$, imposes less penalty than the conditional class probability function of logistic regression does on misclassified observations that are far from the decision boundary. Applying Theorem 1, the category probability of paSB-robit regression with $\kappa$ degrees of freedom is given by (4), with $\pi_{iz_s}(x_i, \beta_s) = F_\kappa(x_i'\beta_s)$. paSB-robit regression provides a simple solution to robust multiclass classification: with $\{b_{ij}\}_{i,j}$ defined in Theorem 1, we run independent binary robit regressions using the Gibbs sampler proposed in Liu (2004). In addition to paSB, we define permuted and augmented reverse stick breaking (parSB) in the following corollary.

**Corollary 2** Suppose $y_i \sim \sum_{s=1}^{S} p_{is}(z)\,\delta_s$ and
$$p_{is}(z) = (1 - \pi_{iz_s})^{\mathbf{1}(z_s \neq S)} \prod_{j < z_s} \pi_{ij}.$$
Then $y_i$ can again be equivalently generated from the $S$ conditionally independent Bernoulli random variables in (5).

Under paSB, given $y_i$, $\{\pi_{ij}\}_{1,S}$, and $z$, the augmented binary variables can be sampled as
$$(b_{ij} \mid y_i, \pi_{ij}, z) \sim \mathbf{1}(j = z_{y_i})\,\delta_1 + \mathbf{1}(j < z_{y_i})\,\delta_0 + \mathbf{1}(j > z_{y_i})\,\mathrm{Bernoulli}(\pi_{ij}), \quad j = 1, \ldots, S-1,$$
and $b_{iS} = \mathbf{1}(z_{y_i} = S)$.
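To make the construction concrete, here is a minimal sketch (function and variable names are ours) of computing the category probabilities of Theorem 1 and Corollary 2 from stick probabilities and a category-stick mapping, together with the logistic and robit choices of stick probability.

```python
import numpy as np
from scipy.stats import t as student_t

def logistic_stick_prob(x, beta):
    """pi = 1 / (1 + exp(-x'beta)), the paSB logistic-regression choice."""
    return 1.0 / (1.0 + np.exp(-x @ beta))

def robit_stick_prob(x, beta, kappa=6):
    """pi = F_kappa(x'beta), the paSB-robit choice (Student-t CDF)."""
    return student_t.cdf(x @ beta, df=kappa)

def category_probs(pi, z, reverse=False):
    """Category probabilities p_s(z) from stick probabilities pi[0..S-1]
    (stick S is the reference; pi[S-1] is unused) and a permutation z,
    where z[s] is the 1-based stick index that category s is mapped to.
    reverse=False gives paSB (Theorem 1); reverse=True gives parSB (Corollary 2)."""
    S = len(pi)
    succ = (1.0 - pi) if reverse else pi          # "success" term of a stick
    fail = pi if reverse else (1.0 - pi)          # "failure" term of a stick
    p = np.empty(S)
    for s in range(S):
        m = z[s]                                  # stick index in 1..S
        head = succ[m - 1] if m != S else 1.0     # exponent 1(z_s != S)
        p[s] = head * np.prod(fail[:m - 1])
    return p

# toy check that the probabilities sum to one under a random permutation
rng = np.random.default_rng(0)
pi = rng.uniform(size=5)
z = rng.permutation(5) + 1
print(category_probs(pi, z).sum(), category_probs(pi, z, reverse=True).sum())  # both ~1.0
```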
This means we let $b_{ij} = 0$ if $j < z_{y_i}$, let $b_{ij} = 1$ if $j = z_{y_i}$, draw $b_{ij}$ from $\mathrm{Bernoulli}(\pi_{ij})$ if $z_{y_i} < j < S$, and let $b_{iS} = 1$ if and only if $z_{y_i} = S$. Note that stick $S$ is used as a reference stick and $\pi_{iS}$ is not used in defining $p_{is}(z)$ in (4). Despite having no impact on computing $\{p_{is}\}_{1,S}$, we infer $\pi_{iS}$ (i.e., sample the regression-coefficient vector $\beta_{s: z_s = S}$) under the likelihood $\prod_{i=1}^{N} \mathrm{Bernoulli}(b_{iS}; \pi_{iS})$ and use it in a Metropolis-Hastings step, as described in (8) below, to decide whether to switch the mappings of two different categories when one of them is mapped to the reference stick $S$. Once we have an MCMC sample of $\{b_{ij}\}_{1,S}$, we essentially solve $S$ binary classification problems independently, the $j$th of which can be expressed as $b_{ij} \mid x_i, \beta_{s:z_s=j} \sim \mathrm{Bernoulli}[\pi_{ij}(x_i, \beta_{s:z_s=j})]$.

Analogously, for parSB, $\{b_{ij}\}_{1,S}$ can be sampled as $(b_{ij} \mid y_i, \pi_{ij}, z) \sim \mathbf{1}(j < z_{y_i})\,\delta_1 + \mathbf{1}(j = z_{y_i})\,\delta_0 + \mathbf{1}(j > z_{y_i})\,\mathrm{Bernoulli}(\pi_{ij})$ for $j = 1, \ldots, S-1$, and $b_{iS} = 1 - \mathbf{1}(z_{y_i} = S)$, which means we let $b_{ij} = 1$ if $j < z_{y_i}$, let $b_{ij} = 0$ if $j = z_{y_i}$, draw $b_{ij}$ from $\mathrm{Bernoulli}(\pi_{ij})$ if $z_{y_i} < j < S$, and let $b_{iS} = 0$ if and only if $z_{y_i} = S$.

Since stick-breaking multinomial classification is not invariant to the permutation of its class labels, it may perform substantially worse than it could if the inherent geometric constraints implied by the current ordering of the labels make it difficult to adapt the decision boundaries to the data. Our solution to this problem is to infer the one-to-one mapping between the category labels and stick indices from the data. We construct a Metropolis-Hastings (MH) step within each Gibbs sampling iteration, with a proposal that switches the two sticks that categories $c$ and $c'$, $1 \le c < c' \le S$, are mapped to, by changing the current category-stick one-to-one mapping from $z = (z_1, \ldots, z_c, \ldots, z_{c'}, \ldots, z_S)$ to $z' = (z'_1, \ldots, z'_S) := (z_1, \ldots, z_{c'}, \ldots, z_c, \ldots, z_S)$. Assuming a uniform prior on $z$ and proposing $(c, c')$ uniformly at random from one of the $\binom{S}{2} = S(S-1)/2$ possibilities, we accept the proposal with probability
$$\min\left\{ \frac{\prod_{i=1}^{N}\prod_{s=1}^{S} [p_{is}(z')]^{\mathbf{1}(y_i = s)}}{\prod_{i=1}^{N}\prod_{s=1}^{S} [p_{is}(z)]^{\mathbf{1}(y_i = s)}},\; 1 \right\}, \qquad (8)$$
where each $p_{is}(\cdot)$ is computed via (4). The paSB construction also relaxes the IIA assumption of MLR: as shown in the proof of Lemma 3 in Appendix A, the probability ratio of two choices is a function of the stick probabilities of all the sticks between (and including) the two sticks they are mapped to, and hence depends on other alternatives whenever those two sticks are not adjacent.

## 4. Multinomial Softplus Regression

Under paSB, we construct multinomial softplus regression (MSR) by using a sum-stack-softplus regression of Zhou (2016) as each stick-specific binary classifier. Each category $s$ is equipped with a gamma process of experts, indexed by $k$, with expert weights $r_{sk} > 0$ and regression-coefficient vectors $\beta^{(t)}_{sk} \in \mathbb{R}^{P+1}$ for layers $t = 2, \ldots, T+1$. Given the covariate vector $x_i$ and category-stick mapping $z$, MSR parameterizes $p_{is}$, the multinomial probability of category $s$, under the paSB construction as
$$p_{is}(z) = \left\{ 1 - \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{sk}} \ln\!\Bigl( 1 + e^{x_i'\beta^{(T)}_{sk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{sk}} ) \bigr) \Bigr) \right]^{-r_{sk}} \right\}^{\mathbf{1}(z_s \neq S)} \prod_{j: z_j < z_s}\, \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{jk}} \ln\!\Bigl( 1 + e^{x_i'\beta^{(T)}_{jk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{jk}} ) \bigr) \Bigr) \right]^{-r_{jk}} \qquad (9)$$
and parameterizes $p_{is}$ under the parSB construction as
$$p_{is}(z) = \left\{ \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{sk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{sk}} ) \bigr) \right]^{-r_{sk}} \right\}^{\mathbf{1}(z_s \neq S)} \prod_{j: z_j < z_s} \left\{ 1 - \prod_{k=1}^{K} \left[ 1 + e^{x_i'\beta^{(T+1)}_{jk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{jk}} ) \bigr) \right]^{-r_{jk}} \right\}.$$
Equivalently, the stick success probability of category $s$ can be written as
$$\pi_{iz_s} = 1 - \exp\left\{ -\sum_{k=1}^{K} r_{sk} \ln\!\left[ 1 + e^{x_i'\beta^{(T+1)}_{sk}} \ln\bigl( 1 + \cdots \ln( 1 + e^{x_i'\beta^{(2)}_{sk}} ) \bigr) \right] \right\}. \qquad (10)$$

For convenience of implementation, we truncate the number of atoms of the gamma process at $K$ by choosing a discrete base measure for each category, $G_{s0} = \sum_{k=1}^{K} \frac{\gamma_{s0}}{K}\, \delta_{\beta^{(2:T+1)}_{sk}}$, under which we have $r_{sk} \sim \mathrm{Gamma}(\gamma_{s0}/K, 1/c_{s0})$ as the prior distribution for the weight of expert $k$ in category $s$. For each category, we expect only some of its $K$ experts to have non-negligible weights if $K$ is set large enough, and we may use $\sum_{k} \mathbf{1}\bigl( \sum_{i} m^{(1)}_{isk} > 0 \bigr)$, where $m^{(1)}_{isk}$ is defined in (11), to measure the number of active experts inferred from the data.
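As a concrete reference, the following minimal sketch (helper names are ours) evaluates the sum-stack-softplus stick success probability in (9)-(10) for a single observation.

```python
import numpy as np

def stack_term(x, betas):
    """betas = [beta^(2), ..., beta^(T+1)] for one expert (T = len(betas) layers)."""
    if len(betas) == 1:                       # T = 1: plain softplus argument
        return 1.0 + np.exp(x @ betas[0])
    a = np.log1p(np.exp(x @ betas[0]))        # ln(1 + e^{x'beta^(2)})
    for beta in betas[1:-1]:
        a = np.log1p(np.exp(x @ beta) * a)    # ln(1 + e^{x'beta^(t)} a), t = 3..T
    return 1.0 + np.exp(x @ betas[-1]) * a    # outermost layer, t = T + 1

def msr_stick_prob(x, expert_betas, r):
    """Stick success probability pi = 1 - prod_k stack_term_k^{-r_k}, as in (9)-(10);
    expert_betas is a length-K list of per-expert beta lists, r the expert weights."""
    log_fail = -sum(rk * np.log(stack_term(x, betas))
                    for rk, betas in zip(r, expert_betas))
    return 1.0 - np.exp(log_fail)
```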
### 4.2 Geometric Constraints for MSR

Since by definition we have $p_{is}(z) = \pi_{iz_s}\bigl(1 - \sum_{j: z_j < z_s} p_{ij}(z)\bigr)$ for $z_s \neq S$, the multinomial probability of category $s$ is constrained by the probabilities of the categories mapped to the sticks broken before stick $z_s$, regardless of what the stick probabilities of the sticks indexed above $z_s$ are. To motivate the use of the seemingly over-parameterized sum-stack-softplus function in (9), we first consider the simplest case of $K = T = 1$. Without loss of generality, let us assume that the category-stick mapping is fixed at $z = (1, \ldots, S)$.

**Lemma 4** For paSB-MSR with $K = T = 1$ and $z = (1, \ldots, S)$, the set of solutions to $p_{is}(z) > p_0$ in the covariate space is bounded by a convex polytope defined by the intersection of $s$ linear hyperplanes.

Note that binary softplus regression with $K = T = 1$ is closely related to logistic regression, and reduces to logistic regression if $r = 1$ (Zhou, 2016). With Lemma 4, it is clear that even if an optimal category-stick mapping $z$ is provided, paSB-MSR with $K = T = 1$ may still clearly underperform MLR. This is because category $s$ uses a single hyperplane to separate itself from the remaining $S - s$ categories, and hence uses the intersection of at most $s$ hyperplanes to separate itself from the other $S - 1$ categories. By contrast, MLR uses a convex polytope bounded by at most $S - 1$ hyperplanes for each of the $S$ categories.

When $K > 1$ and/or $T > 1$, an exact theoretical analysis is beyond the scope of this paper. Instead, we provide some qualitative analysis by borrowing the related geometric-constraint analysis for softplus regressions in Zhou (2016). Note that Equation (10) indicates that a noisy-or model (Pearl, 2014; Srinivas, 1993), commonly appearing in causal inference, is used at each step of the sequential one-vs-remaining decision process; at each step, the binary outcome of an observation is attributed to the disjunctive interaction of many possible hidden causes. Roughly speaking, to enclose category $s$ and separate it from the remaining $S - s$ categories in the covariate space, paSB-MSR with $K > 1$ and $T = 1$ uses the complement of a convex-polytope-bounded space, paSB-MSR with $K = 1$ and $T > 1$ uses a convex-polytope-like confined space, and paSB-MSR with both $K > 1$ and $T > 1$ uses a union of convex-polytope-like confined spaces. For parSB-MSR with $K + T > 1$, the interpretation is the same except that a convex polytope in paSB is replaced with the complement of a convex polytope, and vice versa.

In contrast to SVMs using the kernel trick, MSRs using the original covariates might be more appealing in research areas, such as biostatistics and sociology, where the interpretation of regression coefficients and the investigation of causal relationships are of interest. In addition, we find that the classification capability of MSRs can be further enhanced with data transformation, as will be discussed in Section 6.4.

## 5. Bayesian Multinomial Support Vector Machine

Support vector machines (SVMs) are max-margin binary classifiers that typically minimize a regularized hinge-loss objective function,
$$\sum_{i=1}^{N} \max(1 - b_i x_i'\beta, 0) + \nu R(\beta),$$
where $b_i \in \{-1, 1\}$ represents the binary label of the $i$th observation, $R(\beta)$ is a regularization function that is often set as the $L_1$ or $L_2$ norm of $\beta$, $\nu$ is a tuning parameter, and $x_i'$ is the $i$th row of the design matrix $X = (x_1, \ldots, x_N)'$. For linear SVMs, $x_i$ is the covariate vector of the $i$th observation, whereas for nonlinear SVMs, one typically sets the $(i,j)$th element of $X$ as the kernel distance between the covariate vector of the $i$th observation and the $j$th support vector. The decision boundary of a binary SVM is $\{x : x'\beta = 0\}$ and an observation is assigned the label $y_i = \mathrm{sign}(x'\beta)$, which means $b_i = 1$ if $x'\beta \ge 0$ and $b_i = -1$ if $x'\beta < 0$.
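For concreteness, a minimal sketch (ours) of the regularized hinge-loss objective and the deterministic SVM label rule just described, assuming an L2 penalty for $R(\beta)$:

```python
import numpy as np

def hinge_objective(beta, X, b, nu):
    """sum_i max(1 - b_i x_i'beta, 0) + nu * ||beta||_2^2, labels b_i in {-1, +1}."""
    margins = 1.0 - b * (X @ beta)
    return np.maximum(margins, 0.0).sum() + nu * np.dot(beta, beta)

def svm_predict(beta, X):
    """Deterministic SVM labels sign(x'beta), returned in {-1, +1}."""
    return np.where(X @ beta >= 0, 1, -1)
```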
### 5.1 Bayesian Binary SVMs

It is shown in Polson and Scott (2011) that the exponential of the negative hinge loss can be expressed as a location-scale mixture of normals,
$$L(b_i \mid x_i, \beta) = \exp\{-2\max(1 - b_i x_i'\beta, 0)\} = \int_0^\infty \frac{1}{\sqrt{2\pi\omega_i}} \exp\left\{ -\frac{(1 + \omega_i - b_i x_i'\beta)^2}{2\omega_i} \right\} d\omega_i.$$
Consequently, $L(b \mid X, \beta) = \prod_i L(b_i \mid x_i, \beta) = \exp\{-2\sum_i \max(1 - b_i x_i'\beta, 0)\}$ can be regarded as a pseudo-likelihood in the sense that it is unnormalized with respect to $b = (b_1, \ldots, b_N)' \in \{-1, 1\}^N$. This location-scale normal mixture representation of the hinge loss allows developing closed-form Gibbs sampling update equations for the regression coefficients $\beta$ via data augmentation, as discussed in detail in Polson and Scott (2011) and further generalized in Henao et al. (2014) to construct nonlinear SVMs amenable to Bayesian inference.

While data augmentation has made it feasible to develop Bayesian inference for SVMs, it has not addressed a common issue: SVMs provide predictions of deterministic class labels but not class probabilities. For this reason, below we discuss how to allow SVMs to predict class probabilities while maintaining tractable Bayesian inference via data augmentation. Following Sollich (2002) and Mallick et al. (2005), by defining the joint distribution of $\beta$ and $\{x_i\}_i$ to be proportional to $\prod_i [L(1 \mid x_i, \beta) + L(-1 \mid x_i, \beta)]$, one may define the conditional distribution of the binary label $b_i \in \{-1, 1\}$ as
$$P(b_i \mid x_i, \beta) = \begin{cases} \dfrac{1}{1 + e^{-2 b_i x_i'\beta}}, & \text{for } |x_i'\beta| \le 1, \\[2mm] \dfrac{1}{1 + e^{-b_i[x_i'\beta + \mathrm{sign}(x_i'\beta)]}}, & \text{for } |x_i'\beta| > 1, \end{cases} \qquad (12)$$
which defines a probabilistic inference model that has the same maximum a posteriori (MAP) solution as that of a binary SVM for a given data set. Note that for MAP inference, the penalty term $\nu R(\beta)$ of the regularized hinge loss can be related to a corresponding prior distribution imposed on $\beta$, such as Gaussian, Laplace, and spike-and-slab priors (Polson and Scott, 2011).

### 5.2 paSB Multinomial Support Vector Machine

Generalizing previous work on constructing Bayesian binary SVMs, we propose a multinomial SVM (MSVM) under the paSB framework that is distinct from previously proposed MSVMs (Crammer and Singer, 2002; Lee et al., 2004; Liu and Yuan, 2011). A Bayesian MSVM that predicts class probabilities has also been proposed in Zhang and Jordan (2006), which, however, does not have a data augmentation scheme to sample the regression coefficients in closed form and consequently relies on a random-walk Metropolis-Hastings procedure that may be difficult to tune. Redefining the label sample space from $b_i \in \{-1, 1\}$ to $b_i \in \{0, 1\}$, we may rewrite (12) as $b_i \mid x_i, \beta \sim \mathrm{Bernoulli}[\pi_{i,\mathrm{svm}}(x_i, \beta)]$, where
$$\pi_{i,\mathrm{svm}}(x_i, \beta) = \begin{cases} \dfrac{1}{1 + e^{-2 x_i'\beta}}, & \text{for } |x_i'\beta| \le 1, \\[2mm] \dfrac{1}{1 + e^{-x_i'\beta - \mathrm{sign}(x_i'\beta)}}, & \text{for } |x_i'\beta| > 1. \end{cases} \qquad (13)$$
The Bernoulli-likelihood-based cross-entropy-loss binary classifier, whose covariate-dependent probabilities are parameterized as in (13), is exactly what we need to extend the binary SVM into a multinomial classifier under the paSB construction introduced in Theorem 1.
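A minimal sketch (ours) of the stick success probability in (13); the function name is hypothetical.

```python
import numpy as np

def svm_stick_prob(x, beta):
    """Bernoulli success probability of the Bayesian SVM classifier in (13),
    with the labels recoded to {0, 1}."""
    m = x @ beta
    if np.abs(m) <= 1.0:
        return 1.0 / (1.0 + np.exp(-2.0 * m))
    return 1.0 / (1.0 + np.exp(-m - np.sign(m)))
```

Note that the two branches agree at $|x_i'\beta| = 1$, so the probability is a continuous function of the margin.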
More specifically, given the category-stick mapping $z$, with the success probability of the stick that category $s$ is mapped to parameterized as $\pi_{iz_s,\mathrm{svm}}(x_i, \beta_s)$ and the binary stick variables drawn as $b_{iz_s} \sim \mathrm{Bernoulli}[\pi_{iz_s,\mathrm{svm}}(x_i, \beta_s)]$, we have the following definition.

**Definition 2 (paSB multinomial SVM)** Under the paSB construction, given the covariate vector $x_i$ and category-stick mapping $z$, the multinomial support vector machine (MSVM) parameterizes $p_{is}$, the multinomial probability of category $s$, as
$$p_{is}(z) = [\pi_{iz_s,\mathrm{svm}}(x_i, \beta_s)]^{\mathbf{1}(z_s \neq S)} \prod_{j: z_j < z_s} [1 - \pi_{iz_j,\mathrm{svm}}(x_i, \beta_j)].$$

Posterior simulation for paSB-MSVM then proceeds by combining the data augmentation of Section 3.1 with the location-scale normal mixture representation of Section 5.1 for each stick-specific binary SVM.

## 6. Example Results

### 6.1 Illustration of Geometric Constraints

We first illustrate the geometric constraints discussed in Sections 2.2 and 4.2 on a two-dimensional version of the iris data, fixing the category-stick mapping at $z = (1, 2, 3)$ and varying the MSR capacity parameters $K$ and $T$ (Figure 1). Setting $T > 1$ allows using a single (if $K = 1$, as in the second row of Figure 1) or a union (if $K > 1$, as in the fourth row) of convex-polytope-like confined spaces to separate one category from the others (by enclosing the positively labeled observations in each stick-specific binary classification task). The results in Figure 1 show that even when an unoptimized category-stick mapping, which is unfavorable to MSR with small $K$ and/or $T$, is enforced, empowering each stick-specific binary regression model with a higher capacity (using larger $K$ and/or $T$) can still allow MSR to achieve excellent separations. It is also simple to show that for the data set in Figure 1, even if one chooses low-capacity stick-specific binary regression models by setting $T = 1$, one can still achieve good performance with MSR if the category-stick mapping is set as $z = (2, 1, 3)$, $z = (3, 1, 2)$, $z = (2, 3, 1)$, or $z = (3, 2, 1)$. That is to say, as long as it is not category 1 (blue points) that is mapped to stick 1, MSR with $T = 1$ is able to provide satisfactory performance.

Figure 1: Log-likelihood plots and predictive probability heat maps for the 2-D iris data with a fixed category-stick mapping $z = (1, 2, 3)$. Blue, black, and gray points are labeled as categories 1, 2, and 3, respectively. The four rows correspond to $(K, T) = (1, 1)$, $(1, 3)$, $(5, 1)$, and $(5, 3)$. The log-likelihood plots are shown in Column 1, and the predictive probability heat maps of categories 1 (blue), 2 (black), and 3 (gray) are shown in Columns 2, 3, and 4, respectively.

### 6.2 Influence of the Category-Stick Mapping and its Inference

The iris data set in Figure 1 provides an instructive example showing not only the importance of increasing the model capacity if a poor category-stick mapping is imposed, but also the importance of optimizing the category-stick mapping if the capacities of the stick-specific binary regression models are limited. To further illustrate the benefits of inferring an appropriate category-stick mapping $z$, we consider the square data set shown in Figure 2.

We show that for MSR, even if both $K$ and $T$ are sufficiently large to allow each stick-specific binary regression model to have a high enough capacity, whether an optimal category-stick mapping is selected may still clearly matter for performance. As shown in the first three rows of Figure 2, with $K = T = 10$, three different $z$'s are considered and $z = (1, 2, 3)$ (shown in the first row) is found to perform the best. As shown in the fourth row, we sample $z$ using (8) within each MCMC iteration and achieve a result that appears as good as fixing $z = (1, 2, 3)$. In fact, we find that the inferred mappings switch between $z = (1, 2, 3)$ and $z = (1, 3, 2)$ during the MCMC iterations, indicating that the Markov chain is mixing well. These results suggest the importance of both learning the mapping $z$ from the data and allowing the stick-specific binary classifiers to have enough capacity to model nonlinear classification decision boundaries.
When sampling $z = (z_1, \ldots, z_S)$, the permutation that the $S$ categories are mapped to, although the number of permutations of $(1, \ldots, S)$, namely $S!$, can become enormous as $S$ increases, the effective search space can be much smaller if many different mappings imply similar likelihoods and if the extremely poor mappings can be easily avoided. Rather than searching for the best mapping, the proposed MH step, which proposes two indices $z_j$ and $z_{j'}$ to switch in each iteration, is a simple but effective strategy to escape from the mappings that lead to poor fits. Note that the probability of a given $z_j$ not being proposed to switch after $t$ MCMC iterations is $[(S-2)/S]^t$. Even if $S$ is as large as 100, this probability is less than $10^{-8}$ at $t = 1000$. Also note that the iteration at which $z_j$ is proposed to switch for the first time follows a geometric distribution with success probability $2/S$. Thus $S/2$ is the expected number of iterations for a given $z_j$ to be proposed to switch once.

Figure 2: Log-likelihood plots and predictive probability heat maps for the square data with $K = T = 10$. The blue, black, and gray points are labeled as categories 1, 2, and 3, respectively. We fix the category-stick mapping as $z = (1, 2, 3)$ for Row 1, $(2, 1, 3)$ for Row 2, and $(3, 1, 2)$ for Row 3, and sample $z$ for Row 4. The log-likelihood plots are shown in Column 1, and the predictive probability heat maps of categories 1 (blue), 2 (black), and 3 (gray) are shown in Columns 2, 3, and 4, respectively.

To demonstrate the efficiency of our permutation scheme, we construct square101, a synthetic two-dimensional data set consisting of 101 categories. We generate 8000 data points that are distributed uniformly at random within the 12 × 12 spatial region occupied by all 101 categories. The decision boundaries of the different classes are displayed in Figure 3(a), where the data points placed within the outside square frame, whose outer and inner dimensions are 12 and 10, respectively, are assigned to category 1, and those placed within the $s$th unit square inside the square frame, where $s \in \{2, \ldots, 101\}$, are assigned to category $s$.

Although it is almost impossible to search for the best category-stick mapping $z$, the one giving rise to the highest likelihood among all $101! \approx 10^{160}$ possible mappings, we show that our permutation scheme is very effective in escaping from poor mappings, leading to a performance that is comparable to the best of those obtained with pre-fixed suboptimal mappings. More specifically, applying the analysis in Section 4.2 to Figure 3(a), we expect an aSB-MSR to perform well under a fixed suboptimal category-stick mapping $z$ in which $z_1 = 1$, which means the outside square frame is mapped to stick 1, and the squares closer to the inner boundary of the square frame are mapped to the sticks broken at earlier stages; the mapping $z = (1, 2, \ldots, 101)$ is one such example. In other words, we first separate the frame from all the other squares, and then sequentially separate the squares from the remaining ones; the closer a square is to the frame, the earlier it is separated. The total number of suboptimal mappings constructed in this manner is as large as $36!\,28!\,20!\,12!\,4! \approx 10^{99.5}$. First, we generate uniformly at random 3600 different suboptimal mappings under this construction, run aSB-MSR with $K = T = 4$, and plot the histogram of the 3600 log-likelihoods in Figure 3(b). Second, we start from 3600 randomly initialized $z$, run paSB-MSR with $K = T = 4$, and also plot the histogram of the 3600 log-likelihoods in Figure 3(b). For each run, we choose 20,000 MCMC iterations and collect the last 1000 MCMC samples.
Each log-likelihood is averaged over those of the corresponding model's collected MCMC samples. As shown in Figure 3(b), the log-likelihood from a paSB-MSR is in general clearly larger than that of an aSB-MSR with a fixed suboptimal $z$, and there is little overlap between their corresponding histograms. Further examining the 3600 $z$'s inferred by paSB at its last MCMC iteration shows that 3482 of them have $z_1 = 1$ and all of them have $z_1 \le 5$. Suppose $z_1 \notin \{1, 2, 3, 4, 5\}$ at the current iteration, which means category 1 is mapped to none of the first five sticks; then the probability of not only selecting stick $z_1$, but also switching it with one of the first five sticks in the MH proposal, is $\frac{1}{101} \cdot \frac{5}{100}$. Thus the probability that category 1 has never been proposed to be mapped to one of the first five sticks after $t$ iterations is $[1 - 5/(101 \cdot 100)]^t$, which becomes as small as 0.005% at $t = 20{,}000$, demonstrating the effectiveness of our permutation scheme in dealing with a large number of categories. Note that we have also tried 3600 runs of aSB-MSR, each provided with a randomly initialized $z$. The log-likelihoods, however, are all far below −4000 and hence are not included for comparison. This phenomenon is not surprising, as the probability for a randomly initialized $z$ to be suboptimal is as tiny as $36!\,28!\,20!\,12!\,4!/101! \approx 10^{-60.5}$.

Figure 3: (a) Illustration of the square101 data and (b) log-likelihood histograms, by aSB-MSR with 3600 random suboptimal category-stick mappings and by paSB-MSR with 3600 randomly initialized category-stick mappings.

Figure 4 empirically demonstrates the effectiveness of permuting $z$ on the satimage data set, using MSRs with $K = 5$, $T = 3$, and $z$ fixed at each of the $6! = 720$ possible one-to-one category-stick mappings. Panels (a) and (b) show the log-likelihood histograms for MSRs constructed under augmented SB (aSB) and augmented reversed SB (arSB), respectively. Both histograms are clearly left skewed, indicating that under both aSB and arSB, only a small proportion of the 720 different category-stick mappings lead to very poor fits. The blue vertical lines at −1203.82 in (a) and −1350.21 in (b) are the log-likelihoods achieved by paSB and parSB, respectively, in both of which the category-stick mapping $z$ is updated by an MH step in each MCMC iteration. Only 20 (97) out of the 720 aSB-MSRs (arSB-MSRs) have a higher likelihood than paSB-MSR (parSB-MSR).

Figure 4: Log-likelihood histograms for MSRs using all 720 possible category-stick mappings, constructed under (a) augmented stick breaking (aSB) and (b) augmented and reversed stick breaking (arSB). The blue lines in (a) and (b) correspond to the log-likelihoods of paSB-MSR and parSB-MSR, respectively.

Since in the stick-breaking construction the binary classifier that separates a category mapped to a smaller-indexed stick from the others utilizes fewer constraints, the classification can be poor if the complexity of the decision boundary goes beyond the nonlinear modeling capacity of the binary classifier. However, even with a low-capacity binary classifier, the performance can be significantly improved if that difficult-to-separate category is mapped to a larger-indexed stick, for which there are fewer categories left to be separated in its one-vs-remaining binary classification problem.
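The two proposal probabilities quoted above can be verified with a couple of lines (a sketch using the formulas as stated):

```python
# Probability a given z_j is never proposed to switch in t iterations: [(S-2)/S]^t.
S = 100
print(((S - 2) / S) ** 1000)                  # ~1.7e-9, indeed below 1e-8

# Probability category 1 is never proposed to map to one of the first five sticks
# of square101 after t iterations: [1 - 5/(S(S-1))]^t with S = 101.
S = 101
print((1 - 5 / (S * (S - 1))) ** 20000)       # ~5e-5, i.e. about 0.005%
```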
Examining the $z$'s associated with the 100 lowest log-likelihoods in Figure 4, we find 51 mappings belonging to the set $\{z : z_5 = 1 \text{ or } z_6 = 1\}$ under aSB, and 77 belonging to $\{z : z_3 = 1 \text{ or } z_6 = 1\}$ under arSB. This suggests that separating category 5 or 6 (category 3 or 6) from all the other categories might be beyond the capacity of a binary softplus regression with $K = 5$ and $T = 3$ under the aSB (arSB) construction. But if the sticks associated with these categories are broken at late stages, we only need to separate them from fewer remaining categories, which can be much easier. We have further examined the other 620 arrangements and found no evident patterns. These observations suggest that the effective search space of the mapping $z$ is considerably smaller than $S!$, and that the proposed MH step is effective in escaping from poor category-stick mappings.

In paSB-MSVM, we use a Gaussian radial basis function kernel, whose kernel width is cross validated from a set of predefined candidates. We find its performance to be sensitive to the setting of the kernel width, which is a common issue for SVMs (Cherkassky and Ma, 2004; Soares et al., 2004; Chang et al., 2005). If an appropriate kernel width can be identified through cross validation, we find that learning the mapping $z$ becomes less important for paSB-MSVM to perform well. However, if the kernel width is not well selected, which can happen if all candidate kernel widths are far from the optimal value, the binary classifier for each category may not have enough capacity for nonlinear classification, and the learning of the category-stick mapping $z$ can then become important.

### 6.3 Turning Off Unneeded Model Capacities

While one can adjust both $K$ and $T$ to control the capacity of binary softplus regression, for MSR the total number of experts $K$ is a truncation level that can be set as large as permitted by the computation budget. This is because the truncated gamma process used by each stick-specific binary softplus regression shrinks the weights of unnecessary experts toward zero. Figure 5 shows, in decreasing order, the inferred weights of the experts belonging to each of the 3 categories of the square data set. These weights are inferred by MSR with $K = T = 10$ and the learning of $z$, as in the fourth row of Figure 2. It is clear from Figure 5 that only a small number of experts are inferred with non-negligible weights in the posterior, and the number of active experts and their weights indicate the complexity of the corresponding classification decision boundaries shown in the fourth row of Figure 2.

Figure 5: Inferred expert weights $r_k$ in descending order for each category of the square data with $K = T = 10$.
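A small sketch of summarizing the shrinkage just described: the paper counts an expert as active when it has at least one nonzero latent count $m^{(1)}_{isk}$ (see (11)); thresholding the posterior expert weights, as done here, is a simpler proxy and an assumption of this sketch.

```python
import numpy as np

def count_active_experts(r, tol=1e-3):
    """Count, per category, the experts whose inferred weights r[s, k] exceed tol.
    r is an (S, K) array of posterior mean expert weights."""
    return (r > tol).sum(axis=1)
```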
We note that while $T$ is a parameter to be set by the user, we find that increasing it increases model capacity without clear signs of overfitting on any of the data sets considered here.

### 6.4 MSR with Data Transformation

Kernel SVMs transform the data to make different categories more linearly separable in the transformed covariate space. While kernel SVMs may provide high nonlinear modeling capacity, their performance can be sensitive to the kernel width, which often needs to be cross validated, and their number of support vectors often increases linearly in the size of the training set. By contrast, MSRs rely on the interactions of linear hyperplanes to construct nonlinear decision boundaries, as discussed in Section 4.2, and hence may have insufficient capacity for highly complex nonlinearity. However, we may simply stack another MSR on a previously trained MSR to quickly enhance its nonlinear modeling capacity. In particular, we may first run an MSR to obtain a finite set of hyperplanes denoted by $\{\hat\beta^{(t+1)}_{jk}\}$. We may then augment the original covariate vector $x_i$ as
$$\tilde{x}_i := \Bigl[ x_i',\; \log\bigl(1 + e^{x_i'\hat\beta^{(2)}_{11}}\bigr),\; \ldots,\; \log\bigl(1 + e^{x_i'\hat\beta^{(t+1)}_{jk}}\bigr),\; \ldots,\; \log\bigl(1 + e^{x_i'\hat\beta^{(T+1)}_{SK}}\bigr) \Bigr]' \qquad (14)$$
and run another MSR with the transformed covariates $\tilde{x}_i$. For illustration, we show the efficacy of this data-transformation strategy on a 2-D swiss roll data set in Figure 6. The first row shows the results of MSR with $K = 5$ and $T = 3$, using the original covariates $x_i$, while the second row shows MSR with $K = 5$ and $T = 1$, using the transformed covariates $\tilde{x}_i$ defined by (14), where the regression-coefficient vectors $\hat\beta^{(t+1)}_{jk}$ are learned using the MSR illustrated in the first row. It is evident that the classification is greatly improved in terms of both training log-likelihood and out-of-sample predictions.

Figure 6: First row: classification of a 2-D swiss roll data set by paSB-MSR with $K = 5$, $T = 3$, using the original covariates. Second row: paSB-MSR with $K = 5$, $T = 1$ trained on the covariates transformed via the paSB-MSR used in the first row. In each row, the left column plots the log-likelihood against the MCMC iteration, and the middle and right columns show the predictive probability heat maps for Category 1 (black points) and Category 2 (blue points), respectively.
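A small sketch of the covariate transformation in (14), assuming the hyperplanes from the first-stage MSR are stacked into a single array (names are ours):

```python
import numpy as np

def transform_covariates(X, beta_hats):
    """Augment each covariate vector with softplus features log(1 + exp(x'beta))
    for every hyperplane learned by a previously trained MSR, as in (14).
    X is (N, P+1); beta_hats is an (H, P+1) array stacking all beta^(t+1)_{jk}."""
    features = np.log1p(np.exp(X @ beta_hats.T))   # (N, H) softplus features
    return np.hstack([X, features])                # transformed design matrix
```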
### 6.5 Results on Benchmark Data Sets

To further evaluate the performance of the proposed paSB multinomial regression models, we consider paSB multinomial logistic regression (paSB-MLR), paSB multinomial robit regression with $\kappa = 6$ degrees of freedom (paSB-robit), paSB multinomial support vector machine (paSB-MSVM), and MSRs. We compare their performance with those of L2-regularized multinomial logistic regression (L2-MLR), the support vector machine (SVM), and the adaptive multi-hyperplane machine (AMM), and consider the following benchmark multi-class classification data sets: iris, wine, glass, vehicle, waveform, segment, dna, and satimage. We also include the synthetic square data shown in Figure 2 for comparison. For SVM we use the LIBSVM package, which trains $S(S-1)/2$ one-vs-one binary classifiers and makes predictions using majority voting (Chang and Lin, 2011). We run LIBSVM in R with the package e1071 (Meyer et al., 2015). We consider MSRs with $(K, T)$ set to $(1, 1)$, $(1, 3)$, $(5, 1)$, and $(5, 3)$, respectively. We also consider MSR with data transformation (DT-MSR), in which we first train an MSR with $K = 5$ and $T = 3$ to transform the covariates and then stack another MSR with $K = 5$ and $T = 1$. We provide detailed descriptions of the data and experimental settings in the Appendix.

With the number of categories in parentheses after the data set names, we summarize in Table 1 the classification error rates of the various models, where those of MSRs are calculated by averaging over paSB and parSB. Table 1 shows that an MSR with $K$ or $T$ sufficiently large generally outperforms paSB-MLR, paSB-robit, L2-MLR, and AMM, and that using another MSR on the transformed covariates can in general further reduce the error rate. This is especially evident when there are categories that are not linearly separable, as indicated by a clearly higher error rate of L2-MLR in contrast to that of SVM. One may notice that paSB-robit, paSB-MLR, and MSR with $K = T = 1$ are similar to L2-MLR in terms of performance, suggesting the effectiveness of the proposed permutation scheme, which helps mitigate the potential adverse effects of having asymmetric class labels. One may also note that paSB-robit outperforms paSB-MLR on glass, vehicle, waveform, dna, and satimage, indicating that there are benefits in using a robust classifier on these data sets. The comparable error rates of paSB-MSVM to SVM, and the better performance of MSRs on most data sets, demonstrate the success of the paSB framework in transforming a binary classifier with cross-entropy loss into a Bayesian multinomial one.

| Data (S) | paSB-MLR | paSB-robit | paSB-MSVM | MSR K=1, T=1 | MSR K=1, T=3 | MSR K=5, T=1 | MSR K=5, T=3 | DT-MSR | L2-MLR | SVM | AMM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| square (3) | 59.52 | 67.46 | 0 | 57.14 | 15.08 | 0 | 0 | 0 | 62.29 | 4.76 | 16.67 |
| iris (3) | 4.00 | 5.33 | 3.33 | 4.67 | 4.00 | 4.00 | 4.00 | 3.33 | 3.33 | 4.00 | 4.67 |
| wine (3) | 4.44 | 5.00 | 2.78 | 2.78 | 2.22 | 2.78 | 2.22 | 2.78 | 3.89 | 2.78 | 3.89 |
| glass (6) | 35.35 | 34.88 | 29.30 | 33.49 | 26.05 | 31.16 | 32.09 | 26.37 | 33.02 | 28.84 | 37.67 |
| vehicle (4) | 23.23 | 21.25 | 17.32 | 22.44 | 17.32 | 17.72 | 15.75 | 14.96 | 22.83 | 18.50 | 21.89 |
| waveform (3) | 17.87 | 16.42 | 15.76 | 19.84 | 16.62 | 15.67 | 15.04 | 15.56 | 15.60 | 15.22 | 18.54 |
| segment (7) | 7.36 | 8.03 | 7.98 | 6.20 | 6.49 | 6.45 | 5.63 | 7.65 | 8.56 | 6.20 | 12.47 |
| dna (3) | 5.06 | 4.05 | 5.31 | 4.13 | 4.47 | 4.55 | 4.22 | 3.88 | 5.98 | 4.97 | 5.43 |
| satimage (6) | 20.65 | 17.25 | 8.90 | 16.65 | 14.45 | 12.85 | 12.00 | 9.85 | 17.80 | 8.50 | 15.31 |

Table 1: Comparison of the classification error rates (%) of paSB-MLR, paSB-robit, paSB-MSVM, MSRs with various $K$ and $T$ (columns 5 to 8), MSR with data transformation (DT-MSR), L2-MLR, SVM, and AMM.

To further check whether a paSB model is attractive when fast out-of-sample prediction is desired, we consider using only the MCMC sample that has the highest training likelihood among the collected ones for all paSB models, and summarize in Table 4 of the Appendix the classification error rates of the various models, with the number of inferred support vectors or active hyperplanes included in parentheses. Following the definition of active experts in Zhou (2016), we define for MSRs the number of active hyperplanes as $T\sum_{s=1}^{S} \tilde{K}_s$, where $\tilde{K}_s$ is the number of active experts for class $s$. The number of active hyperplanes determines the computational complexity for out-of-sample prediction with a single MCMC sample, which is $O(T\sum_{s} \tilde{K}_s)$. Since the error rates of MSRs in Table 4 are calculated by averaging over both paSB and parSB, the corresponding number of active hyperplanes is $T\sum_{s}\bigl( \tilde{K}^{(\mathrm{paSB})}_s + \tilde{K}^{(\mathrm{parSB})}_s \bigr)$. Shown in Figure 8 in the Appendix are boxplots of the number of each category's active experts for MSR with $K = 5$ and $T = 3$. Except for several categories of satimage that require all $K = 5$ experts for parSB-MSR, $K = 5$ is large enough to provide the needed model capacity in all other scenarios. As shown in Table 4, MSRs with sufficiently large $K$ and/or $T$ are comparable to both SVM and paSB-MSVM in terms of the error rates, while clearly outperforming them in terms of the number of (active) hyperplanes/support vectors and hence the computational complexity for out-of-sample predictions.
While MSR with $K = T = 1$, paSB-MLR, and paSB-robit generally perform worse than SVM in terms of the error rates, they use far fewer hyperplanes and hence require significantly less computation for out-of-sample predictions. In summary, MSR, whose upper bound $K$ on the number of active experts and number of layers $T$ per expert can both be adjusted to control its capacity for modeling nonlinearity, achieves a good compromise between accuracy and computational complexity for out-of-sample prediction of multinomial class probabilities, and can be further improved by training an additional MSR on the transformed covariates.

We further measure how well the Gibbs sampler mixes using the effective sample size (ESS), for both paSB-MLR and the Bayesian multinomial logistic regression (Bayes MLR) of Polson et al. (2013). For both algorithms we let $\beta_j \sim N\bigl(0, \mathrm{diag}(\alpha_{j0}^{-1}, \ldots, \alpha_{jV}^{-1})\bigr)$, where $\alpha_{jv} \sim \mathrm{Gamma}(0.001, 1/0.001)$. The ESS (Holmes and Held, 2006) of a parameter or a function of parameters is defined as
$$\mathrm{ESS} = \frac{L}{1 + 2\sum_{h=1}^{\infty} \rho(h)},$$
where $L$ is the number of post-burn-in samples and $\rho(h)$ is the $h$th autocorrelation of the parameter or function of parameters. It describes how quickly an MCMC algorithm generates independent samples. Since the Gibbs sampler of Bayes MLR samples one $\beta_j$ conditioning on all $\beta_{j'}$ for $j' \neq j$, it may induce strong dependencies between different categories and hence slow down the mixing of the Markov chain. By contrast, the $\beta_j$'s are conditionally independent given the augmented variables $b_{ij}$ in paSB-MLR, which may lead to faster mixing. For both Bayes MLR and paSB-MLR, we consider five independent random trials, in each of which we randomly initialize the model parameters, run 10,000 Gibbs sampling iterations, and collect the last 1,000 MCMC samples of $\beta_j$. We use the mcmcse package (Flegal et al., 2016) to estimate the ESS of each $p_{ij}$ in a random trial using the 1,000 collected MCMC samples. For the training set, we calculate the 10% quantile, median, and 90% quantile of the ESSs of all $p_{ij}$ for each random trial, and then report their averages over the five random trials in Table 2. For the testing set, we follow the same steps and report the results in Table 2.

| Data | 10% quantile, training | 10% quantile, testing | Median, training | Median, testing | 90% quantile, training | 90% quantile, testing |
|---|---|---|---|---|---|---|
| square | 411.46 / 194.54 | 421.62 / 256.48 | 913.17 / 948.53 | 858.46 / 952.33 | 924.48 / 973.60 | 927.93 / 976.33 |
| iris | 82.30 / 90.33 | 73.75 / 84.40 | 149.47 / 218.21 | 156.25 / 174.16 | 331.71 / 854.45 | 341.39 / 793.00 |
| wine | 194.41 / 314.41 | 58.41 / 56.30 | 467.73 / 859.67 | 506.66 / 643.90 | 926.55 / 991.46 | 958.12 / 994.77 |
| glass | 70.18 / 162.69 | 67.81 / 138.90 | 137.18 / 335.14 | 122.27 / 329.21 | 359.75 / 686.43 | 347.40 / 615.02 |
| vehicle | 77.71 / 103.30 | 74.05 / 101.64 | 133.66 / 230.83 | 127.22 / 230.48 | 426.44 / 460.07 | 414.48 / 453.87 |
| waveform | 123.77 / 120.01 | 120.99 / 113.94 | 199.00 / 203.38 | 191.77 / 209.61 | 310.84 / 499.05 | 291.96 / 478.59 |
| segment | 114.91 / 104.77 | 95.63 / 91.31 | 281.11 / 294.17 | 270.63 / 355.46 | 742.63 / 844.11 | 752.74 / 814.62 |
| dna | 217.99 / 238.48 | 63.66 / 67.26 | 481.65 / 736.58 | 505.59 / 772.32 | 911.68 / 986.60 | 927.30 / 991.63 |
| satimage | 53.51 / 66.19 | 54.04 / 66.33 | 82.50 / 90.50 | 81.87 / 90.11 | 160.84 / 168.85 | 156.83 / 157.31 |

Table 2: Comparison of the ESS of the conditional class probability between Bayes MLR and paSB-MLR; each cell shows Bayes MLR / paSB-MLR.
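A minimal sketch of how the ESS defined above can be estimated from a single chain of post-burn-in samples; the experiments use the mcmcse R package, and the autocorrelation truncation rule below is an assumption of the sketch.

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """ESS = L / (1 + 2 * sum_h rho(h)) for a chain x of length L, truncating the
    autocorrelation sum at the first non-positive estimate (a common convention)."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / L
    if max_lag is None:
        max_lag = L - 1
    acf_sum = 0.0
    for h in range(1, max_lag + 1):
        rho = np.dot(x[:-h], x[h:]) / (L * var)   # lag-h autocorrelation estimate
        if rho <= 0:
            break
        acf_sum += rho
    return L / (1.0 + 2.0 * acf_sum)
```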
While paSB-MLR underperforms Bayes MLR on some of the data sets for the 10% ESS quantile, it consistently outperforms Bayes MLR on all data sets for both the ESS median and the 90% ESS quantile, for both training and testing.

### 6.6 Robustness of paSB-Robit Regression

We use the contaminated vehicle data to demonstrate the robustness of paSB-robit. As discussed by Liu (2004), the heavy-tailed conditional class probability function of robit regression can robustify the decision boundary when there are outliers. We use the vehicle training set as inliers, synthesize outliers that are far from the inliers, combine both as the new training set, and keep the testing set unchanged. We generate different numbers of outliers so that the ratio of outliers to inliers varies from 0, 0.1, 0.2, 0.3, to 0.5, at each of which we randomly simulate 10 different sets of outlier covariates. We provide the details on how we generate outliers in the Appendix. We compare L2-MLR and paSB-robit with $\kappa = 1$ degree of freedom on the contaminated vehicle data. Figure 7 shows the prediction error rate (mean ± standard deviation) on the testing set for different outlier-to-inlier ratios. When there are no outliers, both approaches deliver comparable performance. As the ratio increases, paSB-robit with $\kappa = 1$ more and more clearly outperforms L2-MLR, which justifies the robustness of paSB-robit.

Figure 7: Prediction error rates (%, mean ± standard deviation) for different ratios of outliers to inliers.

## 7. Conclusions

To transform a cross-entropy-loss binary classifier into a Bayesian multinomial regression model and derive efficient Bayesian inference, we develop a permuted and augmented stick-breaking construction. With permutation, we one-to-one map the categories to sticks to escape from poor category-stick mappings that impose restrictive geometric constraints on the decision boundaries, and with augmentation, we link a category outcome to conditionally independent stick-specific covariate-dependent Bernoulli random variables. We illustrate this general framework by extending binary softplus regression, robit regression, and support vector machines into multinomial ones. Experimental results validate our contributions and show that the proposed multinomial softplus regressions achieve a good compromise between interpretability, complexity, and predictability.

## Acknowledgments

The authors would like to thank the editor and two anonymous referees for their insightful and constructive comments and suggestions, and the Texas Advanced Computing Center for computational support.

## Appendix A. Additional Lemma and Proofs

**Proof** [Proof of Theorem 1] The conditional probability of $y_i$ given $\{z_s, \pi_{is}\}_{1,S}$ can be expressed as
$$P(y_i = s \mid \{z_s, \pi_{is}\}_{1,S}) = \sum_{b_{ij}:\, j > z_s} [P(b_{iz_s} = 1)]^{\mathbf{1}(z_s \neq S)} \Bigl[ \prod_{j < z_s} P(b_{ij} = 0) \Bigr] \Bigl[ \prod_{j > z_s} P(b_{ij}) \Bigr] = [P(b_{iz_s} = 1)]^{\mathbf{1}(z_s \neq S)} \Bigl[ \prod_{j < z_s} P(b_{ij} = 0) \Bigr] \Bigl[ \sum_{b_{ij}:\, j > z_s} \prod_{j > z_s} P(b_{ij}) \Bigr],$$
which becomes the same as (4) by applying (5) and $\sum_{b_{ij}:\, j > z_s} \prod_{j > z_s} P(b_{ij}) = 1$.

**Proof** [Proof of Lemma 3] Under the paSB construction, the probability ratio of categories (choices) $s$ and $s+d$ is a function of the stick success probabilities $\pi_{iz_s}, \pi_{iz_{s+1}}, \ldots, \pi_{iz_{s+d}}$. More specifically, assuming without loss of generality that $z_s < z_{s+d}$,
$$\frac{p_{is}(z)}{p_{i(s+d)}(z)} = \frac{(\pi_{iz_s})^{\mathbf{1}(z_s \neq S)}}{(\pi_{iz_{s+d}})^{\mathbf{1}(z_{s+d} \neq S)} \prod_{z_s \le j < z_{s+d}} (1 - \pi_{ij})},$$
which depends on the alternatives mapped to the sticks between $z_s$ and $z_{s+d}$.

**Proof** [Proof of Lemma 4] Since
$$p_{is}(z) = \Bigl[ 1 - \bigl(1 + e^{x_i'\beta_s}\bigr)^{-r_s} \Bigr]^{\mathbf{1}(s \neq S)} \prod_{j < s} \bigl(1 + e^{x_i'\beta_j}\bigr)^{-r_j},$$
and each factor is at most one, the solutions to $p_{is}(z) > p_0$ are bounded by the set of solutions to $\bigl(1 + e^{x_i'\beta_j}\bigr)^{-r_j} > p_0$, $j \in \{1, \ldots, s-1\}$, and $1 - \bigl(1 + e^{x_i'\beta_s}\bigr)^{-r_s} > p_0$, and hence bounded by the convex polytope defined by the set of solutions to the $s$ inequalities
$$x_i'\bigl[(-1)^{\mathbf{1}(j=s)}\beta_j\bigr] < (-1)^{\mathbf{1}(j=s)} \ln\!\left\{ \Bigl[ p_0^{\mathbf{1}(j \neq s)} (1 - p_0)^{\mathbf{1}(j=s)} \Bigr]^{-\frac{1}{r_j}} - 1 \right\}, \quad j \in \{1, \ldots, s\}.$$
**Lemma 5** Without loss of generality, let us assume that the category-stick mapping is fixed at $z = (1, 2, \ldots, S)$. The paSB multinomial logistic model that assigns choice $s \in \{1, \ldots, S\}$ to individual $i$ with probability $p_{is} = (\pi_{is})^{\mathbf{1}(s \neq S)} \prod_{j < s} (1 - \pi_{ij})$, where $\pi_{is} = 1/(1 + e^{-W_{is}})$, can be represented as a latent-utility model in which $y_i = s$ is observed if $U_{is} > \sum_{j > s} U_{ij}$, for $s = 1, \ldots, S$, where the utilities are defined as
$$U_{i1} = U_{i2} + \cdots + U_{iS} + W_{i1} + \varepsilon_{i1}, \qquad U_{is} = \sum_{j > s} U_{ij} + W_{is} + \varepsilon_{is}, \qquad U_{i(S-1)} = W_{i(S-1)} + \varepsilon_{i(S-1)},$$
and the $\varepsilon_{is} \overset{iid}{\sim} \mathrm{Logistic}(0, 1)$ are independent and identically distributed (i.i.d.) random variables following the standard logistic distribution.

**Proof** [Proof of Lemma 5] Note that $P(\varepsilon < x) = 1/(1 + e^{-x})$ if $\varepsilon \sim \mathrm{Logistic}(0, 1)$. First consider the choice of individual $i$ being $y_i = 1$, which happens with probability
$$P(y_i = 1) = P\Bigl( U_{i1} > \sum_{j \neq 1} U_{ij} \Bigr) = P(\varepsilon_{i1} > -W_{i1}) = 1/(1 + e^{-W_{i1}}) = \pi_{i1} = p_{i1}.$$
Then, for $s = 2, \ldots, S-1$,
$$P(y_i = s) = P(y_i = s \mid y_i > s-1)\, P(y_i > s-1) = P\Bigl( U_{is} > \sum_{j > s} U_{ij} \Bigr) \prod_{j \le s-1} P\Bigl( U_{ij} < \sum_{j' > j} U_{ij'} \Bigr) = P(\varepsilon_{is} > -W_{is}) \prod_{j \le s-1} P(\varepsilon_{ij} < -W_{ij}) = \pi_{is} \prod_{j \le s-1} (1 - \pi_{ij}) = p_{is}.$$
Finally, $P(y_i = S) = 1 - \sum_{j < S} P(y_i = j) = \prod_{j < S} (1 - \pi_{ij}) = p_{iS}$.

| | square | iris | wine | glass | vehicle | waveform | segment | dna | satimage |
|---|---|---|---|---|---|---|---|---|---|
| Train size | 294 | 120 | 142 | 171 | 592 | 500 | 231 | 2000 | 4435 |
| Test size | 126 | 30 | 36 | 43 | 254 | 4500 | 2079 | 1186 | 2000 |
| Covariate number | 2 | 4 | 13 | 9 | 18 | 21 | 19 | 180 | 36 |
| Category number | 3 | 3 | 3 | 6 | 4 | 3 | 7 | 3 | 6 |

Table 3: Multi-class classification data sets used in the experiments for model comparison.
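As a final sanity check, a small self-contained simulation (ours, with hypothetical variable names) confirming that sequential stick-specific Bernoulli draws reproduce the stick-breaking category probabilities under the identity mapping $z = (1, \ldots, S)$:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 4
pi = rng.uniform(size=S)                       # stick success probabilities (pi[S-1] unused)

# closed-form category probabilities under the identity mapping z = (1, ..., S)
p = np.array([(pi[s] if s < S - 1 else 1.0) * np.prod(1.0 - pi[:s]) for s in range(S)])

# Monte Carlo: y is the first stick choosing 1, or stick S if the first S-1 all choose 0
draws = np.zeros(S)
for _ in range(200_000):
    y = S - 1
    for s in range(S - 1):
        if rng.random() < pi[s]:
            y = s
            break
    draws[y] += 1
print(np.round(p, 3), np.round(draws / draws.sum(), 3))   # the two vectors should agree closely
```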