# baxter_permutation_process__3fca52ff.pdf Baxter Permutation Process Masahiro Nakano Akisato Kimura Takeshi Yamada Naonori Ueda NTT Communication Science Laboratories, NTT Corporation {masahiro.nakano.pr,akisato.kimura.xn,takeshi.yamada.bc,naonori.ueda.fr} @hco.ntt.co.jp In this paper, a Bayesian nonparametric (BNP) model for Baxter permutations (BPs), termed BP process (BPP) is proposed and applied to relational data analysis. The BPs are a well-studied class of permutations, and it has been demonstrated that there is one-to-one correspondence between BPs and several interesting objects including floorplan partitioning (FP), which constitutes a subset of rectangular partitioning (RP). Accordingly, the BPP can be used as an FP model. We combine the BPP with a multi-dimensional extension of the stick-breaking process called the block-breaking process to fill the gap between FP and RP, and obtain a stochastic process on arbitrary RPs. Compared with conventional BNP models for arbitrary RPs, the proposed model is simpler and has a high affinity with Bayesian inference. 1 Introduction Bayesian nonparametric (BNP) methods can overcome the model complexity problem of machine learning tasks, as they can be regarded as an analysis of finite subsets of potentially infinite data using infinite-dimensional probabilistic models, i.e., stochastic processes. Indeed, a variety of stochastic processes have been proposed and applied to various real-world tasks. However, in general, it is not easy to define and control new BNP models, because they should satisfy certain stringent conditions,1 such as projectivity [10, 43, 44, 45, 16], exchangeability [6, 7, 31, 32], and conditional projectivity [44, 45]. In this paper, we develop a BNP model of Baxter permutations (BPs). This model involves new stochastic processes and is applied to relational data analysis. Currently, there are a variety of BNP models for relational data analysis. Recent excellent surveys can be found in [20, 46]. Conventional models are broadly classified into three categories: (a) clustering through rectangular partitioning (RP), (b) factor analysis (extraction of multiple clusters) [14, 47, 52, 40, 30, 13], and (c) analysis using more flexible structures [5, 21, 24, 37, 38, 26, 19, 23, 22]. This paper focuses on the first category. Its advantage is that all clusters are disjoint rectangles characterized by products of subsets of each dimension of the relational data, which can be easily interpreted. For RP models, the infinite relational model (IRM) [33] and the Mondrian process (MP) [49, 48] have been widely studied and applied to real world applications. However, these models cannot represent arbitrary RPs. That is, their supports are limited to some subsets of all possible RPs (Figure 1, second and third). In contrast, the Gilbert tessellation (GT) [27, 39] and the rectangular tiling process (RTP) [42] have been proposed for arbitrary RPs with no restrictions (Figure 1, fourth). However, for the GT, it is known that the statistical behavior of it is notoriously difficult to analyze [12]. For the RTP, it constructs a probabilistic generative model that directly generates a RP of grids with infinite size. However, it has too complicated procedures for the model construction due to its projectivity property, and is not well-suited for Bayesian inference. Contributions - The aim of this paper is to construct a new BNP model for arbitrary RPs, so that it has a simple description and high affinity with Bayesian inference. We first discuss RPs and 1Plainly, these conditions are fundamental assumptions for dealing with infinity, that is, for BNP models to analyze finite subsets of potentially infinite data via infinite-dimensional probabilistic models. 34th Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada. Figure 1: Relational data and three classes of rectangular partitioning discussed in combinatorics [41]. (From left to right) First: Samples of (binary) relational data. Second: Regular grid - The rows and columns are partitioned into clusters. Each block is characterized by the product of the row and column clusters. Third: Hierarchical - Partitionings are expressed as binary trees where nodes represent a vertical or horizontal separation of a rectangle into two disjoint rectangles. Fourth: Arbitrary - No restrictions are required. This class is obtained by the proposed method. Fifth: Example not included in either hierarchical or regular grid. room1 room2 room3 room4 room1 room2 room3 room4 room1 room2 room3 room4 Figure 2: Left: Illustration of Aldous-Hoover-Kallenberg representation of exchangeable array. Right: Illustration of definition of FP. These different RP samples are equivalent in the sense of FP. floorplan partitioning (FPs) (plainly, FPs constitute a subset of RPs). Surprisingly, there is one-to-one correspondence between FPs and BPs [18], which are a class of permutations [9]. Based on this fact, the main contributions of this paper are to propose new stochastic processes shown as follows: The BP process (BPP): We construct a generative probabilistic BP model, the projectivity property of which ensures the existence of its limit, that is, an infinite BP model. By the one-to-one correspondence between BPs and FPs, the BPP can also be used as an FP model. The block-breaking process (BBP): We combine the BPP with block-breaking process, a multi-dimensional extension of the stick-breaking process [51], to fill the gap between FP and RP. We apply the BBP to the Aldous-Hoover-Kallenberg representation [6, 29, 32] to obtain a BNP model for arbitrary RPs of relational data. 2 Preliminaries 2.1 Relational models, Rectangular partitioning (RP), and Floorplan partitioning (FP) In this paper, RP can be regarded as partitions of [0, 1] [0, 1] such that all blocks form disjoint rectangle clusters of [0, 1] [0, 1]. By the Aldous-Hoover-Kallenberg (AHK) representation theorem [6, 29, 32] for exchangeable arrays, the RP has high affinity with the BNP model. Figure 2 (left) shows an illustration of the AHK representation. We assume that an observation of relational data consists of rows indexed by {1, . . . , N} and columns indexed by {1, . . . , M}. Given some BNP models for RP, a generative probabilistic model of the relational data can be easily constructed as follows. First, we draw an RP sample based on some BNP models. Then we draw independent and identically distributed (i.i.d.) uniform random variables: U row i Uniform([0, 1]) (i = 1, 2, . . . , N), U column j Uniform([0, 1]) (j = 1, 2, . . . , M). (1) Finally, the cluster assignment of each element, with row and column indexed by i and j, respectively, is specified by the block on [0, 1] [0, 1] to which the point (U column i , U column j ) belongs. According to the AHK representation, we can focus on constructing BNP models for RP. In addition, we introduce another important concept, namely FP. In an FP, the size of each rectangle block of the room partition is irrelevant. We follow the definition in [50] regarding the notion of equivalence for two FP samples. Figure 2 (right) shows an example. Given an FP sample f, a segment Figure 3: Illustration of Algorithm 1. Left: The top-left room is labeled as 1, and deleted by the topleft room deletion operator. Likewise, the top-left room is labeled as 2, . . . , and delete it hereinafter. As a result, all rooms are labeled by 1, 2, . . . . Right: The BP is obtained by repeatedly extracting the label of the bottom-left room and deleting it using the bottom-left room deletion. (cut) s supports a room (block) r in f if s contains one of the edges of r. We say that s and r have a top-, left-, right-, or bottom-seg-room relation if s supports r from the respective direction. Two FP samples are equivalent if there is a labeling of their rooms and segments such that they hold the same seg-room relations under the labeling. Thus, three FP samples in Figure 2 (right) are equivalent. 2.2 Baxter permutations In 1964, Glen Baxter introduced a class of permutations in the context of fixed points for the composition of commuting functions, which now bear his name [9]. A Baxter permutation (BP) on {1, 2, . . . , n} (n N) is a permutation π = (σ1σ2 . . . σn) for which there are no quadruples of indices i < j < j + 1 < k such that σj < σk < σi < σj+1 or σj+1 < σi < σk < σj. (2) For example, a permutation π = (σ1σ2 . . . σ8) = 61832547 is not Baxter, since it contains a quadruple 1 < 3 < 4 < 8 such that σ4 = 3 < σ1 = 6 < σ8 = 7 < σ3 = 8. For more intuitions, consider the case of n = 4. All permutations of {1, 2, 3, 4} are listed as follows: 1234, 1243, 1324, 1342, 1423, 1432, 2134, 2143, 2314, 2341, 2413, 2431, 3124, 3142, 3214, 3241, 3412, 3421, 4123, 4132, 4213, 4231, 4312, 4321. (3) A BP avoids the patterns, 3142 and 2413. Such patterns with prescribed adjacencies are often termed vincular patterns. The BPs are a well-studied class of permutations, which have a number of nice properties associated to them. We briefly review the most relevant two properties of the BPs in this paper. First, there is a one-to-one correspondence between BPs and several combinatorial objects, such as twin binary trees, plane bipolar orientations and some type of three non-intersecting paths on a grid [18, 25]. Especially, in this paper, we focus on its application to the FP. We show a direct bijection between FP and BP, introduced by [55, 50]. Second, we introduce some useful properties related to the enumeration of the BPs, and describe the enumeration algorithm proposed in [15]. 2.2.1 Mapping from floorplan partitioning to Baxter permutation We first define the following operator on FP. Given an FP sample with n rooms in [0, 1] [0, 1] as its bounding rectangle, we can obtain a FP sample with (n 1) rooms by using the following room deletion operator, introduced by [28]. The top-left room deletion is defined as follows: Definition (Top-left room deletion). Let f be an FP sample with n > 1 rooms and let r be the top-left room in f. (1) If the bottom-right corner of r has a junction, then we delete r from f by shifting the bottom edge upwards while keeping all junctions on the bottom edge attached, until the edge reaches the bounding rectangle. (2) If the bottom-right corner of r has a junction, then we delete r from f by shifting the right edge leftwards while keeping all junctions on the right edge attached, until the edge reaches the bounding rectangle. Similarly, we can define the bottom-left room deletion operator. Then, according to the top-left and bottom-left room deletion operators, we can obtain the mapping from the FP into the BP. Figure 3 shows an illustration of Algorithm 1. The output of Algorithm 1 is always a BP, as shown in [17] (Lemma 3.6). Moreover, the mapping corresponding to Algorithm 1 is injective [17] (Lemma 3.7). Next we move on to the mapping from the BP to the FP. Algorithm 1 MAPPING FLOORPLAN PARTITIONING TO BAXTER PERMUTATION Input: Floorplan partitioning f with n rooms. Assign labels 1, 2, . . . , n in ascending order into n rooms by repeatedly labeling the top-left room and applying top-left room deletion operator to it (Figure 3, left). Output: Return the permutation of labels obtained by repeatedly extracting the label of the bottom-left room and applying the bottom-left room deletion operator into it (Figure 3, right). Figure 4: Illustration of Algorithm 2. A BP sample π = (σ1σ2 . . . σn) = 25314 is transformed to a FP sample. First, we draw a block labeled as σ1 = 2, and construct a 5 5 grid. Second, since we have σ2 = 5 > σ1 = 2, we bisect the top-right block by a vertical segment at the second grid. Third, since we have σ3 = 3 < σ2 = 5, we bisect the top-right block by a horizontal segment at the third grid. Fourth, we bisect the top-right block by a horizontal segment, and then extend the block σ4 = 1 leftward at the expense of σ1 = 2, since the block σ1 = 2 to the left of σ4 = 1 has a label greater than σi. Finally, Algorithm 2 obtains the corresponding FP sample to 25314. 2.2.2 Mapping from Baxter permutation to floorplan partitioning Given a BP on {1, . . . , n}, Algorithm 2 constructs a FP sample with n rooms [17]. As is shown in Figure 4, the algorithm iteratively inserts rooms one by one into the top-right corner of the FP. The i-th room is generated by bisecting the previous room, and is labeled according to the i-th element in the BP. If the (i 1)-th element is smaller (resp., greater) than the current element, the room is bisected vertically (resp., horizontally). The resulting horizontal (resp., vertical) segment is extended leftward (resp., downward) if the room to the left (resp., below) has a greater (resp., smaller) label than that of the current room. Algorithm 2 MAPPING BAXTER PERMUTATION TO FLOORPLAN PARTITIONING Input: Baxter permutation π = (σ1σ2 . . . σn). Draw a block and label it as σ1. Construct an n n grid within the block. for i=2 to n do if σi < σi 1 then Bisect the top-right block by a horizontal segment at the i-th grid. Label the new top-right block as σi. while t do he block σ to the left of σi has a label greater than σi, Extend the block σi leftward at the expense of σ . end while else Bisect the top-right block by a vertical segment at the i-th grid. Label the new top-right block as σi. while t do he block σ below σi has a label smaller than σi, Extend the block σi downward at the expense of σ . end while end if end for Output: Floorplan partitioning with n blocks. 2.2.3 Enumeration of Baxter permutations In order to construct a generative BP model, the enumeration algorithm proposed in [15] is quite useful. Here, we briefly review the enumeration process for BPs. 3 1 2 5 6 4 7 3 1 2 8 5 6 4 7 3 1 2 8 9 5 6 4 7 10 3 1 2 8 9 5 6 4 7 3 1 2 5 6 4 3 1 2 7 5 6 4 3 1 2 8 7 5 6 4 3 1 2 8 7 9 5 6 4 3 1 2 8 7 9 5 6 10 4 3 1 2 5 6 4 Figure 5: Left: Illustration of BPP. Consider a BP 312564 Z6 and its latent parameters U1, . . . , U6. This BP has left-to-right maxima x1 = 3 < x2 = 5 < x3 = 6 and right-to-left maxima 6 = y2 > 4 = y1. If U7 is drawn from the interval [U3, U5], then 7 is inserted to the immediate left of 5 of 312564, and the resulting BP on {1, . . . , 7} is 3127564. We emphasize that the BP is not equivalent to the order of U1, . . . , U7. Right: Illustration of FP evolution according to underlying BPP. Two FP samples are growing according to the BPP. Instead of direct transformations from a FP with n blocks to a FP with n + 1 blocks, the evolution of a FP is obtained only through the underlying evolution of a BP by using Algorithm 2. For example, we consider an evolution of a BP from 312564 to 3127564. We apply Algorithm 2 to both 312564 and 3127564, and obtain the corresponding FPs to 312564 and 3127564, respectively. The first property is that BPs are closed under removing the largest label, leading to the projectivity property of the BPP for Kolmogorov s extenstion theorem (discussed later in Section 3, Proposition 3.2). We note that this is not immediately obvious, as BPs are given by a vincular pattern, that involves adjacency issues. However, the following was positively proved in [15] ([17], Lemma 3.1): Proposition 2.1 If π = (σ1σ2 . . . σn) is a BP on {1, . . . , n}, and we remove its largest label σi = n, then the result is also a BP. The second issue is a method for generating a BP on {1, . . . , n} from a BP on {1, . . . , n 1}. Proposition 2.1 means that every BP on {1, . . . , n} arises from a BP on {1, . . . , n 1} by inserting n into an admissible position. Fortunately such admissible positions were explicitly determined in [15]: Proposition 2.2 Given a BP on {1, . . . , n 1}, we consider the BP on {1, . . . , n} by inserting n. The admissible positions where n can be inserted are limited to each of the immediate left of the left-to-right2 maxima, and to each of the immediate right of the right-to-left maxima. The third property is whether we can enumerate all possible BPs by the procedure shown in Proposition 2.2, which specifies the support of the BPP (discussed later in Section 3, Proposition 3.1): Corollary 2.3 Consider the generating tree for BP that every node on the n-th level corresponds to a BP on {1, . . . , n}, and has the children nodes obtained by inserting (n + 1) into all admissible positions of the corresponding BP of the parent node, described in Proposition 2.2. For any n N, the set of the BPs corresponding to the nodes on the n-th level of this generating tree is equivalent to all BPs on {1, . . . , n}. 3 Baxter permutation process (BPP) The first contribution of this study is a BNP model for BPs. Let Zn be the set of all BPs on {1, . . . , n}. The BPP is a discrete-time Markov process on BPs and generates an object that, on the n-th time, corresponds to a BP sample on Zn. We present an illustrative example of the proposed model. Given the BP sample 312564 Z6, we consider the possible BPs obtained by inserting 7 into admissible positions. According to Proposition 2.2, these positions are immediately left of the left-to-right maxima 3, 5, 6 and immediately right of the right-to-left maxima 4, 6, that is, |{z} 3 1 2 |{z} 5 |{z} 6 |{z} 4 |{z} . (4) As shown in this example, the evolution of the BPP depends on the left-to-right and the right-to-left maxima, as well as the choice of the admissible positions. For notational convenience, we use 2Let σ1 . . . σn be a permutation on {1, . . . , n}. We call σi a left-to-right maximum if σi > σj for all j < i. Similarly, we call σi a right-to-left maximum if σi > σj for all j > i. Figure 6: Evolution of FP according to BPP. The left FP corresponds to 25314. The right four patterns are all possible FPs corresponding to the BPs in Z6 whose projection onto Z5 is 25314. We note again that we do not have direct transformations from the FP corresponding to 25314 to the FPs with 6 block. We apply Algorithm 2 to 625314, 265314, 256314 and 253146 independently to obtain the corresponding FPs. x1, x2, . . . , xi and y1, y2, . . . , yj to indicate the left-to-right maxima and the right-to-left maxima of a BP, respectively. In order to describe the evolution of the BPP, we introduce auxiliary variables, consisting of a sequence of independent and identically distributed (i.i.d.) uniform random variables U1, U2, . . . on [0, 1]. The resulting BPP sample on the n-th time is obtained from U1, . . . , Un. Figure 5 provides an illustration. In the following, we will provide a more precise description. Model description - The BPP is a discrete-time Markov process π := (π(tn), n N) over time t1, t2, . . . where each π(tn) is a BP sample on Zn. The BPP π(tn) on tn has a collection of latent parameters, consisting of i.i.d. uniform random variables U1, . . . , Un on [0, 1]. Given a sample π(tn) = (σ1σ2 . . . σn) generated from U1, . . . , Un, a sample π(tn+1) is drawn as follows. Without loss of generality, we can assume that π(tn) has left-to-right maxima x1 < < xi = n and right-to-left maxima n = yj > > y1. We additionally assume that U1, . . . , Un satisfies Ux1 < Ux2 < < Uxi = Un = Uyj < Uyj 1 < < Uy1. (5) We note that this assumption is not obvious, and therefore it will be proved by mathematical induction. For convenience, we let Ux0 = 0 and Uy0 = 1. The above inequality implies that the real line [0, 1] is divided into intervals [Ux0, Ux1], [Ux1, Ux2], . . . [Uxi 1, Uxi], [Uyj, Uyj 1],. . . , [Uy1, Uy0]. Then, the latent parameter Un+1 is independently drawn from the uniform distribution on [0, 1]. If Un+1 is located on the interval [Uxk 1, Uxk] (k = 1, . . . , i), then (n + 1) is inserted to the immediate left of xk. If Un+1 is located on the interval [Uyl, Uyl 1] (l = 1, . . . , j), then (n + 1) is inserted to the immediate right of yl. By construction, Equation (5) also holds for U1, . . . , Un+1. Therefore, by induction, Equation (5) holds for all n N. For example, we consider the BP π(t6) = 312564 Z6, as shown in Figure 5. We assume that U1 . . . , U6 is drawn as the top of Figure 5 (left). This BP has left-to-right maxima x1 = 3 < x2 = 5 < x3 = 6 and right-to-left maxima 6 = y2 > 4 = y1, as shown in the middle. If U7 is drawn on the interval [U3, U5], then 7 is inserted to the immediate left of 5 of 312564, and the resulting BP π(t7) Z7 corresponds to 3127564. We note that the BP is not equivalent to the order of U1, . . . , U7. Properties - The BPP π(tn) can define the probability measures µn on (Zn, 2Zn). In the following, we study some properties of µn. All proofs are provided in the supplementary material. First, we study the support of µn. It has positive probabilities for any possible BPs. Theorem 3.1 (Support). For any n N and zn Zn, we have µn(zn) > 0. Subsequently, we prove that by Kolmogorov s extension theorem, the projective limit µ of probability measures µn (n ) exists: Theorem 3.2 (Projectivity). Let µn n N be the family of probability measures, derived from the BPP. The projector Qm,n : Zm Zn (n < m N) is defined as follows: For a BP on {1, . . . , m}, the projector Qm,n removes the largest (m n) labels of the permutation and generates a new BP on {1, . . . , n}. Then, for any n < m N and An 2Zn, we have the projectivity3 property: µn(An) = µm(Q 1 m,n An). Accordingly, by Kolmogorov s extension theorem, the family of probability measures µn, Qm,n n m N is uniquely extended to the projective limit probability measure µ of the BP on {1, 2, . . . }. 3In this area of research, projective or self-similar RP is a very popular notion. Therefore, one might think that this projectivity property of the BPP is carried over into self-similarity of the corresponding FP (Fig.5, right). However, it is not true. For example, the FP corresponding to 3127564 is not self-similar to that of Q7,63127564 = 312564. The projectivity property of the BPP is entirely considered in the Baxter permutation domain, whose main purpose is the existence of a model of BPs on {1, . . . , }. Figure 7: Left: Illustration of BBP. The BBP sequentially adds a new bottom-right block into the current rectangular partitioning. For visibility, C2, Cmin 2 , C3, Cmin 3 are omitted. 4 Block-breaking process (BBP) The BPP can also be used for FP, according to Algorithm 2. However, we have to fill the gap between FP and RP to construct a BNP model based on the AHK theorem. Our strategy is to introduce size adjusting parameters to generate sized blocks of RP from corresponding size-less rooms of FP, which are generated by the BPP. As shown in Figure 6, the evolution of the BPP corresponds to adding a new bottom-right room to the FP. We additionally introduce a sequence of i.i.d beta random variables into the BPP to control the size of the rooms of the FP drawn from the BPP. As in the stick-breaking process (SBP) of [0, 1] [51], the new process is termed block-breaking process (BBP) of [0, 1] [0, 1]. High-level sketch - The BBP can be broadly interpreted as a multi-dimensional extension of the SBP. We recall that the SBP generates infinite number of sticks of a line [0, 1] by recursively drawing a beta random variable β and breaking the remaining stick at a ratio of β : (1 β). Plainly, the BBP replaces the line [0, 1] and the sticks of the SBP with the bounding rectangle [0, 1] [0, 1] and rectangle blocks, respectively. The central difficulty of the construction of the BBP unlike the SBP is to have to additionally care about to which directions a new partition should be added recursively. Therefore, we employ the BPP to navigate the evolution of the underlying FP. Following this intuition, we now provide a more precise description. Model description - The BBP is a discrete-time Markov process b := (b(tn), n N) over time t1, t2, . . . where each b(tn) is an RP sample with n blocks. The BBP b(tn) on tn has a collection of latent parameters, consisting of i.i.d. uniform random variables U1, . . . , Un on [0, 1], and i.i.d. beta random variables β1, . . . , βn 1. Figure 7 shows an illustration of the generative BBP model. We consider an RP sample r(tn 1) obtained from U1, . . . , Un 1 and β1, . . . , βn 2, and an FP sample f(tn 1) with (n 1) rooms, also obtained from U1, . . . , Un 1 according to the BPP. Given b(tn 1) and f(tn 1), a sample b(tn) at the next time tn is drawn as follows. We first draw Un and βn 1 from the uniform and the beta distributions, respectively. If the right-bottom corner of the (n 1)-th room of f(tn) is on the left (or top) side of the right-bottom corner of the n-th room of f(tn), then let Cn be the set of all blocks (light gray and dark gray in Figure 7) of b(tn) such that the corresponding rooms of f(tn) are adjacent to the left (or top) of the n-th room of f(tn) . Let Cmin n be a block (dark gray in Figure 7) in Cn with the minimum width (or height) ln. The n-th block of the RP is generated by cutting blocks in Cn so that the n-th block has a width (or height) (1 βn 1)ln. Properties - As is well known, the SBP-based mixture model (for sequence partitioning) has the following two useful properties. (a) It can express arbitrary partitions of any finite observations. (b) For sufficiently small ϵ > 0, the infinitely many sticks on [1 ϵ, 1] do not contribute the finite observation data, and the active partitions are concentrated on [0, 1 ϵ]. These properties are carried over into the BBP b = (b(t1), b(t2), . . . ). By construction, the top-left corner locations of all blocks of b(tn) are invariant on t tn. This leads to the two useful properties of the BPP-based relational model which is obtained by applying the limit b(t ) to the intermediate random function on [0, 1] [0, 1] of the AHK representation (described in Section 2.1). (a) One is the support of the BPP. The BBP covers arbitrary RPs: this can be easily deduced from the aforementioned property of the BBP constructively. (b) The other is concerning the number of active blocks of b(t ) for finite observations. We consider a finite observation matrix consisting of rows indexed by {1, . . . , N} and columns indexed by {1, . . . , M}. Let U row max and U column max be max{U row i | i = 1, . . . , N} and max{U column j | j = 1, . . . , M}, respectively. By construction of the BBP, there exists a natural number k < such that the top-left corner of the k-th block of b(t ) is located in the region [U row max, 1] [U column max , 1] with probability 1. As a result, all elements of the observation matrix must be assigned to the 1, . . . , (k 1)-th blocks. Therefore, typical Bayesian inference methods, such as Markov chain Monte Carlo (MCMC), can naturally avoid handling an infinite number of blocks. 5 Application to relational data analysis Relational model - The BBP-based relational model is applied to the input matrix X := (Xi,j)N M. We assume that X consists of categorical elements, that is, Xi,j {1, 2, . . . , H}, where H N. The generative model can be constructed as follows. The BBP consists of i.i.d. uniform random variables U := (U1, U2 . . . ) on [0, 1], and i.i.d. beta random variables β := (β1, β2, . . . ): Uk Uniform([0, 1]), βk Beta(1, α) (k = 1, 2, . . . ), (6) where α is a non-negative hyper-parameter. For notational convenience, we also use U k = (U1, U2 . . . , Uk) and βk = (β1, β2, . . . , βk). They correspond to a sample of rectangular partitioning on [0, 1] [0, 1]. The k-th block has a latent Dirichlet random variable φk: φk Dirichlet(α0) (k = 1, 2, . . . ), (7) where α0 = (α0, . . . , α0) is a H-dimensional non-negative hyper-parameter. According to the AHK representation [6, 29, 32], each row and column of the input matrix is mapped into [0, 1]: U row i Uniform([0, 1]) (i = 1, 2, . . . , N), U column j Uniform([0, 1]) (j = 1, 2, . . . , M). (8) Finally, given the row locations U row := (U row 1 , . . . , U row N ), the column locations U column := (U column 1 , . . . , U column M ), the BBP parameters consisting of U = (U1, U2 . . . ) and β = (β1, β2, . . . ), and (φ1, φ2, . . . ), each element Xi,j of the input matrix is drawn from the H-dimensional categorical distribution: Xi,j | U row, U column, U, β, φk(i,j) Categorical(φk(i,j)), (9) where k(i, j) indicates the block index to which the point (U column i , U column j ) belongs. We compare the BBP-based relational model with the BNP stochastic block models based on RP: (1) The IRM [33]: the intermediate random function of the AHK representation is drawn from the product of the SBPs, and the concentration parameter is drawn from the Gamma(1, 1) prior, as in [23]. (2) The MP [49]: the intermediate random function of the AHK representation is drawn from the MP, the budget parameter of which is set to 3, as in [23]. (3) The RTP [42]: we combine the product of the SBPs (also used in the aforementioned IRM) and the RTP is combined to construct the AHK representation. Bayesian inference - For all models, we used an MCMC method that iterates over (1) drawing U row and U column (i.e., the corresponding locations on [0, 1] of the rows and columns of the input matrix for the AHK representation), (2) updates of the current intermediate random function of the AHK representation (i.e., the current RP in the MCMC iterations), and (3) changing the complexity of the RP based on reversible jump schemes. To change the RP complexity of the MP and the RTP, we employ the methods in [53] and [42], respectively. For the reversible jump proposal of the BBP, a new block can be added, or the block with the largest label can be removed in the evolution of the BBP. The full description of our Bayesian inference method is provided in the supplementary material. The source code is available at https://github.com/nttcslab/baxter-permutation-process. Datasets - We synthetically generated three relational matrices, with ground-truth partitions corresponding to regular grid, hierarchical, and arbitrary RP samples, respectively. Each matrix consists of 300 300 binary elements drawn from the beta-Bernoulli likelihood model. We also used four social network datasets [54, 35] (corresponding to Figure 1): Wiki (top-left) [1], consisting of 7115 nodes and 103689 edges with diameter 7. Facebook (top-right) [2], consisting of 4039 nodes and 88234 edges with diameter 8. Twitter (bottom-left) [3], consisting of 81306 nodes and 1768149 edges with diameter 7. Epinion (bottom-right) [4], consisting of 75879 nodes and 508837 edges with diameter 14. For each data, we selected the top 1000 active nodes based on their interactions with others; subsequently we randomly sampled 500 500 matrix to construct the relational data, as in [23]. For model comparison, we held out 20% cells of the input data for testing, and each model was trained by the MCMC using the remaining 80% of the cells. We evaluated the models using perplexity as a criterion: perp( ˆX) = exp( (log p( ˆX))/N), where N is the number of non-missing cells in the partitioned matrix ˆX. Wall-clock time BBP RTP MP IRM Wall-clock time BBP RTP MP IRM Wall-clock time 1.2 BBP RTP MP IRM Wall-clock time BBP RTP MP IRM Facebook data Twitter data Epinion data Figure 8: Experimental results of perplexity comparison. Each column corresponds to each real world data (The results for synthetic data are reported in the supplementary material). Top: Relationship between test perplexity (mean std) evolution and wall-clock time. Bottom: Relationship between test perplexity evolution and 2000 MCMC iterations for 10 trials. Experimental results - Table 1 and Figure 8 summarize the test perplexity comparison results. We recall that the arbitrary RP (covered by the BBP and the RTP) includes the hierarchical RP (corresponding to the MP) and the regular grid RP (corresponding to the IRM). Therefore, we can expect that the BBP (and the RTP) essentially does not degrade the predictive performance for any ground-truth partitions. However, in practice, there may be certain issues related to Bayesian inference, such as local optima and slow mixing. Fortunately, as shown in Table 1, the BBP exhibits better (at least comparable) performance than the other three models. It can also be seen that the RTP achieves a predictive performance similar to that of the BBP (Figure 8, bottom). However, as shown in Figure 8 (top), the RTP has high computational cost. We also observe that the IRM performs faster mixing of the MCMC iterations than the BBP. This implies that the BBP may be improved by using more sophisticated inference methods, including sequential Monte Carlo methods [34, 26], particle Markov chain Monte Carlo samplers [23, 8], and Bayesian combinatorial optimization methods [36, 11]; this is a further research direction. Table 1: Perplexity comparison for real-world relational data analysis (mean std) IRM [33] MP [49] RTP [42] BBP (proposed) Synth (regular grid) 1.1791 0.0031 1.3690 0.0951 1.2709 0.0820 1.2136 0.0292 Synth (hierarchical) 1.2163 0.0145 1.2956 0.0913 1.2262 0.0314 1.2014 0.0105 Synth (arbitrary) 1.1299 0.0070 1.1983 0.0711 1.1406 0.0271 1.1161 0.0151 Wiki 1.2898 0.0045 1.2838 0.0094 1.2762 0.0085 1.2712 0.0056 Facebook 1.2012 0.0058 1.1944 0.0217 1.1895 0.0221 1.1818 0.0197 Twitter 1.2265 0.0038 1.2316 0.0209 1.2243 0.0067 1.2146 0.0058 Epinion 1.4088 0.0030 1.4098 0.0064 1.4078 0.0064 1.4006 0.0044 6 Conclusion This paper has proposed new stochastic processes. Our main contributions are as follows: (1) We have presented the BNP model of the BP as a Markov process consisting of a sequence of i.i.d. uniform random variables on [0, 1]. Owing to the one-to-one correspondence between BP and FP, the model can also be used as a probabilistic model on the set of all possible FPs. (2) We combined the BPP with the BBP to obtain a stochastic process for arbitrary RPs. As in conventional methods, we applied this process to the AHK representation to construct a BNP stochastic block model for relational data, and compared its predictive performance with that of the IRM, MP, and RTP. Broader Impact Clustering is one of the most fundamental machine learning tools for data analysis. The blockbreaking process (BBP) can be regarded as a multi-dimensional extension of clustering and it has a potential to give a new perspective to relational data analysis, for it would reveal latent structures in relational data (or network data) in much more flexible manner than other existing clustering methods, without tuning the model complexity. In fact, the BBP can extract latent clusters of relational data through rectangular partitioning (RP). While conventional models can express only limited classes of all possible RPs, the BBP can potentially capture arbitrary rectangular partitioning, keeping the central advantage of the Bayesian nonparametic (BNP) machine learning, and the BBP does not have to tune the model complexity regardless of the size of the input data. Therefore, the BBP will have a wide range of potential applications, including market research, pattern recognition, image processing, preand postprocessing of data, and structure learning of network models. For example, the BBP can be combined with deep neural network (DNN) models as a prior on the network, which simultaneously learns the DNN parameters and the network structure. It may also be used to expose and identify biases in data. The source code of the BBP-based relational model is available at https://github.com/nttcslab/baxter-permutation-process, with which you can try and examine the BBP-based relational data analysis by yourself. Our work is not facilitating any unethical aspects of machine learning technologies, by genuinely pursuing the development of Bayesian methods in many applications settings. However, as is often the case with any clustering methods (or more generally any predictive algorithms), our proposal can be misused in a variety of context. Since the BPP may reveal hidden clusters from any input relational matrices, unethical applications may lead to unexpected results due to unexpected cues. This problem is highly dependent on the choice of input data. Therefore, what is suitable as input data needs to be carefully considered from an ethical perspective. Funding disclosure Funding in direct support of this work is from NTT Corporation, without any third party funding. [1] http://snap.stanford.edu/data/wiki-Vote.html [2] http://snap.stanford.edu/data/ego-Facebook.html [3] http://snap.stanford.edu/data/ego-Twitter.html [4] http://snap.stanford.edu/data/soc-Epinions1.html [5] Airoldi, E.M., Costa, T.B., Chan, S.H.: Stochastic blockmodel approximation of a graphon: Theory andconsistent estimation. In: Advances in Neural Information Processing Systems (2013) [6] Aldous, D.J.: Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis 11, 581 598 (1981) [7] Aldous, D.J.: Exchangeability and related topics. École d Été St Flour, Lecture Notes in Mathematics 1117, 1 198 (1985) [8] Andrieu, C., Doucet, A., Holenstein, R.: Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(3), 269 342 (2010) [9] Baxter, G.: On fixed points of the composite of commuting functions. Proceedings of American Mathematical Society 15, 851 855 (1964) [10] Bochner, S.: Harmonic analysis and the theory of probability. University of California Press (1955) [11] Bouchard-Côté, A., Jordan, M.: Variational inference over combinatorial spaces. In: Advances in Nueral Information Processing Systems (2010) [12] Burridge, J., Cowan, R., Ma, I.: Full and half Gilbert tessellations with rectangular cells. Advances in Applied Probability 1, 1 19 (2013) [13] Caldas, J., Kaski, S.: Bayesian biclustering with the plaid model. In: 2008 IEEE Workshop on Machine Learning for Signal Processing. pp. 291 296 (2008) [14] Choi, D.S., Wolfe, P.J.: Co-clustering separately exchangeable network data. Annals of Statistics 42, 29 63 (2014) [15] Chung, F., Graham, R., Hoggatt, V., Kleiman, M.: The number of Baxter permutations. Journal of Combinatorics Theory, Series A 24, 382 394 (1978) [16] Crane, H.: Infinitely exchangeable partition, tree and graph-valued stochastic process. Ph.D. thesis, Department of Statistics, The University of Chicago (2012) [17] Dilks, K.: Quarter-turn Baxter permutations. ar Xiv:1710.07007 (2017) [18] Dulucq, S., Guibert, O.: Baxter permutations. Discrete Mathematics 180, 143 156 (1998) [19] Fan, X., Li, B., Sisson, S.A.: The binary space partitioning-tree process. In: International Conference on Artificial Intelligence and Statistics. pp. 1859 1867 (2018) [20] Fan, X., Li, B., Luo, L., Sisson, S.A.: Bayesian nonparametric space partitions: A survey. ar Xiv:2002.11394 (2020) [21] Fan, X., Li, B., Sisson, S.: Rectangular bounding process. In: Advances in Neural Information Processing Systems. pp. 7631 7641 (2018) [22] Fan, X., Li, B., Sisson, S.A.: Online binary space partitioning forests. ar Xiv:2003.00269 (2020) [23] Fan, X., Li, B., Sisson, S.A.: Binary space partitioning forests. ar Xiv:1903.09348 (2019) [24] Fan, X., Li, B., Wang, Y., Wang, Y., Chen, F.: The Ostomachion Process. In: AAAI Conference on Artificial Intelligence. pp. 1547 1553 (2016) [25] Felsner, S., Fusy, E., Noy, M., Orden, D.: Bijections for Baxter families and related objects. Journal of Combinatorial Theory, Series A 118(3), 993 1020 (2011) [26] Ge, S., Wang, S., Teh, Y.W., Wang, L., Elliott, L.: Random tessellation forests. In: Advances in Neural Information Processing Systems 32, pp. 9575 9585 (2019) [27] Gilbert, E.N.: Surface films of needle-shaped crystals. Applications of Undergraduate Mathematics in Engineering pp. 329 346 (1967) [28] Hong, X., Huang, G., Cai, Y., Gu, J., Dong, S., Cheng, C., Gu, J.: Corner block list: an effective and efficient topological representation of non-slicing floorplan. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (2000) [29] Hoover, D.N.: Relations on probability spaces and arrays of random variables. Tech. rep., Institute of Advanced Study, Princeton (1979) [30] Ishiguro, K., Sato, I., Nakano, M., Kimura, A., Ueda, N.: Infinite plaid models for infinite bi-clustering. In: AAAI Conference on Artificial Intelligence [31] Kallenberg, O.: On the representation theorem for exchangeable arrays. Journal of Multivariate Analysis 30(1), 137 154 (1989) [32] Kallenberg, O.: Symmetries on random arrays and set-indexed processes. Journal of Theoretical Probability 5(4), 727 765 (1992) [33] Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. In: AAAI Conference on Artificial Intelligence. pp. 381 388 (2006) [34] Lakshminarayanan, B., Roy, D., Teh, Y.W.: Mondrian forests: Efficient online random forests. In: Advances in Neural Information Processing Systems (2014) [35] Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links in online social networks. In: Proceedings of the 19th International Conference on World Wide Web. pp. 641 650 (2010) [36] Lin, D., Fisher, J.: Efficient sampling from combinatorial space via bridging. In: International Conference on Artificial Intelligence and Statistics (2012) [37] Lloyd, J., Orbanz, P., Ghahramani, Z., Roy, D.M.: Random function priors for exchangeable arrays with applications to graphs and relational data. In: Advances in Neural Information Processing Systems (2012) [38] Lovász, L.: Very large graphs. Current Developments in Mathematics 11, 67 128 (2009) [39] Mackisack, M.S., Miles, R.E.: Homogeneous rectangular tessellation. Advances on Applied Probability 28, 993 (1996) [40] Miller, K., Jordan, M.I., Griffiths, T.L.: Nonparametric latent feature models for link prediction. In: Advances in Neural Information Processing Systems, pp. 1276 1284 (2009) [41] Muthukrishnan, S., Poosala, V., Suel, T.: On rectangular partitionings in two dimensions: algorithms, complexity, and applications. In: The International Conference on Database Theory (1999) [42] Nakano, M., Ishiguro, K., Kimura, A., Yamada, T., Ueda, N.: Rectangular tiling process. In: Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 361 369 (2014) [43] Orbanz, P.: Infinite-dimensional exponential families in the cluster analysis of structured data. Ph.D. thesis, Eidgenössische Technische Hochschule Zürich (2008) [44] Orbanz, P.: Construction of nonparametric Bayesian models from parametric Bayes equations. In: Advances in Neural Information Processing Systems (2009) [45] Orbanz, P.: Conjugate projective limits. ar Xiv:1012.0363 (2011) [46] Orbanz, P., Roy, D.M.: Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 437 461 (2013) [47] Rodriguez, A., Ghosh, K.: Nested partition models. Tech. rep., Jack Baskin School of Engineering (2009) [48] Roy, D.M.: Computability, inference and modeling in probabilistic programming. Ph.D. thesis, Massachusetts Institute of Technology (2011) [49] Roy, D.M., Teh, Y.W.: The Mondrian process. In: Advances in Neural Information Processing Systems (2009) [50] Sakanushi, K., Kajitani, Y., Mehta, D.P.: The quarter-state-sequence floorplan representation. IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications 50, 376 386 (2003) [51] Sethuraman, J.: A constructive definition of Dirichlet priors. Statistica Sinica 4, 639 650 (1994) [52] Shan, H., Banerjee, A.: Bayesian co-clustering. In: IEEE International Conference on Data Mining. pp. 530 539 (2008) [53] Wang, P., Laskey, K.B., Domeniconi, C., Jordan, M.I.: Nonparametric bayesian co-clustering ensembles. In: SIAM International conference on Data Mining. pp. 331 342 (2011) [54] Zafarani, R., Liu., H.: Social computing data repository at ASU (2009) [55] Zhang, X., Kajitani, Y.: Space-planning: placement of modules with controlled empty area by singlesequence. In: Proceedings of Asia and South Pacific Design Automation Conference (2004)