# Permuton-induced Chinese Restaurant Process

Masahiro Nakano, Yasuhiro Fujiwara, Akisato Kimura, Takeshi Yamada, Naonori Ueda
NTT Communication Science Laboratories, NTT Corporation
{masahiro.nakano.pr, yasuhiro.fujiwara.kh, akisato.kimura.xn, takeshi.yamada.bc, naonori.ueda.fr}@hco.ntt.co.jp

This paper proposes the permuton-induced Chinese restaurant process (PCRP), a stochastic process on rectangular partitionings of a matrix. This distribution is suitable as a prior in Bayesian nonparametric (BNP) relational models that find hidden clusters in matrices and network data. Our main contribution is to introduce the notion of permutons into the well-known Chinese restaurant process (CRP) for sequence partitioning: a permuton is a probability measure on [0, 1] × [0, 1] and can be regarded as a geometric interpretation of the scaling limit of permutations. Specifically, we extend CRP so that its tables have a random geometric arrangement on [0, 1] × [0, 1] drawn from the permuton. By analogy with the relationship between the stick-breaking process (SBP) and CRP for the infinite mixture model of a sequence, this model can be regarded as a multi-dimensional extension of CRP paired with the block-breaking process (BBP), which was recently proposed as a multi-dimensional extension of SBP. While BBP always requires an infinite number of redundant intermediate variables, PCRP can be composed of a varying (finite) number of intermediate variables in a data-driven manner, depending on the size and quality of the observation data. Experiments show that PCRP improves prediction performance in relational data analysis by mitigating the local-optima and slow-mixing problems of conventional BNP models, because its local transitions in Markov chain Monte Carlo inference are more flexible than those of previous models.
1 Introduction

The multi-dimensional extension of the Chinese restaurant process (CRP) [9, 50] for infinitely exchangeable random arrays is one of the most significant unsolved issues in the field of Bayesian nonparametrics and its applications to machine learning. The standard CRP is a stochastic process representing random clustering of a sequence using the metaphor of customers sitting at tables. It is typically used when we need to cluster a sequence whose number of clusters is unknown in advance. The magic of CRP is that it requires only a finite number of parameters for finite data; nevertheless, the model itself has infinite representational power. Specifically, a mixture model using CRP is theoretically equivalent to the Dirichlet process mixture model (DPMM) [51, 27] represented by the stick-breaking process (SBP) [57] with an infinite number of parameters. The relationship between the two DPMM representations, CRP and SBP, can be explained by whether the intermediate random variables in de Finetti's representation theorem [28, 37] are marginalized out or not (Figure 1, (a) and (b)). CRP has contributed greatly to the development of the BNP machine learning field in the sense that it has made BNP not only theoretical but also practical: it enables models with infinite representational power to be simulated exactly on a computer with a finite number of parameters, without any approximation (e.g., finite truncation). In fact, a variety of extensions of CRP have been studied, including the nested CRP [16, 17, 60], the Chinese restaurant franchise [61], the tree-structured CRP [6], and the distance-dependent CRP [14, 15, 31]. However, in the development of CRP, the extension to BNP relational models has remained a historical conundrum.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 1: (a) SBP representation for sequence partitioning based on de Finetti's representation.
(b) CRP representation for sequence partitioning. Each customer (triangle) sequentially chooses a table (dark circle). (c) BBP representation for rectangular partitioning based on the AHK representation theorem. (d) Our proposed PCRP. We recall that a certain class of rectangular partitionings has a bijection to a certain class of permutations. Therefore, if we could impose an order based on permutations on the CRP tables, we could convert the table assignments of data by CRP into rectangular partitionings. To achieve this, we introduce a random geometric location on [0, 1] × [0, 1] for each CRP table, drawn from the permuton. (e) Illustration of a permuton. Top: an array representation of the permutation σ = 5724163; the i-th circle from the left has height σ(i). Bottom: the permuton corresponding to σ = 5724163, i.e., a probability measure on [0, 1] × [0, 1].

As a special case of the BNP relational models, the infinite relational model (IRM) [38], consisting of a product of CRPs, was proposed. The IRM can express random rectangular partitioning of a matrix; however, its representational capability is restricted to a minimal subset of rectangular partitionings called the regular grid (Figure 2, left). Therefore, in order to obtain models with higher representational capability, a variety of BNP models for relational data have been rapidly developed in recent years [38, 55, 54, 20, 53, 58, 46, 36, 19, 7, 23, 26, 43, 44, 30, 21, 25, 24, 56]. Recent excellent surveys can be found in [22, 49]. In particular, the Mondrian process (MP) [55, 54] and the rectangular tiling process (RTP) [48] have been proposed as new BNP models that expand the expressible classes of rectangular partitioning. However, both of them require an infinite number of parameters as intermediate random variables in the Aldous-Hoover-Kallenberg (AHK) representation theorem [8, 34, 37] (an extension of de Finetti's theorem to multi-dimensional arrays), similar to the SBP representation of DPMM.
As a result, Bayesian inference for those models always has to take care of an infinite number of parameters, which compromises the flexibility of local moves in Markov chain Monte Carlo (MCMC) inference and makes it highly susceptible to local optima and slow mixing. A natural question is whether it is possible to marginalize out the infinite number of intermediate random variables of these models, as in the CRP representation of DPMM. Unfortunately, an approach to achieve this has remained unknown for a long time. This paper deals with the multi-dimensional extension of CRP to represent wider classes of rectangular partitioning than regular grid partitioning. As an important stepping stone, the block-breaking process (BBP) [47] was proposed very recently as a multi-dimensional extension of SBP (Figure 1, (c)). The key insight behind the BBP construction is that rectangular partitionings can be represented indirectly using permutations. Specifically, a special class of permutations called Baxter permutations [13], which have a one-to-one correspondence with a particular class of rectangular partitionings, is used to extend the stick-breaking procedure of SBP for a sequence to the block-breaking procedure for a matrix (Figure 3). Although BBP can indeed be used as an infinitely exchangeable relational model, it requires an infinite number of redundant parameters, similar to SBP, making Bayesian inference highly susceptible to local optima and slow mixing. Therefore, a CRP-like representation that avoids the infinite number of redundant parameters of BBP, analogous to the relationship between SBP and CRP for DPMM, has been desired. Our contributions are summarized as follows:

Construction of a new stochastic process - Section 3.1 proposes a stochastic process such that each table of the CRP has random coordinates on [0, 1] × [0, 1].
Specifically, we suppose that the coordinates of each table are independent and identically distributed (i.i.d.) random variables drawn from a permuton [35] on [0, 1] × [0, 1]. A permuton is a probability measure on [0, 1] × [0, 1] and can also be regarded as a prior model for random permutations. By exploiting the correspondence between permutations and rectangular partitions [62, 52], PCRP can be used as a generative probabilistic model of rectangular partitioning of a matrix.

Figure 2: Several classes of rectangular partitions appearing in this paper. The details of each class are described in Section 2.1. Left: Regular grid partitioning; each block is characterized by the product of a row cluster and a column cluster. Second from left: Hierarchical partitioning. Third from left: Hierarchical partitioning with one place deeper in each layer. Fourth from left: Diagonal rectangulation. Right: Generic rectangulation.

Figure 3: Illustration of BBP [47], which can be regarded as a multi-dimensional extension of SBP. First, BBP uses a Markov process on Baxter permutations and transforms the evolution of Baxter permutations into the evolution of floorplan partitionings (i.e., rectangular partitionings ignoring the size of each rectangular block) using the one-to-one correspondence between Baxter permutations and floorplan partitionings [33] (Left). Next, BBP uses the block-breaking procedure to sequentially assign a random size to each room of the floorplan partitioning (Right).

While BBP [47] (i.e., a multi-dimensional extension of SBP) always has an infinite number of redundant intermediate variables, our PCRP can be composed of a varying (finite) number of intermediate variables in a data-driven manner, depending on the size and quality of the observation data. As a result, PCRP can lead to very flexible local moves in Bayesian inference, reducing the problems of local optima and slow mixing.
Unified framework for various classes of rectangular partitioning - Conventionally, stochastic processes for each class of rectangular partitioning have been constructed individually. By contrast, depending on the design of the permuton, PCRP provides a unified framework for various classes of rectangular partitioning. Section 3.2 shows how to tune the permuton of PCRP to adapt it to various classes of rectangular partitioning.

New representation of BNP model - As mentioned earlier, a multi-dimensional extension of CRP is a historical conundrum, and PCRP does not provide a perfect solution to this issue. In Section 4, the theoretical imperfection of PCRP is clarified. Specifically, we explain intuitively that the projectivity and exchangeability of PCRP change when viewed from two different perspectives: rectangular partitioning of a matrix and table assignment of data. We then propose a new representation of the BNP model as a way to overcome this imperfection, which applies PCRP as a bridging state to existing infinitely exchangeable relational models, inspired by MCMC inference via bridging [42].

2 Preliminaries

2.1 Permutation, permutation classes, and their relation to rectangular partitioning

Permutation and rectangulation - A permutation of a set S is defined as a bijection from S to itself. This paper always focuses on permutations of natural numbers S = [n] := {1, 2, . . . , n} (n ∈ N). For convenience, we use one-line notation for a permutation σ = σ1σ2 . . . σn. For example, the permutation σ = 312 means σ1 = 3, σ2 = 1, σ3 = 2. A rectangulation (also called a rectangular partitioning) is a tiling of a rectangle by a finite number of disjoint rectangles such that no four of its rectangles share a single corner. In recent years, it has become clear that there is a close relationship between permutations and rectangulations.
Here we focus in particular on the fact that every permutation can be uniquely converted into a generic rectangulation (Figure 2, right), that is, a member of the class of rectangulations without any constraints:

Proposition 2.1 (See Proposition 6.2 in [39] and Proposition 4.2 in [52]) There is a surjective map from permutations to generic rectangulations.

Figure 4: Map from permutations (e.g., σ = 34861725) to diagonal rectangulations [39]. We first draw distinct labeled points on the diagonal, with one of the points at the top-left corner and another at the bottom-right corner. We then perform the following steps sequentially. Let R be the union of the rectangles drawn in the first i − 1 steps. To draw the i-th rectangle, we consider the label i on the diagonal. If the diagonal point p immediately above or to the left of the label i is not in R, then the upper-left corner of the new rectangle is the rightmost point of R immediately to the left of p. If the diagonal point p immediately below or to the right of the label i is not in R, then the lower-right corner of the new rectangle is the highest point of R immediately below p. If p is in R, then the lower-right corner of the new rectangle is the rightmost point of R immediately to the right of p.

Figure 5: Transformation from diagonal rectangulations to generic rectangulations [52]. Left: Two notions, edge and wall. An edge of a rectangulation is a line segment contained in the side of some rectangle such that the endpoints of the segment are vertices and the segment has no vertices in its interior. A wall of a rectangulation is a maximal union of edges forming a line segment. Middle and right: Two examples of the map from permutations (e.g., σ = 34861725 and σ = 34128675) to generic rectangulations. First, we convert the permutation to a diagonal rectangulation. Then we assign to each vertex on the walls the label of the block that contains that vertex as its own upper-left or lower-right corner.
Finally, the order of the vertices on each wall is rearranged according to the permutation: vertices on a horizontal wall are aligned in permutation order from left to right, and vertices on a vertical wall are aligned in permutation order from bottom to top.

Fortunately, we can constructively obtain this map as a two-stage transformation from a permutation to a generic rectangulation by way of a diagonal rectangulation. A rectangulation is diagonal if the interior of each rectangle intersects the diagonal. Figure 4 and Figure 5 describe the details of the first and second stages, respectively. We use this transformation as an important component of the generative model of PCRP (Section 3.1).

Permutation classes and rectangulation classes - To investigate the properties of permutations, we can focus on subsets of permutations called permutation classes. To identify a permutation class, we introduce the notion of occurrences. Suppose that τ and σ are permutations of size k and n, respectively. An occurrence of the pattern τ in σ is a subsequence σi1, . . . , σik that is order-isomorphic to τ, that is, for all indices s, t ∈ [k], we have σis < σit if and only if τs < τt. One common way to characterize a permutation class is to specify which patterns (typically very short permutations) are not contained as occurrences in the permutations belonging to that class. Various permutation classes have been studied in depth; here, however, we focus on several permutation classes that are particularly relevant to rectangular partitioning. The details of each class are explained in the supplementary material. It should be emphasized that, surprisingly, permutations of a particular class can be in one-to-one correspondence with rectangulations of a particular class. A brief list is given below.
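As a concrete illustration of occurrences, the pattern count can be computed by brute force for small permutations. A minimal sketch in Python (the function names are ours, not from the paper):

```python
from itertools import combinations

def pattern_of(seq):
    """Return the pattern (relative order) of a sequence as a tuple,
    e.g. (3, 8, 1) has pattern (2, 3, 1)."""
    ranks = sorted(range(len(seq)), key=lambda i: seq[i])
    pat = [0] * len(seq)
    for rank, idx in enumerate(ranks, start=1):
        pat[idx] = rank
    return tuple(pat)

def count_occurrences(tau, sigma):
    """Count the subsequences of sigma that are order-isomorphic to tau."""
    k = len(tau)
    return sum(1 for idxs in combinations(range(len(sigma)), k)
               if pattern_of([sigma[i] for i in idxs]) == tuple(tau))
```

For example, the pattern 21 occurs twice in σ = 312, via the subsequences 31 and 32.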
We will use this fact in Section 3.2 to show that PCRP can potentially serve as a unified model for the various classes of rectangular partitioning.

2-clumped permutation: The 2-clumped permutations have a bijection to generic rectangulations (Figure 2, right) [52], on which no restrictions are imposed.

Baxter permutation and twisted Baxter permutation: Baxter permutations have a bijection to diagonal rectangulations (Figure 2, fourth from left) [62, 5, 32], which have a representative in which every rectangle's interior intersects the diagonal.

Separable permutation: Separable permutations have a bijection to hierarchical and diagonal rectangulations [59, 5], a subset of hierarchical partitionings (Figure 2, second from left), which are expressed as binary trees whose nodes represent a vertical or horizontal separation of a rectangle into two disjoint rectangles.

Separable skew-merged permutation: Separable skew-merged permutations have a bijection to a special subset of hierarchical partitionings, namely those where only one rectangle is allowed to be further cut in each layer (Figure 2, third from left) [45].

2.2 Permuton

In this paper, we introduce the permuton [35] as a very useful tool for handling random permutations. Before defining the permuton, we begin with a geometric interpretation of a permutation (Figure 1, (e) top). To a permutation σ, we can associate a probability measure γσ on [0, 1] × [0, 1] as follows. We first divide [0, 1] × [0, 1] into an n × n grid of squares of size 1/n × 1/n. Then, we define the density of γσ on the square in the i-th row and j-th column to be the constant n if σi = j and 0 otherwise. Therefore, γσ can be regarded as a geometric representation of the permutation matrix of σ. More generally, we define a permuton to be a probability measure γ on [0, 1] × [0, 1] with uniform marginals (Figure 1, (e) bottom): γ([a, b] × [0, 1]) = b − a = γ([0, 1] × [a, b]) for all 0 ≤ a ≤ b ≤ 1.
We emphasize that, for any permutation σ, the corresponding probability measure γσ is a permuton. Specifically, the permutation σ can be identified with the permuton γσ given by a sum of Lebesgue area measures [35]:

γσ(A) = n ∑_{i=1}^{n} Leb( [(i − 1)/n, i/n] × [(σ(i) − 1)/n, σ(i)/n] ∩ A ),

for every Borel measurable set A of [0, 1] × [0, 1], where Leb(·) denotes the Lebesgue measure. We now provide the intuition behind why the permuton is useful in dealing with random permutations. Suppose that τ and σ are permutations of size k and n, respectively. We set

f occ(τ, σ) := (k!(n − k)!/n!) · #{occurrences of τ in σ}.

Intuitively, f occ(τ, σ) is the probability of finding the pattern τ in σ when we take k elements uniformly at random from σ. Similarly, replacing the permutation σ with a permuton γ, we can consider the probability of occurrence of a pattern τ:

f occ(τ, γ) := P[u1, . . . , uk form the pattern τ],

where u1, . . . , uk are i.i.d. points on [0, 1] × [0, 1] drawn from the permuton γ. Now, a natural question is whether f occ(τ, σ) and f occ(τ, γ) are generally consistent. The following theorem gives an answer:

Theorem 2.2 (See, e.g., Theorem 2.5 [29] and Lemma 3.5 [35]) Consider a sequence of permutations σ1, σ2, . . . , σn, . . . of size tending to infinity. Then γσn → γ if and only if, for every τ, f occ(τ, σn) → f occ(τ, γ). Moreover, if τ and σ are permutations of size k and n, respectively, then |f occ(τ, σ) − f occ(τ, γσ)| ≤ k(k − 1)/(2n).

As a result, random permutations belonging to some class can be controlled via limits of sequences of permutons. We use the permuton as a prior model for the PCRP table coordinates (Section 3.1).

3 Permuton-induced Chinese restaurant process (PCRP)

This section proposes a new stochastic process, which is the main contribution of this paper.
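The permuton γσ and the pattern-frequency bound of Theorem 2.2 can be checked numerically. The sketch below (our own illustration, not from the paper) samples i.i.d. points from γσ by picking one of the n occupied grid cells uniformly and then a uniform point inside it, and estimates f occ(τ, γσ) by Monte Carlo:

```python
import random

def sample_from_gamma_sigma(sigma):
    """Draw one point from the permuton gamma_sigma: pick one of the n
    occupied grid cells uniformly, then a uniform point inside that cell."""
    n = len(sigma)
    i = random.randrange(n)
    x = (i + random.random()) / n
    y = (sigma[i] - 1 + random.random()) / n
    return (x, y)

def focc_permuton(tau, sigma, trials=20000):
    """Monte Carlo estimate of f_occ(tau, gamma_sigma): draw k i.i.d. points
    and check whether, read left to right, their heights form the pattern tau."""
    k = len(tau)
    hits = 0
    for _ in range(trials):
        pts = sorted(sample_from_gamma_sigma(sigma) for _ in range(k))
        ys = [y for _, y in pts]
        pat = [0] * k
        for rank, idx in enumerate(sorted(range(k), key=lambda j: ys[j]), 1):
            pat[idx] = rank
        hits += (tuple(pat) == tuple(tau))
    return hits / trials
```

For σ = 21 and τ = 21, f occ(τ, σ) = 1 while the estimate of f occ(τ, γσ) is about 3/4; the gap 1/4 respects the bound k(k − 1)/(2n) = 1/2.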
First, we describe the stochastic process, PCRP, in terms of a generative model and show how the notion of the permuton can be introduced into CRP to obtain a model for rectangular partitioning. Next, we discuss how to design the permuton so that PCRP can represent specific classes of rectangular partitioning.

3.1 Model description of PCRP from a generative probabilistic point of view

PCRP is composed of an input matrix, the concentration parameter η > 0, and the permuton γ, i.e., a probability measure on [0, 1] × [0, 1]. The specific design of the permuton will be discussed in the next subsection; for now the reader may assume, for example, γ(·) = Leb(·), that is, the uniform distribution on [0, 1] × [0, 1]. The following steps represent the overall picture of PCRP from the viewpoint of a probabilistic generative model (Figure 6 gives an overview of the permuton-induced CRP described in this subsection).

Step 1: Table assignment process for all row and column elements of the input matrix by a single CRP (Figure 6, first and second from left) - Each row (white triangle) and each column (black triangle) of the input matrix chooses either an existing table to sit at with probability St/(S + η) (t = 1, . . . , K) or a new table with probability η/(S + η), as in a standard CRP [9], where K is the number of tables, S is the total number of customers seated at the tables, and St is the number of customers seated at the t-th table. If a new table is chosen, we draw the coordinates of the table on [0, 1] × [0, 1] from the permuton γ. Here, for simplicity, we show a model that delivers row and column elements to a single CRP. However, the model can easily be extended to use different CRPs for rows and columns, as in the conventional IRM [38] (e.g., using the Chinese restaurant franchise [61], where the CRPs for rows and columns share common dishes).
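Step 1 can be sketched as follows; `sample_permuton` stands in for any permuton sampler (e.g., uniform on [0, 1] × [0, 1]), and the helper name is ours:

```python
import random

def crp_with_coords(num_customers, eta, sample_permuton):
    """Step 1 (sketch): seat row/column customers by a single CRP; each new
    table draws its coordinates on [0,1]^2 from the permuton.
    `sample_permuton` is any function returning a point in [0,1]^2."""
    assignments, counts, coords = [], [], []
    for _ in range(num_customers):
        # P(existing table t) = S_t / (S + eta), P(new table) = eta / (S + eta)
        weights = counts + [eta]
        choice = random.choices(range(len(weights)), weights=weights)[0]
        if choice == len(counts):          # open a new table
            counts.append(0)
            coords.append(sample_permuton())
        counts[choice] += 1
        assignments.append(choice)
    return assignments, coords
```

The number of tables (and hence of table coordinates) grows with the data, which is exactly the data-driven, finite parameterization discussed above.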
Step 2: Numbering the CRP tables and placing customers on the sides of a rectangle (Figure 6, second and third from left) - The CRP tables are numbered according to the vertical order of their coordinates, which leads to a permutation σ = σ1 . . . σK. Note that we recursively perform this numbering each time we add a new table. Next, we consider a rectangle (referred to as the outer rectangle) and its main diagonal, and divide the diagonal into K equal-sized line segments numbered 1, . . . , K from top left to bottom right. The customer corresponding to each row of the input matrix is moved to the position on the left side of the outer rectangle corresponding to the number of the table it belongs to. Similarly, we move the customer corresponding to each column of the input matrix to the position on the top side of the outer rectangle corresponding to the number of the table it belongs to.

Step 3: Transforming the permutation into a generic rectangulation (Figure 6, fourth and fifth from left) - We apply the map from permutations to rectangulations described in Proposition 2.1 (Figures 4 and 5) to the permutation σ derived from the PCRP table coordinates in Step 2. This transformation is performed in two steps: first converting the PCRP table assignment (Step 1) into a diagonal rectangulation (Figure 6, fourth), and then converting it further into a generic rectangulation (Figure 6, right). As a result, PCRP yields a random rectangular partition of the input matrix.

Intuitive interpretation of PCRP: We recall that PCRP consists of two components: (1) random table assignments for all the row and column elements of the input matrix, and (2) random permutations represented by the table coordinates. The table assignments serve to roughly group the rows and columns of the input data, and the random table coordinates serve to transform those rough groupings into a detailed rectangular partitioning.
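Under one plausible reading of Step 2 (tables read left to right by their horizontal coordinate, with σi given by the vertical rank of the i-th table), the permutation can be recovered from the table coordinates as follows; this is our own sketch of the numbering, not code from the paper:

```python
def coords_to_permutation(coords):
    """Turn table coordinates on [0,1] x [0,1] into a permutation:
    read the tables left to right; sigma_i is the vertical rank
    (1 = lowest) of the i-th table in that order."""
    n = len(coords)
    order_x = sorted(range(n), key=lambda t: coords[t][0])
    rank_y = {t: r for r, t in enumerate(
        sorted(range(n), key=lambda t: coords[t][1]), start=1)}
    return tuple(rank_y[t] for t in order_x)
```

For example, three tables at (0.1, 0.9), (0.5, 0.2), (0.8, 0.6) yield σ = 312, and adding a new table simply re-runs this ranking, matching the recursive renumbering described above.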
The practical advantage of PCRP is that when a new CRP table is added, it can be inserted into every possible place in the rectangular partition (while keeping the whole a valid rectangular partition). This eliminates a disadvantage of the conventional BBP, where the insertion of a new rectangle is restricted to a specific place in the lower-right corner (Figure 3).

3.2 Design of the permuton to provide various representational capabilities to PCRP

In this subsection, we discuss how to choose a permuton γ so as to restrict the rectangular partitions generated by PCRP to a specific class. We start with the simplest case, γ(·) = Leb(·), called the uniform permuton [35]. As is well known, if σn is a uniform random permutation of size n, then γσn converges in distribution to the uniform permuton. We recall that every permutation can be transformed into some rectangular partition. Thus it might seem that PCRP with the uniform permuton can be used as a probabilistic model for arbitrary rectangular partitioning. However, there is one crucial caveat: the map between permutations (the set of all permutations) and generic rectangulations is not a one-to-one correspondence. This means that even if the random permutations are uniform, the corresponding random rectangular partitions they are transformed into are not uniform: under the uniform permuton, some rectangular partitions are more likely to appear than others. We could ignore this bias in practice, but theoretically it is essential to know how to eliminate it. We briefly introduce some permuton designs that are uniform over certain classes of rectangular partitioning. A brief list follows; details are given in the supplementary material.
Hierarchical partitioning with one place deeper in each layer (Figure 2, third from left) - Uniform partitioning of this class can be obtained by restricting the random permutation induced by the CRP table coordinates to be a uniform separable skew-merged permutation. If σn is a uniform separable skew-merged permutation of size n, then γσn converges in distribution to a deterministic permuton γ (see, e.g., Theorem 3.3 and Section 3.2.1 of [12]). Intuitively, this permuton γ has an X-shaped density on [0, 1] × [0, 1].

Hierarchical and diagonal rectangulation (Figure 2, second and fourth from left) - Uniform hierarchical and diagonal rectangulations can be expressed by uniform separable permutations. If σn is a uniform separable permutation of size n, then γσn converges in distribution to the Brownian separable permuton [11, 10, 45].

Diagonal rectangulation (Figure 2, fourth from left) - Uniform diagonal rectangulations can be expressed by uniform Baxter permutations. If σn is a uniform Baxter permutation of size n, then γσn converges in distribution to the Baxter permuton [18].

Generic rectangulation (Figure 2, right) - To the best of our knowledge, there is still no known permuton that corresponds to 2-clumped permutations, which have a bijection to generic rectangulations. This means that it is currently tricky to explicitly design a permuton that always restricts its samples to 2-clumped permutations. One possible strategy is to relax the permuton to one that generates a broader class of permutations (e.g., γ(·) = Leb(·)), and instead use rejection sampling to restrict the generated permutations so that they are actually 2-clumped. Details are given in the supplementary material.

4 Intermediate-level representation between SBP and CRP

PCRP is very useful in practical relational data analysis due to the following two advantages.
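As an illustration of the X-shaped limit above, one can sample from a toy measure supported on the two diagonals of [0, 1] × [0, 1]. This is only a caricature of the deterministic limiting permuton of [12] (whose exact density we do not reproduce here); note that both marginals stay uniform for any mixing weight p:

```python
import random

def sample_x_permuton(p=0.5):
    """Illustrative sampler for an X-shaped measure on [0,1]^2:
    with probability p a point (u, u) on the main diagonal,
    otherwise (u, 1 - u) on the anti-diagonal, u ~ Uniform[0,1].
    Both marginals are uniform for any p, as a permuton requires."""
    u = random.random()
    return (u, u) if random.random() < p else (u, 1.0 - u)
```

Every sample lands on one of the two diagonals, so a scatter plot of many draws traces out the X shape.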
(1) The model adjusts its complexity in a data-driven manner according to the quality and size of the input data. (2) The design of the permuton allows us to handle various classes of rectangular partitioning in a unified manner. In this section, we explore the former point more deeply from the viewpoint of the theory of Bayesian nonparametrics. A precise explanation would require the notions of projective system, projectivity, and exchangeability, which are beyond the scope of this paper. Therefore, we focus on an intuitive explanation and provide a rigorous discussion in the supplementary material.

Motivation - Just as the standard CRP for sequence partitioning is an infinitely exchangeable model, PCRP has projectivity and exchangeability in terms of the CRP table assignments of data. However, from the perspective of a rectangular partitioning model for relational data analysis, the projectivity and exchangeability of PCRP, unfortunately, do not hold. We explain the implications of this fact in terms of practical intuition for relational data analysis. We consider a pair of matrices I ⊂ J ⊂ {1, 2, . . .} × {1, 2, . . .}; namely, the smaller matrix I is an extraction of some rows and columns of the larger matrix J. The only theoretical imperfection of PCRP can be explained as follows: the random rectangular partitioning of the smaller matrix I under PCRP is not equivalent to the restriction to I of the random rectangular partitioning of the larger matrix J under PCRP. In other words, PCRP does not have perfect self-similarity with respect to the size of the data in terms of the rectangular partitioning model, but only in the imaginary object of the CRP table assignments. This implies that PCRP may have to tune its model complexity (e.g., the concentration parameter of the CRP) according to the size of the data.
Nevertheless, in a practical sense, we believe that this is not a significant issue; local optima and slow mixing in Bayesian inference have a greater impact on the performance of the model. However, readers familiar with the theory of Bayesian nonparametrics may wish to have a BNP model that is strictly projective and exchangeable in terms of rectangular partitioning. For such readers, we also propose a new way to create an exactly infinitely exchangeable model while using PCRP. Specifically, we apply PCRP as a bridging state to existing infinitely exchangeable relational models, inspired by MCMC inference via bridging [42]. We are interested in the set of rectangular partitionings of a matrix I ⊂ {1, 2, . . .} × {1, 2, . . .} whose rows and columns are indexed by natural numbers. We consider probability measures µ_X^I on (X_I, 2^{X_I}), where X_I is the set of rectangular partitionings of the matrix I (i.e., a constrained combinatorial space), and hope to find an exact BNP model (Figure 7, left). Needless to say, a number of BNP models have been proposed as the exact BNP model µ_X^I, including the IRM [38], the MP-based relational model [55], and the BBP-based relational model [47]. However, these exact BNP models have conventionally struggled with local optima and slow mixing in Bayesian inference.

Figure 7: Illustration of the exact BNP model with bridging states. Left: Existing exact BNP model on rectangular partitionings X_I. Middle: PCRP on imaginary table assignments of data Y_I. Right: BNP model on the extended space X_I ∪ Y_I. Sampling from µ_X^I is equivalent to drawing samples (gray circles) from X_I ∪ Y_I via the joint chain and discarding those (white circles) from Y_I.

For example, in the standard MCMC inference phase, we wish to sample from the distribution µ_X^I over the constrained combinatorial space X_I.
Typically, using some local moves, we can derive a Markov chain with a transition matrix, which may mix slowly or even be non-ergodic. To mitigate these issues, we propose introducing auxiliary spaces to relax the local moves in combinatorial spaces and to connect different regions of the sample space between which communication is difficult or impossible.

Exact BNP model with bridging states - First, as the auxiliary space Y_I, we introduce the table assignments of I by CRP together with a set of coordinates on [0, 1] × [0, 1] for the tables. The corresponding probability model µ_Y^I on (Y_I, 2^{Y_I}) is the PCRP described in the previous section (Figure 7, middle). Then, connecting the bridging states in the auxiliary space Y_I with samples in X_I, we consider a combined probability model over the union space X_I ∪ Y_I. To address probability models on the union space X_I ∪ Y_I, we extend the previous models µ_X^I and µ_Y^I as follows: µ̂_X^I(A) := µ_X^I(A ∩ X_I) and µ̂_Y^I(A) := µ_Y^I(A ∩ Y_I) for all A ∈ 2^{X_I ∪ Y_I}. For simplicity of notation, the modified models µ̂_X^I and µ̂_Y^I will be denoted by the original symbols µ_X^I and µ_Y^I in the following. Finally, we consider a probability model µ_+^I (Figure 7, right) of the form µ_+^I = α µ_X^I + (1 − α) µ_Y^I, where 0 ≤ α ≤ 1 is a tunable real variable. Fortunately, we can treat this new model µ_+^I as a BNP model; that is, there uniquely exists a projective limit of the µ_+^I (I ⊂ N × N) (see the supplementary material for details). In summary, the BNP model with bridging states has the following advantages: (1) The family {µ_+^I}_{I ∈ E} can be treated in the same way as a usual BNP model, i.e., the model has self-similarity (self-consistency) regardless of the size of the observed inputs. (2) If we restrict the space of interest to X_I (i.e., forget about the auxiliary space Y_I), it becomes equivalent to the exact BNP model on rectangular partitions. More specifically, we have µ_X^I(A) = (1/α) µ_+^I(A ∩ X_I).
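Sampling from the two-component mixture µ_+^I = α µ_X^I + (1 − α) µ_Y^I is straightforward once samplers for the components are available. A schematic sketch (the sampler arguments are placeholders; the actual inference alternates Markov chain moves within and between the two spaces rather than drawing i.i.d. samples):

```python
import random

def sample_mu_plus(alpha, sample_mu_X, sample_mu_Y):
    """Draw from mu_+ = alpha * mu_X + (1 - alpha) * mu_Y on X_I ∪ Y_I.
    Keeping only the draws that land in X_I recovers mu_X
    (up to the normalizing factor 1/alpha)."""
    if random.random() < alpha:
        return ('X', sample_mu_X())   # rectangular partition in X_I
    return ('Y', sample_mu_Y())       # bridging table-assignment state in Y_I
```

Discarding the 'Y'-tagged draws implements the relation µ_X^I(A) = (1/α) µ_+^I(A ∩ X_I) described above.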
5 Application to relational data analysis

5.1 PCRP-based relational model and Bayesian inference

Relational model - The PCRP-based relational model is applied to an input observation matrix $Z = (Z_{i,j})_{N \times M}$. We suppose that the input matrix $Z$ consists of categorical elements, i.e., $Z_{i,j} \in \{1, 2, \ldots, D\}$ ($D \in \mathbb{N}$). The generative model is constructed as follows. First, a random table assignment is performed by a single CRP for all rows and columns of the input matrix (Step 1 described in Section 3.1). We denote by $r_i$ ($\in \mathbb{N}$) the index of the table to which the $i$-th row is assigned and by $c_j$ ($\in \mathbb{N}$) the index of the table to which the $j$-th column is assigned. During the table assignment by CRP, whenever a new table is generated, random coordinates on $[0, 1] \times [0, 1]$ for that table are drawn from the permuton $\gamma$ (Step 1). We denote by $L_t$ ($\in [0,1]^2$) the coordinates assigned to the $t$-th table ($t = 1, 2, \ldots, K$), where $K$ is the number of tables generated by the CRP. The PCRP table assignments $\mathcal{T} := (r_1, \ldots, r_N, c_1, \ldots, c_M)$ and the PCRP table coordinates $\mathcal{L} := (L_1, \ldots, L_K)$ can be uniquely transformed into a rectangular partition of the input matrix $Z$ (Steps 2-3). For simplicity, we denote by $R(\mathcal{T}, \mathcal{L})$ the rectangular partition derived from $\mathcal{T}$ and $\mathcal{L}$. Each block indexed by $k$ ($\in \mathbb{N}$) in the rectangular partition $R(\mathcal{T}, \mathcal{L})$ has a latent Dirichlet random variable $\vartheta_k \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}_0)$ ($k = 1, 2, \ldots, K$), where $\boldsymbol{\alpha}_0 = (\alpha_0, \ldots, \alpha_0)$ is a $D$-dimensional non-negative hyperparameter. Each element $Z_{i,j}$ is drawn from the categorical distribution with parameter $\vartheta_{k(i,j)}$, where $k(i,j)$ indicates the block index to which the entry at the $i$-th row and $j$-th column belongs. In summary, inference can be viewed as the problem of estimating the PCRP table assignments $\mathcal{T}$ and the PCRP table coordinates $\mathcal{L}$, given the input matrix $Z$ and the hyper-parameters: the permuton $\gamma$, the Dirichlet parameter $\alpha_0 > 0$, and the concentration parameter of the PCRP $\eta > 0$.

Figure 8: Illustration of generating a new table in Gibbs sampling. Left: coordinates of the existing $K$ tables with customers already seated. Second from left: the $(K+1)^2$ subsections that are candidates for the coordinates from which the new table will be generated. Third from left: the subsection (colored in gray) $[v_1, v_2] \times [h_1, h_2]$ chosen as the location where the new table will be generated. Right: coordinates of the new table (colored in dark gray) drawn from the permuton $\gamma$ restricted to $[v_1, v_2] \times [h_1, h_2]$.

Bayesian inference - The PCRP relational model leads to a very simple inference algorithm based on MCMC methods. Following the generative model described above, the joint probability density is expressed as

$$p(Z, \mathcal{T}, \mathcal{L} \mid \gamma, \alpha_0, \eta) = p_{\mathrm{CRP}}(\mathcal{T} \mid \eta) \prod_{t=1}^{K} p_{\mathrm{perm.}}(L_t \mid \gamma) \; p_{\mathrm{obs.}}(Z \mid R(\mathcal{T}, \mathcal{L}), \alpha_0), \quad (1)$$

where the first term $p_{\mathrm{CRP}}(\mathcal{T} \mid \eta)$ is the probability distribution of the standard CRP, the second term $p_{\mathrm{perm.}}(L_t \mid \gamma)$ is the probability density of $L_t$ under the permuton $\gamma$, and the third term is

$$p_{\mathrm{obs.}}(Z \mid R(\mathcal{T}, \mathcal{L}), \alpha_0) \propto \prod_{k} \frac{\Gamma(D\alpha_0)}{\Gamma\!\left(D\alpha_0 + \sum_{d=1}^{D} N_{k,d}\right)} \prod_{d=1}^{D} \frac{\Gamma(\alpha_0 + N_{k,d})}{\Gamma(\alpha_0)},$$

where $N_{k,d}$ denotes the number of elements in both the $k$-th block and the $d$-th category of the categorical distribution. The MCMC inference algorithm simulates the above joint probability by sequentially updating the table assignments $\mathcal{T}$ and the table coordinates $\mathcal{L}$. One natural question is whether PCRP admits Gibbs sampling for updating the table assignments like the usual CRP. Fortunately, it does. Updating the table assignment of a row or column involves assigning it to an existing table or generating a new table, as in the standard CRP. Calculating the probability of the former is straightforward, whereas calculating the latter, i.e., the probability of generating a new table, requires a little more care. Figure 8 illustrates the case where a new table is generated in Gibbs sampling. Suppose that there are currently $K$ tables.
The posterior probability of choosing a new table depends on the random coordinates of that table and the random rectangular partition derived from them, which yields at most $(K+1)^2$ distinct new rectangular partitions. Specifically, for each of the $K$ tables on $[0,1] \times [0,1]$, we draw a horizontal and a vertical crosshair line, dividing $[0,1] \times [0,1]$ into $(K+1)^2$ subsections (Figure 8, second from left). The probability of the new table's coordinates falling into each of those $(K+1)^2$ subsections can be calculated using Equation (1), so the sum of those probabilities gives the probability that a new table will be generated. This can be calculated exactly without any approximation. Therefore, the conditional posterior distributions for $r_i$ and $c_j$ can be calculated explicitly, and Gibbs sampling can be easily performed. The whole MCMC algorithm is discussed in the supplementary material. Our code will be available at https://github.com/nttcslab/permuton-induced-crp.

5.2 Experimental evaluation

We compare PCRP with existing BNP models for rectangular partitioning: IRM [38], MP [55], RTP [48], and BBP [47]. The detailed experimental setup is described in the supplementary material. Datasets - We use four social network datasets [40]: Wiki [1], Facebook [2], Twitter [3], and Epinions [4]. All data are public and contain no personally identifiable information (see [41] for licenses). We selected the top 1000 active nodes based on their interactions with others, and then randomly sampled a 500 × 500 matrix to construct the relational data, as in [25, 47].
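The crosshair subdivision used in the Gibbs step above (Figure 8) can be sketched in a few lines. The helper names below are hypothetical, and as a stand-in for the paper's permuton $\gamma$ we use the Lebesgue (uniform) measure on $[0,1] \times [0,1]$, which has uniform marginals and is therefore itself a valid permuton; its mass on a rectangle is simply the rectangle's area.

```python
# Sketch of the (K+1)^2 crosshair subdivision from the Gibbs step (Figure 8).
# Each of the K existing tables contributes one vertical and one horizontal
# line, cutting [0,1]x[0,1] into (K+1)^2 candidate subsections for a new
# table's coordinates. Names and the uniform permuton are assumptions.

def subsections(tables):
    """Return the (K+1)^2 rectangles [v1,v2]x[h1,h2] induced by the tables."""
    xs = sorted(x for x, _ in tables)
    ys = sorted(y for _, y in tables)
    v_cuts = [0.0] + xs + [1.0]   # K+1 vertical strips
    h_cuts = [0.0] + ys + [1.0]   # K+1 horizontal strips
    return [((v_cuts[i], v_cuts[i + 1]), (h_cuts[j], h_cuts[j + 1]))
            for i in range(len(v_cuts) - 1)
            for j in range(len(h_cuts) - 1)]

def uniform_mass(rect):
    """Mass of a rectangle under the Lebesgue permuton: just its area."""
    (v1, v2), (h1, h2) = rect
    return (v2 - v1) * (h2 - h1)

tables = [(0.2, 0.7), (0.6, 0.3)]            # K = 2 existing table coordinates
cells = subsections(tables)                   # (K+1)^2 = 9 candidate subsections
total = sum(uniform_mass(c) for c in cells)   # masses over all cells sum to 1
```

Weighting each cell's mass by the likelihood term from Equation (1) and summing, as described above, would give the total probability of opening a new table; a concrete permuton can be substituted for `uniform_mass`.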
We held out 20% of the cells of the input data for testing, and each model was trained by MCMC using the remaining 80% of the cells. We evaluated the models using perplexity as a criterion: $\mathrm{perp}(\hat{Z}) = \exp(-\log p(\hat{Z}) / E)$, where $E$ is the number of non-missing cells in the partitioned matrix $\hat{Z}$.

Figure 9: Perplexity comparison on the four real-world datasets (Wiki, Facebook, Twitter, Epinions): evolution of test perplexity (mean ± std) against the logarithm of wall-clock time for BBP, RTP, MP, IRM, and PCRP.

Figure 10: Four examples of analysis results for each dataset by BBP (top) and PCRP (bottom).

Experimental results - We ran 10 trials of analysis for each method on each data set. Figure 9 summarizes the test perplexity comparison. Recall that IRM and MP are limited in the class of rectangular partitions they can represent, and only PCRP, BBP, and RTP can represent arbitrary rectangular partitionings. Indeed, these arbitrary rectangular partitioning models achieve good prediction accuracy once MCMC converges, and among them PCRP achieves the best prediction accuracy on all the data. From another perspective, BBP, RTP, and MP are SBP-type models, while PCRP and IRM are CRP-type models; PCRP improves the perplexity from an early stage compared with the SBP-type methods. In light of the above, we confirm that PCRP reduces the problems of local optima and slow mixing in Bayesian inference. As shown in Figure 10, BBP extracts a very biased rectangular partition because the addition of new blocks tends to occur in the lower right corner.
In other words, BBP only has a high probability of adding blocks toward the lower right, so once a coarse cluster is created in the upper left, it is difficult to refine that coarse cluster into a fine one. In contrast, PCRP has the flexibility to add new blocks anywhere, so it can properly find both coarse and fine clusters across the entire area.

6 Conclusion and Discussion

Summary - This paper has proposed a new stochastic process for relational data analysis. Our main contributions are as follows: (1) We introduce the notion of the permuton into the CRP and obtain a probabilistic model of rectangular partitioning of a matrix. (2) We show a unified framework for adapting PCRP to various classes of rectangular partitioning via the design of the permuton. (3) We discuss the theoretical imperfection of PCRP with respect to projectivity and exchangeability in terms of rectangular partitioning, and propose a new representation of the BNP model that overcomes this imperfection by applying PCRP as a bridging state to existing BNP relational models.

Funding disclosure

Funding in direct support of this work came from NTT Corporation, without any third-party funding.

References

[1] http://snap.stanford.edu/data/wiki-Vote.html [2] http://snap.stanford.edu/data/ego-Facebook.html [3] http://snap.stanford.edu/data/ego-Twitter.html [4] http://snap.stanford.edu/data/soc-Epinions1.html [5] Ackerman, E., Barequet, G., Pinter, R.Y.: A bijection between permutations and floorplans, and its applications. Discrete Applied Mathematics 154(12), 1674–1684 (2006) [6] Adams, R.P., Jordan, M.I., Ghahramani, Z.: Tree-structured stick breaking for hierarchical data. In: Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 23. Curran Associates, Inc. (2010) [7] Airoldi, E.M., Costa, T.B., Chan, S.H.: Stochastic blockmodel approximation of a graphon: Theory and consistent estimation.
In: Advances in Neural Information Processing Systems (2013) [8] Aldous, D.J.: Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis 11, 581–598 (1981) [9] Aldous, D.J.: Exchangeability and related topics. École d'Été St Flour, Lecture Notes in Mathematics 1117, 1–198 (1985) [10] Bassino, F., Bouvel, M., Féray, V., Gerin, L., Maazoun, M., Pierrot, A.: Universal limits of substitution-closed permutation classes. Journal of the European Mathematical Society 22(11), 3565–3639 (2019) [11] Bassino, F., Bouvel, M., Féray, V., Gerin, L., Pierrot, A.: The Brownian limit of separable permutations. The Annals of Probability 46(4), 2134–2189 (2018) [12] Bassino, F., Bouvel, M., Féray, V., Gerin, L., Maazoun, M., Pierrot, A.: Scaling limits of permutation classes with a finite specification: a dichotomy. arXiv:1903.07522 (2019) [13] Baxter, G.: On fixed points of the composite of commuting functions. Proceedings of the American Mathematical Society 15, 851–855 (1964) [14] Blei, D., Frazier, P.: Distance dependent Chinese restaurant processes. In: International Conference on Machine Learning (2010) [15] Blei, D., Frazier, P.: Distance dependent Chinese restaurant processes. Journal of Machine Learning Research 12, 2461–2488 (2011) [16] Blei, D.M., Jordan, M.I., Griffiths, T.L., Tenenbaum, J.B.: Hierarchical topic models and the nested Chinese restaurant process. In: Advances in Neural Information Processing Systems. pp. 17–24 (2003) [17] Blei, D.M., Griffiths, T.L., Jordan, M.I.: The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57(2) (2010) [18] Borga, J., Maazoun, M.: Scaling and local limits of Baxter permutations and bipolar orientations through coalescent-walk processes. arXiv:2008.09086 (2020) [19] Caldas, J., Kaski, S.: Bayesian biclustering with the plaid model. In: 2008 IEEE Workshop on Machine Learning for Signal Processing. pp.
291–296 (2008) [20] Choi, D.S., Wolfe, P.J.: Co-clustering separately exchangeable network data. Annals of Statistics 42, 29–63 (2014) [21] Fan, X., Li, B., Sisson, S.A.: The binary space partitioning-tree process. In: International Conference on Artificial Intelligence and Statistics. pp. 1859–1867 (2018) [22] Fan, X., Li, B., Luo, L., Sisson, S.A.: Bayesian nonparametric space partitions: A survey. arXiv:2002.11394 (2020) [23] Fan, X., Li, B., Sisson, S.: Rectangular bounding process. In: Advances in Neural Information Processing Systems. pp. 7631–7641 (2018) [24] Fan, X., Li, B., Sisson, S.A.: Online binary space partitioning forests. arXiv:2003.00269 (2020) [25] Fan, X., Li, B., Sisson, S.A.: Binary space partitioning forests. arXiv:1903.09348 (2019) [26] Fan, X., Li, B., Wang, Y., Wang, Y., Chen, F.: The Ostomachion process. In: AAAI Conference on Artificial Intelligence. pp. 1547–1553 (2016) [27] Ferguson, T.: A Bayesian analysis of some nonparametric problems. Annals of Statistics 1(2), 209–230 (1973) [28] de Finetti, B.: Funzione caratteristica di un fenomeno aleatorio. Memorie della R. Accademia Nazionale dei Lincei 4(5), 251–299 (1930) [29] Bassino, F., Bouvel, M., Féray, V., Gerin, L., Maazoun, M., Pierrot, A.: Universal limits of substitution-closed permutation classes. Journal of the European Mathematical Society 22, 3565–3639 (2020) [30] Ge, S., Wang, S., Teh, Y.W., Wang, L., Elliott, L.: Random tessellation forests. In: Advances in Neural Information Processing Systems 32, pp. 9575–9585 (2019) [31] Ghosh, S., Ungureanu, A., Sudderth, E., Blei, D.: Spatial distance dependent Chinese restaurant processes for image segmentation. In: Advances in Neural Information Processing Systems (2011) [32] He, B.D.: A simple optimal binary representation of mosaic floorplans and Baxter permutations.
Theoretical Computer Science 532(1), 40–50 (2014) [33] Hong, X., Huang, G., Cai, Y., Gu, J., Dong, S., Cheng, C., Gu, J.: Corner block list: an effective and efficient topological representation of non-slicing floorplan. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (2000) [34] Hoover, D.N.: Relations on probability spaces and arrays of random variables. Tech. rep., Institute for Advanced Study, Princeton (1979) [35] Hoppen, C., Kohayakawa, Y., Moreira, C.G., Ráth, B., Sampaio, R.M.: Limits of permutation sequences. Journal of Combinatorial Theory, Series B 103(1), 93–113 (2013) [36] Ishiguro, K., Sato, I., Nakano, M., Kimura, A., Ueda, N.: Infinite plaid models for infinite bi-clustering. pp. 1701–1708 (2016) [37] Kallenberg, O.: Symmetries on random arrays and set-indexed processes. Journal of Theoretical Probability 5(4), 727–765 (1992) [38] Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. In: AAAI Conference on Artificial Intelligence. pp. 381–388 (2006) [39] Law, S., Reading, N.: The Hopf algebra of diagonal rectangulations. Journal of Combinatorial Theory, Series A 119(3), 788–824 (2012) [40] Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links in online social networks. In: Proceedings of the 19th International Conference on World Wide Web. pp. 641–650 (2010) [41] Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data (Jun 2014) [42] Lin, D., Fisher, J.: Efficient sampling from combinatorial space via bridging. In: International Conference on Artificial Intelligence and Statistics (2012) [43] Lloyd, J., Orbanz, P., Ghahramani, Z., Roy, D.M.: Random function priors for exchangeable arrays with applications to graphs and relational data. In: Advances in Neural Information Processing Systems (2012) [44] Lovász, L.: Very large graphs.
Current Developments in Mathematics 11, 67–128 (2009) [45] Maazoun, M.: On the Brownian separable permuton. Combinatorics, Probability and Computing 29(2), 241–266 (2019) [46] Miller, K., Jordan, M.I., Griffiths, T.L.: Nonparametric latent feature models for link prediction. In: Advances in Neural Information Processing Systems. pp. 1276–1284 (2009) [47] Nakano, M., Kimura, A., Yamada, T., Ueda, N.: Baxter permutation process. In: Advances in Neural Information Processing Systems (2020) [48] Nakano, M., Ishiguro, K., Kimura, A., Yamada, T., Ueda, N.: Rectangular tiling process. In: Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 361–369 (2014) [49] Orbanz, P., Roy, D.M.: Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 437–461 (2015) [50] Pitman, J., Yor, M.: The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25(2), 855–900 (1997) [51] Rasmussen, C.: The infinite Gaussian mixture model. In: Advances in Neural Information Processing Systems (2000) [52] Reading, N.: Generic rectangulations. European Journal of Combinatorics 33(4), 610–623 (2012) [53] Rodriguez, A., Ghosh, K.: Nested partition models. Tech. rep., Jack Baskin School of Engineering (2009) [54] Roy, D.M.: Computability, inference and modeling in probabilistic programming. Ph.D. thesis, Massachusetts Institute of Technology (2011) [55] Roy, D.M., Teh, Y.W.: The Mondrian process. In: Advances in Neural Information Processing Systems (2009) [56] Saad, F.A., Mansinghka, V.K.: Hierarchical infinite relational model. In: Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA, USA (2021) [57] Sethuraman, J.: A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650 (1994) [58] Shan, H., Banerjee, A.: Bayesian co-clustering.
In: IEEE International Conference on Data Mining. pp. 530–539 (2008) [59] Shen, Z.C., Chu, C.C.N.: Bounds on the number of slicing, mosaic, and general floorplans. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(10), 1354–1361 (2003) [60] Tan, X., Rao, V., Neville, J.: Nested CRP with Hawkes-Gaussian processes. In: Storkey, A., Perez-Cruz, F. (eds.) Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 84, pp. 1289–1298 (2018) [61] Teh, Y.W., Jordan, M.I., Beal, M., Blei, D.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 1566–1581 (2006) [62] Yao, B., Chen, H., Cheng, C.K., Graham, R.L.: Floorplan representations: Complexity and connections. ACM Transactions on Design Automation of Electronic Systems 8(1), 55–80 (2003)

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See Section 1. We have clarified the main contributions of this paper under three headings.
(b) Did you describe the limitations of your work? [Yes] See Section 4. We have simultaneously clarified the technical imperfection of our proposed PCRP and proposed a reasonable way to solve it.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Ethical perspective in the supplementary material.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] Section 4 is an intuitive sketch of the theoretical results. Details are given in the supplementary material.
(b) Did you include complete proofs of all theoretical results? [Yes] See the supplementary material.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See experimental settings in the supplementary material.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See experimental settings in the supplementary material.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] See Relational models in Section 5.
(b) Did you mention the license of the assets? [Yes] See Datasets in Section 5.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] See Datasets in Section 5.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] See Datasets in Section 5.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]