# Provable Convex Co-clustering of Tensors

Journal of Machine Learning Research 21 (2020) 1-58. Submitted 3/18; Revised 2/20; Published 10/20.

Eric C. Chi (eric chi@ncsu.edu), Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
Brian R. Gaines (brian.gaines@sas.com), Advanced Analytics R&D, SAS Institute Inc., Cary, NC 27513, USA
Will Wei Sun (sun244@purdue.edu), Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA
Hua Zhou (huazhou@ucla.edu), Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
Jian Yang (jianyang@oath.com), Advertising Sciences, Yahoo Research, Sunnyvale, CA 94089, USA

Editor: Francis Bach

Abstract

Cluster analysis is a fundamental tool for pattern discovery in complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees, and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising blessing of dimensionality phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.
Keywords: clustering, fused lasso, high-dimensional statistical learning, multiway data, non-asymptotic error

© 2020 Eric C. Chi, Brian R. Gaines, Will Wei Sun, Hua Zhou, and Jian Yang. License: CC-BY 4.0; see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-155.html.

1. Introduction

In this work, we study the problem of finding structure in multiway data, or tensors, via clustering. Tensors appear frequently in modern scientific and business applications involving complex heterogeneous data. For example, data in a neurogenomics study of brain development consists of a 3-way array of expression level measurements indexed by gene, space, and time (Liu et al., 2017). Other examples of 3-way data arrays consisting of matrices collected over time include email communications (sender, recipient, time) (Papalexakis et al., 2013), online chatroom communications (user, keyword, time) (Acar et al., 2006), bike rentals (source station, destination station, time) (Guigourès et al., 2015), and internet network traffic (source IP, destination IP, time) (Sun et al., 2006). The rise in tensor data has created new challenges in making predictions, for example in recommender systems (Zheng et al., 2016; Symeonidis, 2016; Symeonidis and Zioupos, 2016; Frolov and Oseledets, 2017; Bi et al., 2018), as well as in inferring latent structure in multiway data (Acar and Yener, 2009; Anandkumar et al., 2014; Cichocki et al., 2015; Sidiropoulos et al., 2017). As tensors become increasingly common, the need for a reliable co-clustering method grows increasingly urgent. Prevalent clustering methods, however, mainly focus on vector or matrix-variate data. The goal of vector clustering is to identify subgroups within the vector-variate observations (Ma and Zhong, 2008; Shen and Huang, 2010; Shen et al., 2012; Wang et al., 2013).
Biclustering is the extension of clustering to two-way data, where both the observations (rows) and the features (columns) of a data matrix are simultaneously grouped together (Hartigan, 1972; Madeira and Oliveira, 2004; Busygin et al., 2008). In spite of their prevalence, these approaches are not directly applicable to the cluster analysis of general-order (general-way) tensors. On the other hand, existing methods for co-clustering general D-way arrays, for D ≥ 3, employ one of three strategies: (i) extensions of spectral clustering to tensors (Wu et al., 2016b), (ii) directly clustering the subarrays along each dimension, or way, of the tensor using either k-means or variants of it (Jegelka et al., 2009), and (iii) low rank tensor decompositions (Sun et al., 2009; Papalexakis et al., 2013; Zhao et al., 2016). While all these existing approaches may demonstrate good empirical performance, they have limitations. For instance, the spectral co-clustering method proposed by Wu et al. (2016b) is limited to nonnegative tensors, and the CoTeC method proposed by Jegelka et al. (2009), like k-means, requires specifying the number of clusters along each dimension as a tuning parameter. Most importantly, none of the existing methods provide statistical guarantees for recovering an underlying co-clustering structure. There is a conspicuous gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of the non-convex formulations of the previously mentioned works. In this paper, we propose a Convex Co-clustering (CoCo) procedure that solves a convex formulation of the problem of co-clustering a D-way array for D ≥ 3. Our proposed CoCo estimator affords the following advantages over existing tensor co-clustering methods. (i) Under modest assumptions on the data generating process, the CoCo estimator is guaranteed to recover an underlying co-clustering structure with high probability.
In particular, we establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising blessing of dimensionality phenomenon: as the dimensions of the array increase, the CoCo estimator remains consistent even if the number of underlying co-clusters grows as a function of the number of elements in the tensor sample. More importantly, an underlying co-clustering structure can be consistently recovered with even a single tensor sample, which is the typical case in real applications. This phenomenon does not exist in vector or matrix-variate cluster analysis. (ii) The CoCo estimator possesses stability guarantees. In particular, the CoCo estimator is Lipschitz continuous in the data and jointly continuous in the data and its tuning parameter. We emphasize that Lipschitz continuity in the data guarantees that perturbations in the data lead to graceful and commensurate variations in the cluster assignments, and the continuity in the tuning parameter can be leveraged to expedite computation through warm starts. (iii) The CoCo estimator can be iteratively computed with convergence guarantees via an accelerated first order method with storage and per-iteration cost that is linear in the size of the data. In short, the CoCo estimator comes with (i) statistical guarantees, (ii) practically relevant stability guarantees at all sample sizes, and (iii) an algorithm with polynomial complexity. The theoretical properties of our CoCo estimator are supported by extensive simulation studies. To demonstrate its business impact, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to help advertising planning. Our work is related to, but also clearly distinct from, a number of recent developments in cluster analysis.
The first related line of research tackles convex clustering (Hocking et al., 2011; Zhu et al., 2014; Chi and Lange, 2015; Chen et al., 2015; Tan and Witten, 2015; Wang et al., 2018; Radchenko and Mukherjee, 2017) and convex biclustering (Chi et al., 2017). These existing methods are not directly applicable to general-order tensors, however. Importantly, our CoCo estimator enjoys a unique blessing of dimensionality phenomenon that has not been established in the aforementioned approaches. Moreover, the CoCo estimator is similar in spirit to a recent series of works that approximate a noisy observed array with an array that is smooth with respect to some latent organization associated with each dimension of the array (Gavish and Coifman, 2012; Ankenman, 2014; Mishne et al., 2016; Yair et al., 2017). Our proposed CoCo procedure seeks an approximating array that is smooth with respect to a latent clustering along each dimension of the array. While CoCo shares features with these array approximation techniques, namely the use of data-driven similarity graphs along tensor modes, a key distinction between our CoCo estimator and these methods is that CoCo produces an approximating array that explicitly recovers hard co-clustering assignments. As we will see shortly, focusing our attention in this work on the co-clustering model paves the way to the discovery and explicit characterization of new and interesting fundamental behavior in finding intrinsic organization within tensors. The rest of the paper is organized as follows. In Section 2, we review standard facts and results about tensors that we will use. In Section 3, we introduce our convex formulation of the co-clustering problem. In Section 4, we establish the stability properties and prediction error bounds of the CoCo estimator. In Section 5, we describe the algorithm used to compute the CoCo estimator.
In Section 6, we discuss how to specify the weights used in our CoCo estimator, and in Section 7 we give guidance on how to set and select the tuning parameters used in the CoCo estimator in practice. In Section 8, we present simulation results. In Section 9, we discuss the results of applying the CoCo estimator to co-cluster a real data tensor from online advertising. In Section 10, we close with a discussion. The Appendix contains a brief review of the two main tensor decompositions that are discussed in this paper, all technical proofs, as well as additional experiments.

2. Preliminaries

2.1 Notation

We adopt the terminology and notation used by Kolda and Bader (2009). We call the number of ways or modes of a tensor its order. Vectors are tensors of order one and are denoted by boldface lowercase letters, e.g. a. Matrices are tensors of order two and are denoted by boldface capital letters, e.g. A. Tensors of higher order, namely order three and greater, we denote by boldface Euler script letters, e.g. A. Thus, if A represents a D-way data array of size $n_1 \times n_2 \times \cdots \times n_D$, we say A is a tensor of order D. We denote scalars by lowercase letters, e.g. a. We denote the ith element of a vector a by $a_i$, the ijth element of a matrix A by $a_{ij}$, the ijkth element of a third-order tensor A by $a_{ijk}$, and so on. We can extract a subarray of a tensor by fixing a subset of its indices. For example, by fixing the first index of a matrix to be i, we extract the ith row of the matrix, and by fixing the second index of a matrix to be j, we extract the jth column of the matrix. We use a colon to indicate all elements of a mode. Consequently, we denote the ith row of a matrix A by $A_{i:}$ and the jth column of a matrix A by $A_{:j}$. Fibers are the subarrays of a tensor obtained by fixing all but one of its indices. In the case of a matrix, a mode-1 fiber is a matrix column and a mode-2 fiber is a matrix row.
Slices are the two-dimensional subarrays of a tensor obtained by fixing all but two indices. For example, a third-order tensor A has three sets of slices, denoted by $A_{i::}$, $A_{:j:}$, and $A_{::k}$.

2.2 Basic Tensor Operations

It is often convenient to reorder the elements of a D-way array into a matrix or a vector. Reordering a tensor's elements into a matrix is referred to as matricization, while reordering its elements into a vector is referred to as vectorization. There are many ways to reorder a tensor into a matrix or vector. In this paper, we use a canonical mode-d matricization, where the mode-d fibers of a D-way tensor $A \in \mathbb{R}^{n_1\times n_2\times\cdots\times n_D}$ become the columns of a matrix $A_{(d)} \in \mathbb{R}^{n_d \times n_{-d}}$, where $n_{-d} = \prod_{j\neq d} n_j$. Recall that the column-major vectorization of a matrix maps a matrix $A \in \mathbb{R}^{p\times q}$ to the vector $a \in \mathbb{R}^{pq}$ obtained by stacking the columns of A on top of each other, namely $a = (A_{:1}^T, A_{:2}^T, \dots, A_{:q}^T)^T \in \mathbb{R}^{pq}$. In this paper, we take the vectorization of a D-way tensor A, denoted vec(A), to be the column-major vectorization of the mode-1 matricization of A, namely $\mathrm{vec}(A) = \mathrm{vec}(A_{(1)}) \in \mathbb{R}^{n}$, where $n = \prod_d n_d$ is the total number of elements in A. As a shorthand, when the context leaves no ambiguity, we denote this vectorization of a tensor A by its boldface lowercase version a. The Frobenius norm of a D-way tensor $A \in \mathbb{R}^{n_1\times n_2\times\cdots\times n_D}$ is the natural generalization of the Frobenius norm of a matrix, namely it is the square root of the sum of the squares of all its elements, $\|A\|_F = \big(\sum_{i_1=1}^{n_1}\cdots\sum_{i_D=1}^{n_D} a_{i_1 i_2 \cdots i_D}^2\big)^{1/2}$. The Frobenius norm of a tensor is equivalent to the ℓ2-norm of the vectorization of the tensor, namely $\|A\|_F = \|a\|_2$. Let A be a tensor in $\mathbb{R}^{n_1\times n_2\times\cdots\times n_D}$ and B be a matrix in $\mathbb{R}^{m\times n_d}$. The d-mode (matrix) product of the tensor A with the matrix B, denoted $A \times_d B$, is the tensor of size $n_1\times\cdots\times n_{d-1}\times m\times n_{d+1}\times\cdots\times n_D$ whose $(i_1, i_2, \dots, i_{d-1}, j, i_{d+1}, \dots, i_D)$th element is given by $(A \times_d B)_{i_1\cdots i_{d-1}\, j\, i_{d+1}\cdots i_D} = \sum_{i_d=1}^{n_d} a_{i_1 i_2\cdots i_D}\, b_{j i_d}$, for $j \in \{1, \dots, m\}$.
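As an illustrative sketch (not part of the paper), the mode-d matricization and vectorization conventions above can be written in a few lines of NumPy; `order='F'` gives the column-major ordering the definitions assume, and the helper names `unfold` and `vec` are our own:

```python
import numpy as np

def unfold(A, d):
    """Mode-d matricization: mode-d fibers become the columns of A_(d)."""
    return np.moveaxis(A, d, 0).reshape(A.shape[d], -1, order='F')

def vec(A):
    """vec(A) = column-major vectorization of the mode-1 matricization."""
    return unfold(A, 0).reshape(-1, order='F')

A = np.arange(24.0).reshape(2, 3, 4)          # a small 2 x 3 x 4 tensor
A1 = unfold(A, 0)                             # A_(1) is 2 x 12
assert A1.shape == (2, 12)
assert np.array_equal(A1[:, 0], A[:, 0, 0])   # first column is the first mode-1 fiber
a = vec(A)
assert np.allclose(np.linalg.norm(A), np.linalg.norm(a))  # ||A||_F = ||vec(A)||_2
```

The final assertion checks the stated identity that the tensor Frobenius norm equals the ℓ2-norm of the vectorization.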
The vectorization of the d-mode product $A \times_d B$ can be expressed as

$\mathrm{vec}(A \times_d B) = (I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes B \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1})\, a, \quad (1)$

where $I_p$ is the p-by-p identity matrix and $\otimes$ denotes the Kronecker product between two matrices. The identity given in (1) generalizes the well known formula for the column-major vectorization of a product of two matrices, namely $\mathrm{vec}(BA) = (I \otimes B)\, a$.

3. A Convex Formulation of Co-clustering

We first consider a convex formulation of the co-clustering problem when the data is a 3-way tensor $X \in \mathbb{R}^{n_1\times n_2\times n_3}$ before discussing the natural generalization to D-way tensors. Our basic assumption is that the observed data tensor is a noisy realization of an underlying tensor that exhibits a checkerbox structure modulo some unknown reordering along each of its modes. Specifically, suppose that there are $k_1$, $k_2$, and $k_3$ clusters along modes 1, 2, and 3, respectively. If the $(i_1, i_2, i_3)$th entry in X belongs to the cluster defined by the $r_1$th mode-1 group, $r_2$th mode-2 group, and $r_3$th mode-3 group, then we assume that the observed tensor element $x_{i_1 i_2 i_3}$ is given by

$x_{i_1 i_2 i_3} = c^\star_{r_1 r_2 r_3} + \epsilon_{i_1 i_2 i_3}, \quad (2)$

where $c^\star_{r_1 r_2 r_3}$ is the mean of the co-cluster defined by the $r_1$th mode-1 partition, $r_2$th mode-2 partition, and $r_3$th mode-3 partition, and the $\epsilon_{i_1 i_2 i_3}$ are noise terms. We will specify a joint distribution on the noise terms later in Section 4.2 in order to derive prediction bounds. Thus, we model the observed tensor X as the sum of a mean tensor $U \in \mathbb{R}^{n_1\times n_2\times n_3}$, whose elements are expanded from the co-cluster means tensor $C \in \mathbb{R}^{k_1\times k_2\times k_3}$, and a noise tensor $E \in \mathbb{R}^{n_1\times n_2\times n_3}$. We can write this expansion explicitly by introducing a membership matrix $M_d \in \{0,1\}^{n_d\times k_d}$ for the dth mode, where the ikth element of $M_d$ is one if and only if the ith mode-d slice belongs to the kth mode-d cluster for $k \in \{1, \dots, k_d\}$. We require that each row of the membership matrix sum to one, namely $M_d \mathbf{1} = \mathbf{1}$, to ensure that each of the mode-d slices belongs to exactly one of the $k_d$ mode-d clusters. Then, $U = C \times_1 M_1 \times_2 M_2 \times_3 M_3$.
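As a quick illustration (ours, not the paper's), the expansion $U = C \times_1 M_1 \times_2 M_2 \times_3 M_3$ can be written as a single contraction; all sizes, seeds, and the `membership` helper are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 6, 6          # tensor dimensions (illustrative)
k1, k2, k3 = 2, 2, 3          # clusters along each mode (illustrative)

C = rng.normal(size=(k1, k2, k3))   # co-cluster means tensor

def membership(n, k):
    """Random 0/1 membership matrix with exactly one 1 per row (Md 1 = 1)."""
    M = np.zeros((n, k))
    M[np.arange(n), rng.integers(0, k, size=n)] = 1.0
    return M

M1, M2, M3 = membership(n1, k1), membership(n2, k2), membership(n3, k3)

# U = C x_1 M1 x_2 M2 x_3 M3, expressed as one einsum contraction
U = np.einsum('abc,ia,jb,kc->ijk', C, M1, M2, M3)

# entry (i1, i2, i3) equals the mean of the co-cluster it belongs to
i1, i2, i3 = 1, 2, 3
r1, r2, r3 = M1[i1].argmax(), M2[i2].argmax(), M3[i3].argmax()
assert np.isclose(U[i1, i2, i3], C[r1, r2, r3])
```

The assertion verifies the defining property of the checkerbox model: every entry of U is the mean of its assigned co-cluster.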
Figure 1: A 3-way tensor with a checkerbox structure.

Figure 1 illustrates an underlying mean tensor U after permuting the slices along each of the modes to reveal a checkerbox structure. The co-clustering model in (2) is the 3-way analogue of the checkerboard mean model often employed in biclustering data matrices (Madeira and Oliveira, 2004; Tan and Witten, 2014; Chi et al., 2017). Moreover, the tensor C of co-cluster means corresponds to the tensor of cluster centers in the tensor clustering work by Jegelka et al. (2009). The model is complete and exclusive in that each tensor element is assigned to exactly one co-cluster. This is in contrast to models that allow potentially overlapping co-clusters (Lazzeroni and Owen, 2002; Bergmann et al., 2003; Turner et al., 2005; Huang et al., 2008; Witten et al., 2009; Lee et al., 2010; Sill et al., 2011; Bhar et al., 2015). Estimating the model in (2) consists of finding (i) the partitions along each mode and (ii) the mean values of each of the $k_1 k_2 k_3$ co-clusters. Estimating $c^\star_{r_1 r_2 r_3}$ given the mode clustering assignments is trivial. Let $G_1$, $G_2$, and $G_3$ denote the indices of the $r_1$th mode-1, $r_2$th mode-2, and $r_3$th mode-3 groups, respectively. If the noise terms $\epsilon_{i_1 i_2 i_3}$ are iid $N(0, \sigma^2)$ for some positive $\sigma^2$, then the maximum likelihood estimate of $c^\star_{r_1 r_2 r_3}$ is simply the sample mean of the entries of X over the indices defined by $G_1$, $G_2$, and $G_3$, namely

$\hat{c}^\star_{r_1 r_2 r_3} = \frac{1}{|G_1|\,|G_2|\,|G_3|} \sum_{i_1 \in G_1} \sum_{i_2 \in G_2} \sum_{i_3 \in G_3} x_{i_1 i_2 i_3}.$

Finding the partitions $G_1$, $G_2$, and $G_3$, on the other hand, is a combinatorially hard problem. In recent years, however, many combinatorially hard problems that initially appear computationally intractable have been successfully attacked by solving a convex relaxation of the original combinatorial optimization problem.
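A minimal sketch of this sample-mean estimate, under assumed (hypothetical) partitions and a made-up checkerbox mean, might look as follows; `cocluster_mean` is our own helper name:

```python
import numpy as np

rng = np.random.default_rng(1)
n1 = n2 = n3 = 6
# hypothetical known partitions: slices 0-2 form group 0, slices 3-5 group 1
labels = [np.array([0, 0, 0, 1, 1, 1]) for _ in range(3)]
# checkerbox mean: co-cluster (r1, r2, r3) has mean r1 + r2 + r3
U = (labels[0][:, None, None] + labels[1][None, :, None]
     + labels[2][None, None, :]).astype(float)
X = U + 0.1 * rng.normal(size=U.shape)        # noisy observation

def cocluster_mean(X, labels, r1, r2, r3):
    """Sample mean of X over G1 x G2 x G3 -- the MLE under iid Gaussian noise."""
    return X[np.ix_(labels[0] == r1, labels[1] == r2, labels[2] == r3)].mean()

c_hat = cocluster_mean(X, labels, 1, 0, 1)
assert abs(c_hat - 2.0) < 0.1   # true co-cluster mean is 1 + 0 + 1 = 2
```

Averaging over the 27 entries of the block shrinks the noise standard deviation from 0.1 to roughly 0.02, which is why the estimate lands close to the true mean.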
Perhaps the most celebrated convex relaxation is the lasso (Tibshirani, 1996), which simultaneously performs variable selection and parameter estimation for fitting sparse regression models by minimizing a non-smooth convex criterion. In light of the lasso's success, we propose to simultaneously identify partitions along the modes of X and estimate the co-cluster means by minimizing the following convex objective function

$F_\gamma(U) = \frac{1}{2}\|X - U\|_F^2 + \gamma\, [R_1(U) + R_2(U) + R_3(U)], \quad (3)$

where $R(U) = R_1(U) + R_2(U) + R_3(U)$ with fusion penalties $R_1(U) = \sum_{i<j} w_{1,ij}\,\|U_{i::} - U_{j::}\|_F$, $R_2(U) = \sum_{i<j} w_{2,ij}\,\|U_{:i:} - U_{:j:}\|_F$, and $R_3(U) = \sum_{i<j} w_{3,ij}\,\|U_{::i} - U_{::j}\|_F$, and the $w_{d,ij}$ are nonnegative weights. If $w_{1,ij} > w_{1,i'j'}$, then there will be more pressure for $U_{i::}$ and $U_{j::}$ to fuse than for $U_{i'::}$ and $U_{j'::}$ to fuse as γ increases. Thus, the weight $w_{d,ij}$ quantifies the similarity between the ith and jth mode-d slices. A very large $w_{d,ij}$ indicates that the two slices are very similar, while a very small $w_{d,ij}$ indicates that they are very dissimilar. These pairwise similarities motivate a graphical view of clustering. For the dth mode, define the set $E_d$ as the edge set of a similarity graph. Each slice is a node in the graph, and the set $E_d$ contains an edge (i, j) if and only if $w_{d,ij} > 0$.

Figure 2: A graph that summarizes the similarities between pairs of the mode-1 subarrays. Only edges with positive weight are drawn.

Figure 2 shows an example of a mode-1 similarity graph, which corresponds to a tensor with seven mode-1 slices and positive weights that define the edge set $E_1 = \{(1, 2), (2, 3), (4, 5), (4, 6), (6, 7)\}$. Given the connectivity of the graph, as γ increases, the slices $U_{1::}$, $U_{2::}$, and $U_{3::}$ will be shrunk towards each other while the slices $U_{4::}$, $U_{5::}$, $U_{6::}$, and $U_{7::}$ will be shrunk towards each other. Since $w_{d,ij} = 0$ for any $(i, j) \notin E_d$, we can express the penalty term for the dth mode as a sum over the edge set $E_d$; for the first mode, for instance, $R_1(U) = \sum_{(i,j)\in E_1} w_{1,ij}\,\|U_{i::} - U_{j::}\|_F$. The graph in Figure 2 makes it readily apparent that the convex objective in (3) separates over the connected components of the similarity graph for the mode-d slices.
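Because the objective separates over connected components, finding those components is a useful preprocessing step. A small self-contained sketch (ours), using the edge set from Figure 2 and a plain union-find, recovers the two components {1, 2, 3} and {4, 5, 6, 7}:

```python
# mode-1 similarity graph from Figure 2: seven slices, edge set E1 (1-indexed)
edges = [(1, 2), (2, 3), (4, 5), (4, 6), (6, 7)]
n = 7

parent = list(range(n + 1))          # union-find over nodes 1..7

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

for i, j in edges:
    parent[find(i)] = find(j)

components = {}
for node in range(1, n + 1):
    components.setdefault(find(node), set()).add(node)

assert sorted(map(sorted, components.values())) == [[1, 2, 3], [4, 5, 6, 7]]
```

Each component can then be co-clustered independently, exactly as the separability of (3) suggests.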
Consequently, one can solve for the optimal U component by component. Without loss of generality, we assume that the weights are such that all the similarity graphs are connected. Before leaving this preliminary description of the weights, however, we want to emphasize that in practice the weights are set once in a data-adaptive manner and should be considered empirically chosen hyper-parameters rather than tuning parameters. Further discussion of the weights and practical recommendations for specifying them are given in Section 6. Having familiarized ourselves with the convex co-clustering of a 3-way array, we now present the natural extension of (3) for co-clustering a general higher-order tensor $X \in \mathbb{R}^{n_1\times\cdots\times n_D}$ along all its D modes. Let $\Delta_{d,ij} = e_i^T - e_j^T$, where $e_i$ is the ith standard basis vector in $\mathbb{R}^{n_d}$. The objective function of our convex co-clustering for a general higher-order tensor is as follows:

$F_\gamma(U) = \frac{1}{2}\|X - U\|_F^2 + \gamma \sum_{d=1}^{D} \sum_{(i,j)\in E_d} w_{d,ij}\, \|U \times_d \Delta_{d,ij}\|_F. \quad (4)$

The difference between the convex triclustering objective (3) and the general convex co-clustering objective (4) is in the penalty terms. Previously in (3) we penalized the differences between pairs of slices, whereas in (4) we penalize the differences between pairs of mode-d subarrays. Note that the function $F_\gamma(U)$ defined in (4) has a unique global minimizer. This follows immediately from the fact that $F_\gamma(U)$ is strongly convex. The unique global minimizer of $F_\gamma(U)$ is our proposed CoCo estimator, which is denoted by $\hat{U}$ for the remainder of the paper. At times it will be more convenient to work with vectors rather than tensors. By applying the identity in (1), we can rewrite the objective function in (4) in terms of the vectorizations of U and X as follows:

$F_\gamma(u) = \frac{1}{2}\|x - u\|_2^2 + \gamma \sum_{d=1}^{D} \sum_{(i,j)\in E_d} w_{d,ij}\, \|A_{d,ij}\, u\|_2, \quad (5)$

where $A_{d,ij}$ is the $n_{-d}$-by-$n$ matrix

$A_{d,ij} = I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes \Delta_{d,ij} \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1}, \quad (6)$

where $I_{n_d}$ is the $n_d$-by-$n_d$ identity matrix.
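The Kronecker construction in (6) can be checked numerically on a small tensor; the sketch below (ours, with arbitrary sizes and seed) verifies for mode 1 that $\|A_{1,ij}\, u\|_2$ equals the Frobenius norm of the difference of the corresponding mode-1 slices:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, n3 = 3, 4, 5
U = rng.normal(size=(n1, n2, n3))
u = U.reshape(-1, order='F')            # vec(U): mode-1 index varies fastest

i, j = 0, 2                              # compare mode-1 slices i and j (0-indexed)
Delta = np.zeros((1, n1))
Delta[0, i], Delta[0, j] = 1.0, -1.0     # Delta_{1,ij} = e_i^T - e_j^T

# A_{1,ij} = I_{n3} kron I_{n2} kron Delta_{1,ij}, as in (6) with d = 1
A = np.kron(np.eye(n3), np.kron(np.eye(n2), Delta))

lhs = np.linalg.norm(A @ u)              # ||A_{1,ij} u||_2
rhs = np.linalg.norm(U[i] - U[j])        # ||U_{i::} - U_{j::}||_F
assert np.isclose(lhs, rhs)
```

This is exactly the equivalence between the tensor penalty in (4) and the vectorized penalty in (5).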
We will refer to the unique global minimizer of (5), $\hat{u} = \arg\min_u F_\gamma(u)$, as the vectorized version of our CoCo estimator.

Remark 1 The fusion penalties $R_d(U)$ are a composition of the group lasso (Yuan and Lin, 2006) and the fused lasso (Tibshirani et al., 2005), a special case of the generalized lasso (Tibshirani and Taylor, 2011). When only a single mode is being clustered and only one of the terms $R_d(U)$ is employed, we recover the objective function of the convex clustering problem (Pelckmans et al., 2005; She, 2010; Lindsten et al., 2011; Hocking et al., 2011; Sharpnack et al., 2012; Zhu et al., 2014; Chi and Lange, 2015; Radchenko and Mukherjee, 2017). Most prior work on convex clustering employs an element-wise ℓ1-norm penalty on pairwise differences, as in the original fused lasso; however, the ℓ2-norm and ℓ∞-norm have also been considered (Hocking et al., 2011; Chi and Lange, 2015). In this paper, we restrict ourselves to the ℓ2-norm for two reasons. First, the ℓ2-norm is rotationally invariant. In general, we are reluctant to adopt a procedure whose co-clustering output may non-trivially change when the coordinate representation of the data along one of its modes is trivially changed. Second, the ℓ2-norm promotes the group-wise shrinkage of pairwise differences of subarrays along each mode, leading to more straightforward partitioning along each mode: pairwise differences are either exactly zero or not. When the tensor is a matrix and the rows and columns are being simultaneously clustered, we recover the objective function of the convex biclustering problem (Chi et al., 2017). In general, the fusion penalties $R_d(U)$ shrink solutions towards arrays that are piecewise constant over the mode-d similarity graphs defined by the weights $w_{d,ij}$. Viewed this way, we can see our approach as simultaneously performing the network lasso (Hallac et al., 2015) on D similarity graphs.
Remark 2 The CoCo estimator is invariant to permutations of the data tensor X in the following sense. Suppose $\hat{U}$ and $\hat{\tilde{U}}$ are the CoCo estimators when the data tensors are respectively X and $\tilde{X} = X \times_1 \Pi_1 \times_2 \cdots \times_D \Pi_D$, where $\Pi_1 \in \{0,1\}^{n_1\times n_1}, \dots, \Pi_D \in \{0,1\}^{n_D\times n_D}$ are permutation matrices, namely $\Pi_d^T \Pi_d = I$. In words, $\tilde{X}$ can be obtained from X by permuting the subarrays of X along the dth mode according to $\Pi_d$ for $d = 1, \dots, D$, and X can be recovered from $\tilde{X}$ by permuting along the dth mode according to $\Pi_d^T$ for $d = 1, \dots, D$. Since $\|U \times_1 \Pi_1 \times_2 \cdots \times_D \Pi_D\|_F = \|U\|_F$, it follows that $\hat{\tilde{U}} = \hat{U} \times_1 \Pi_1 \times_2 \cdots \times_D \Pi_D$ and $\hat{U} = \hat{\tilde{U}} \times_1 \Pi_1^T \times_2 \cdots \times_D \Pi_D^T$. Permutation invariance is important because it means that the CoCo estimator is essentially unaltered by any reshuffling along the modes of the data tensor.

Remark 3 Given the co-clustering structure assumed in (2), one may wonder how much is added by explicitly seeking a co-clustering over clustering along each mode independently. In other words, why not solve D independent convex clustering problems, taking $R(U) = R_d(U)$ for each mode d in turn? To provide some intuition on why co-clustering should be preferred over independently clustering each mode, consider the following problem. Imagine trying to cluster row vectors $x_i \in \mathbb{R}^{10{,}000}$ for $i = 1, \dots, 100$ drawn from a two-component mixture of Gaussians, namely $x_i \overset{\text{iid}}{\sim} \frac{1}{2} N(\mu, \sigma^2 I) + \frac{1}{2} N(\nu, \sigma^2 I)$. This is a challenging clustering problem due to the disproportionately small number of observations compared to the number of features. If, however, we were told that $\mu_j = \mu_1$ and $\nu_j = \nu_1$ for $j = 1, \dots, 5{,}000$ and $\mu_j = \mu_2$ and $\nu_j = \nu_2$ for $j = 5{,}001, \dots, 10{,}000$, in other words that the features were clustered into two groups, our fortunes would be reversed, and we would have an abundance of observations compared to the number of effective features.
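The thought experiment in Remark 3 can be simulated directly. In this sketch (ours; the specific noise level σ = 3 and the trivial 2-D classification rule are illustrative assumptions), averaging within the two known feature groups collapses 10,000 noisy features into 2 nearly noiseless ones, and row clustering becomes trivial:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 100, 10_000, 3.0
# feature clusters: the mean is constant on each half of the coordinates
mu = np.r_[np.full(p // 2, 1.0), np.full(p // 2, -1.0)]
nu = np.r_[np.full(p // 2, -1.0), np.full(p // 2, 1.0)]
z = rng.integers(0, 2, size=n)                      # latent row labels
X = np.where(z[:, None] == 0, mu, nu) + sigma * rng.normal(size=(n, p))

# averaging within the two feature groups shrinks the noise level by sqrt(p/2)
F = np.column_stack([X[:, : p // 2].mean(axis=1), X[:, p // 2 :].mean(axis=1)])
labels = (F[:, 0] < F[:, 1]).astype(int)            # trivial 2-D clustering rule
assert np.mean(labels == z) > 0.99
```

After averaging, each group mean has noise standard deviation roughly σ/√5000 ≈ 0.04 against a signal separation of 2, so the rows are recovered essentially perfectly.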
Even if we lack a clear-cut clustering structure in the features, this example suggests that leveraging similarity structure along the columns can expedite identifying similarity structure along the rows, and vice versa. Indeed, if there is an underlying checkerbox mean tensor, we may expect that simultaneously clustering along each mode should make the task of clustering along any one given mode easier. Our prediction error result presented in Section 4.2 in fact supports this suspicion (see Remark 10).

4. Properties

We first discuss how the CoCo estimator $\hat{U}$ behaves as a function of the data tensor X, the tuning parameter γ, and the weights $w_{d,ij}$. We then present its statistical properties under mild conditions on the data generating process. We highlight that these properties hold regardless of the algorithm used to minimize (4), as they are intrinsic to its convex formulation. All proofs are given in Appendix B and Appendix C.

4.1 Stability Properties

The CoCo estimator varies smoothly with respect to X, γ, and $\{w_{d,ij}\}$. Let $W_d = \{w_{d,ij}\}$ denote the weights matrix for mode d.

Proposition 4 The minimizer $\hat{U}$ of (4) is jointly continuous in $(X, \gamma, W_1, W_2, \dots, W_D)$.

As noted earlier, in practice we will typically fix the weights $w_{d,ij}$ and compute the CoCo estimator over a grid of the penalization parameter γ in order to select a final CoCo estimator from among the computed candidate estimators of varying levels of smoothness. Since (4) does not admit a closed form minimizer, we resort to iterative algorithms for computing the CoCo estimator. Continuity of $\hat{U}$ in γ can be leveraged to expedite computation through warm starts, namely using the solution $\hat{U}_\gamma$ as the initial guess for iteratively computing $\hat{U}_{\gamma'}$, where $\gamma'$ is slightly larger or smaller than γ. Due to the continuity of $\hat{U}$ in γ, small changes in γ will result in small changes in $\hat{U}$.
Empirically, the use of warm starts can lead to a non-trivial reduction in computation time (Chi and Lange, 2015). From the continuity in γ, we also see that convex co-clustering performs continuous co-clustering just as the lasso (Tibshirani, 1996) performs continuous variable selection. The penalization parameter γ tunes the complexity of the CoCo estimator. Clearly, when γ = 0, the CoCo estimator coincides with the data tensor, namely $\hat{U} = X$. The key to understanding the CoCo estimator's behavior as γ increases is to recognize that the penalty functions $R_d(U)$ are semi-norms. Under suitable conditions on the weights, given in Assumption 4.1 below, $R_d(U)$ vanishes if and only if the mode-d subarrays of U are identical.

Assumption 4.1 For any pair of mode-d subarrays, indexed by i and j with i < j, there exists a sequence of indices $i \to k \to \cdots \to l \to j$ along which the weights $w_{d,ik}, \dots, w_{d,lj}$ are positive.

Proposition 5 Under Assumption 4.1, $R_d(U) = 0$ if and only if $U_{(d)} = \mathbf{1} c^T$ for some $c \in \mathbb{R}^{n_{-d}}$.

To give some intuition for Proposition 5, note that the term $R_d(U)$ separates over the connected components of the mode-d similarity graph. Therefore, the term $R_d(U)$ penalizes variation in the mode-d subarrays over the connected components of the mode-d similarity graph. Assumption 4.1 states that the mode-d similarity graph is connected. Thus, under Assumption 4.1, the only way for $R_d(U)$ to attain its minimum value and vanish is if there is no variation in U along its mode-d subarrays. Proposition 5 suggests that if Assumption 4.1 holds for all $d = 1, \dots, D$, then as γ increases the CoCo estimator converges to the solution of the following constrained optimization problem:

$\min_u\; \frac{1}{2}\|x - u\|_2^2 \quad \text{subject to} \quad u = c\mathbf{1} \text{ for some } c \in \mathbb{R},$

the solution to which is just the global mean $\bar{x}$, whose entries are all identically the average value of x over all its entries. The next result formalizes our intuition that as γ increases, the CoCo estimator will eventually coincide with $\bar{x}$.
Proposition 6 Suppose Assumption 4.1 holds for $d = 1, \dots, D$. Then $F_\gamma(U)$ is minimized by the grand mean $\bar{X}$ for γ sufficiently large.

Thus, as γ increases from 0, the CoCo estimator $\hat{U}$ traces a continuous solution path that starts from n co-clusters, with $u_{i_1\cdots i_D} = x_{i_1\cdots i_D}$, and ends at a single co-cluster, where $u_{i_1\cdots i_D} = x^T \mathbf{1}/n$ for all $i_1, \dots, i_D$. For a fixed γ, we can derive an explicit bound on the sensitivity of the CoCo estimator to perturbations in the data.

Proposition 7 The minimizer $\hat{U}$ of (4) is a nonexpansive, or 1-Lipschitz, function of the data tensor X, namely

$\|\hat{U}(X) - \hat{U}(\tilde{X})\|_F \le \|X - \tilde{X}\|_F.$

Nonexpansivity of $\hat{U}$ in X provides an attractive stability result. Since $\hat{U}$ varies smoothly with the data, small perturbations in the data are guaranteed not to lead to large variability in $\hat{U}$, or consequently to large variability in the cluster assignments. In a special case of our method, Chi et al. (2017) showed empirically that the co-clustering assignments made by the 2-way version of the CoCo estimator were noticeably less sensitive to perturbations in the data than those made by several existing biclustering algorithms.

4.2 Statistical Properties

We next provide a finite sample bound for the prediction error of the CoCo estimator. For simplicity, we consider the case where we take uniform weights within a mode in (5), namely $w_{d,ij} = w_{d,i'j'} = 1/n_d$ for all $i, j, i', j' \in \{1, \dots, n_d\}$. Such a uniform-weight assumption has also been imposed in the analysis of the vector-version of convex clustering (Tan and Witten, 2015). In order to derive the estimation error of $\hat{u}$, we first state an important definition for the noise and then introduce two regularity conditions.
Definition 8 (Vu and Wang (2015)) We say a random vector $y \in \mathbb{R}^n$ is M-concentrated if there are constants $C_1, C_2 > 0$ such that for any convex, 1-Lipschitz function $\varphi : \mathbb{R}^n \to \mathbb{R}$ and any $t > 0$,

$\mathbb{P}\big(|\varphi(y) - \mathbb{E}[\varphi(y)]| \ge t\big) \le C_1 \exp\big(-C_2\, t^2 / M^2\big). \quad (8)$

M-concentration is more general than Gaussianity or sub-Gaussianity, and it allows dependence among the coordinates of y. Vu and Wang (2015) provided a few examples of M-concentrated random vectors. For instance, if the coordinates of y are iid standard Gaussian, then y is 1-concentrated. If the coordinates of y are independent and M-bounded, then y is M-concentrated. If the coordinates of y come from a random walk with certain mixing properties, then y is M-concentrated for some M.

Assumption 4.2 (Model) We assume the true cluster center tensor $C^\star \in \mathbb{R}^{k_1\times\cdots\times k_D}$ has a checkerbox structure such that the mode-d subarrays take $k_d$ different values (the number of clusters along the dth mode), and each entry of $C^\star$ is bounded above by a constant $C_0 > 0$. Define $U^\star \in \mathbb{R}^{n_1\times\cdots\times n_D}$ as the true parameter expanded from $C^\star$, namely $U^\star = C^\star \times_1 M_1 \times_2 M_2 \times_3 \cdots \times_D M_D$, where the $M_d \in \{0,1\}^{n_d\times k_d}$ are binary mode-d cluster membership matrices such that $M_d\mathbf{1} = \mathbf{1}$. Denote $u^\star = \mathrm{vec}(U^\star) \in \mathbb{R}^n$ with $n = \prod_{d=1}^D n_d$. We assume the samples belonging to the $(r_1, \dots, r_D)$th cluster satisfy

$x_{i_1,\dots,i_D} = c^\star_{r_1,\dots,r_D} + \epsilon_{i_1,\dots,i_D},$

with $i_d \in \{1, \dots, n_d\}$ and $r_d \in \{1, \dots, k_d\}$. Furthermore, we assume $\epsilon = \mathrm{vec}(E)$ is an M-concentrated random vector in the sense of (8) with mean zero.

The checkerbox means model in Assumption 4.2 provides the underlying cluster structure of the tensor data. As a special case, Assumption 4.2 with D = 2 reduces to the model assumption underlying convex biclustering (Chi et al., 2017).
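The dimension-free flavor of Definition 8 can be seen empirically. In this sketch (ours), we take $\varphi(y) = \|y\|_2$, which is convex and 1-Lipschitz, and check that for iid standard Gaussian coordinates its fluctuations around the mean stay bounded (around 0.7) no matter how large n is; the trial counts and dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 500

# phi(y) = ||y||_2 is convex and 1-Lipschitz; for a 1-concentrated vector its
# spread around the mean should not grow with the dimension n
spreads = {}
for n in (10, 1000, 10000):
    vals = np.linalg.norm(rng.normal(size=(trials, n)), axis=1)
    spreads[n] = vals.std()

assert all(s < 1.5 for s in spreads.values())          # O(1) fluctuations
assert max(spreads.values()) / min(spreads.values()) < 1.5  # flat across n
```

Even though the mean of $\|y\|_2$ grows like √n, its standard deviation remains roughly constant, which is exactly the concentration behavior the definition captures.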
In contrast to the independent sub-Gaussian condition assumed for the vector version of convex clustering (Tan and Witten, 2015), our error condition is much weaker, since we allow for non-sub-Gaussian distributions as well as dependence among the coordinates.

Assumption 4.3 (Tuning) The tuning parameter $\gamma$ satisfies $\sqrt{D}\,\gamma \le 2 c_0 \sqrt{\log(n)/n}$ for some constant $c_0 > 1$.

Theorem 9 Suppose that Assumption 4.2 and Assumption 4.3 hold. The estimation error of $\hat{u}$ in (5) with uniform weights satisfies
$$\frac{1}{n}\|\hat{u} - u^*\|_2^2 \;\le\; C \sum_{d=1}^{D} \left( \frac{1}{n_d} + \frac{\log(n)}{\sqrt{n\, n_d}} + \log(n)\, n_d \sqrt{\frac{\prod_{j \ne d} k_j}{n}} \right), \qquad (7)$$
with high probability, where $C = 12 c_0 C_0^2$ is a positive constant and $k_d$ is the true number of clusters in the dth mode.

Theorem 9 provides a finite sample error bound for the proposed CoCo tensor estimator. Our theoretical bound allows the number of clusters in each mode to diverge, which reflects a typical large-scale clustering scenario in big tensor data. A notable consequence of Theorem 9 is that, when $D \ge 3$, namely for a higher-order tensor with at least 3 modes, the CoCo estimator can achieve estimation consistency along all D modes even when we only have one tensor sample. Here the sample size refers to the number of available tensor samples; in our tensor clustering problem, we only have access to a single tensor. This property is uniquely enjoyed by co-clustering of tensor data with $D \ge 3$, and has not been previously established in the existing literature on vector clustering or biclustering. To see this, when the $n_d$ are of the same order as $n_0$ and the $k_d$ are of the same order as $k_0$, a sufficient condition for consistency is that $n_0 \to \infty$ and $k_0 = o\big(n_0^{(D-2)/(D-1)}\big)$, up to a log term. When D = 3, the CoCo estimator is consistent so long as the number of clusters $k_0$ in each mode diverges slightly slower than $\sqrt{n_0}$. Remarkably, as we have more modes in the tensor data, this constraint on the rate of divergence of $k_0$ gets weaker.
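The blessing of dimensionality can be checked with back-of-the-envelope arithmetic. The sketch below (the helper name is ours) evaluates the dominant mode-wise term of the Theorem 9 bound, $\log(n)\, n_d \sqrt{\prod_{j\ne d} k_j / n}$, with $n_d = n_0$ and $k_d = k_0$ in every mode; it also tabulates the consistency exponent $(D-2)/(D-1)$, which governs how fast $k_0$ may diverge.

```python
import math

def mode_error_term(n0, k0, D):
    """Dominant mode-wise term of the Theorem 9 bound with n_d = n0 and
    k_d = k0 in every mode: log(n) * n0 * sqrt(k0**(D-1) / n), n = n0**D.
    (Illustrative arithmetic, not a proof.)"""
    n = n0 ** D
    return math.log(n) * n0 * math.sqrt(k0 ** (D - 1) / n)

# The term vanishes as n0 grows iff k0 = o(n0 ** ((D - 2) / (D - 1)))
# up to the log factor: exponent 0 for matrices (D = 2), 1/2 for D = 3,
# 2/3 for D = 4, and so on.
for D in (2, 3, 4):
    print(D, (D - 2) / (D - 1))
```

For fixed $k_0$, the term decreases in $n_0$ when $D = 3$ but increases when $D = 2$, which is the matricization disadvantage discussed in Remark 10.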
In short, we reap a unique and surprisingly welcome blessing of dimensionality in the tensor co-clustering problem.

Remark 10 Next we discuss the connections of our bound (7) with prior results in the literature. An intermediate step in the proof of Theorem 9 indicates that the estimation error in the dth mode is of order
$$\frac{1}{n_d} + \frac{\log(n)}{\sqrt{n\, n_d}} + \log(n) \sqrt{\frac{n_d \prod_{j \ne d} k_j}{n / n_d}}.$$
For clustering along the rows of a data matrix, our rate matches that established for the vector version of convex clustering (Tan and Witten, 2015), up to a log term $\sqrt{\log(n)}$. This log term is due to the fact that Tan and Witten (2015) consider iid sub-Gaussian errors, while we consider general M-concentrated errors. In practice, the iid assumption on the noise $\epsilon = \mathrm{vec}(E)$ could be restrictive. Consequently, our theoretical analysis is built upon a new concentration inequality for quadratic forms recently developed in Vu and Wang (2015). In addition, our rate reveals an interesting theoretical property of the convex biclustering method proposed by Chi et al. (2017). When D = 2, our rate indicates that the estimation errors along the rows and columns of the data matrix are $\log(n_1 n_2)\sqrt{n_1 k_2/n_2}$ and $\log(n_1 n_2)\sqrt{n_2 k_1/n_1}$, respectively. Clearly, both errors cannot converge to zero simultaneously. This indicates a disadvantage of matricizing a data tensor for co-clustering.

5. Estimation Algorithm

We next discuss a simple first-order method for computing the solution to the convex co-clustering problem. The proposed algorithm generalizes the variable splitting approach introduced for the convex clustering problem in Chi and Lange (2015) to the CoCo problem. The key observation is that the Lagrangian dual of an equivalent formulation of the convex co-clustering problem is a constrained least squares problem that can be solved iteratively using the classic projected gradient algorithm.
5.1 A Lagrangian Dual of the CoCo Problem

Recall that we seek to minimize the objective function in (5),
$$F_\gamma(u) = \frac{1}{2}\|x - u\|_2^2 + \gamma \sum_{d=1}^{D} \sum_{l \in E_d} w_{d,l} \|A_{d,l} u\|_2.$$
Note that we have enumerated the edge indices in $E_d$ to simplify the notation in the following derivation. We perform variable splitting and introduce the dummy variables $v_{d,l} = A_{d,l} u$. Let $V_d$ denote the $(n/n_d) \times |E_d|$ matrix whose lth column is $v_{d,l}$. Further denote the vectorization of $V_d$ by $v_d = \mathrm{vec}(V_d)$, and let $v = (v_1^{\mathsf T}, v_2^{\mathsf T}, \ldots, v_D^{\mathsf T})^{\mathsf T}$ denote the vector obtained by stacking the vectors $v_d$ on top of each other. We now solve the equivalent equality constrained minimization
$$\min_{v,u} \; \frac{1}{2}\|x - u\|_2^2 + \gamma \sum_{d=1}^{D} \sum_{l \in E_d} w_{d,l} \|v_{d,l}\|_2 \quad \text{subject to} \quad v_d = A_d u,$$
where $A_d = (I_{n_D} \otimes \cdots \otimes I_{n_{d+1}} \otimes \Phi_d \otimes I_{n_{d-1}} \otimes \cdots \otimes I_{n_1})$ and $\Phi_d$ is the oriented edge-vertex incidence matrix of the dth mode graph, namely
$$(\Phi_d)_{lv} = \begin{cases} 1 & \text{if node } v \text{ is the head of edge } l, \\ -1 & \text{if node } v \text{ is the tail of edge } l, \\ 0 & \text{otherwise.} \end{cases}$$
We introduce dual variables $\lambda_d$ corresponding to the equality constraint $v_d = A_d u$. Let $\Lambda_d$ denote the $(n/n_d) \times |E_d|$ matrix whose lth column is $\lambda_{d,l}$. Further denote the vectorization of $\Lambda_d$ by $\lambda_d = \mathrm{vec}(\Lambda_d)$ and $\lambda = (\lambda_1^{\mathsf T}, \lambda_2^{\mathsf T}, \ldots, \lambda_D^{\mathsf T})^{\mathsf T}$. The Lagrangian dual objective is given by
$$G(\lambda) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|x - A^{\mathsf T}\lambda\|_2^2 - \sum_{d=1}^{D}\sum_{l \in E_d} \iota_{C_{d,l}}(\lambda_{d,l}),$$
where $A = (A_1^{\mathsf T}, A_2^{\mathsf T}, \ldots, A_D^{\mathsf T})^{\mathsf T}$ and $\iota_{C_{d,l}}$ is the indicator function of the closed convex set $C_{d,l} = \{z : \|z\|_2 \le \gamma w_{d,l}\}$, namely the function that vanishes on $C_{d,l}$ and is infinity on its complement. Details on the derivation of the dual objective $G(\lambda)$ are provided in Appendix D. Maximizing the dual objective $G(\lambda)$ is equivalent to solving the following constrained least squares problem:
$$\min_{\lambda \in C} \; \frac{1}{2}\|x - A^{\mathsf T}\lambda\|_2^2, \qquad (8)$$
where $C = \{\lambda : \lambda_{d,l} \in C_{d,l}, \; l \in E_d, \; d = 1, \ldots, D\}$. We can recover the primal solution via the relationship $\hat{u} = x - A^{\mathsf T}\hat{\lambda}$, where $\hat{\lambda}$ is a solution to the dual problem (8). The dual problem (8) has at least one solution by the Weierstrass extreme value theorem, but the solution may not be unique since $A^{\mathsf T}$ has a non-trivial kernel.
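The Kronecker structure of $A_d$ can be verified numerically. The sketch below (our own illustrative helpers, not the paper's code) builds $A_d$ as a dense Kronecker product and checks that it maps $\mathrm{vec}(X)$, with the first index varying fastest, to differences of mode-d subarrays; a practical implementation would never materialize $A_d$ densely.

```python
import numpy as np
from functools import reduce

def incidence(n, edges):
    """Oriented edge-vertex incidence matrix Phi (|E| x n):
    row l has +1 at the head and -1 at the tail of edge l."""
    Phi = np.zeros((len(edges), n))
    for l, (head, tail) in enumerate(edges):
        Phi[l, head], Phi[l, tail] = 1.0, -1.0
    return Phi

def mode_difference_operator(shape, d, edges):
    """A_d = I_{n_D} (x) ... (x) Phi_d (x) ... (x) I_{n_1}, acting on
    vec(X) with the first index varying fastest (Fortran order)."""
    mats = [np.eye(n) for n in shape]
    mats[d] = incidence(shape[d], edges)
    # kron(M_D, kron(..., M_1)) pairs factor k with the k-th index of vec(X)
    return reduce(np.kron, reversed(mats))

# Sanity check: one mode-1 edge (0, 1) on a 3 x 4 x 2 tensor reproduces
# the difference of the first two mode-1 subarrays.
X = np.arange(24, dtype=float).reshape(3, 4, 2)
A1 = mode_difference_operator(X.shape, 0, [(0, 1)])
diff = A1 @ X.flatten(order="F")
```

Here `diff` matches `(X[0] - X[1]).flatten(order="F")`, a vector of length $n/n_1 = 8$, consistent with $V_d$ having $n/n_d$ rows.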
Nonetheless, our CoCo estimator $\hat{u}$ is still unique, since $A^{\mathsf T}\hat{\lambda}_1 = A^{\mathsf T}\hat{\lambda}_2$ for any solutions $\hat{\lambda}_1, \hat{\lambda}_2$ to the problem (8). We numerically solve the constrained least squares problem (8) with the projected gradient algorithm, which alternates between taking a gradient step and projecting onto the set C. Algorithm 1 provides pseudocode for the projected gradient algorithm, which has several good features. The projected gradient algorithm is guaranteed to converge to a global minimizer of (8). Its per-iteration and storage costs under the weight choices described in Section 6 are both O(Dn), namely linear in both the number of modes D and the number of elements n. For a modest additional computational and storage cost, we can accelerate the projected gradient method, for example with FISTA (Beck and Teboulle, 2009) or SpaRSA (Wright et al., 2009). In our experiments, we use a version of the latter, namely FASTA (Goldstein et al., 2014, 2015). Additional details on the derivation of the algorithmic updates, convergence guarantees, computational and storage costs, as well as stopping rules can be found in Appendix E.

6. Specifying Non-Uniform Weights

In Section 4.2, we assumed uniform weights $w_{d,ij}$ in the penalty terms $R_d(U)$ to establish a prediction error bound, which revealed a surprising and beneficial blessing of dimensionality. Although this simplifying assumption gives clarity and insight into how the co-clustering problem gets easier as the number of modes increases, in practice choosing non-uniform weights can substantially improve the quality of the clustering results. In the context of convex clustering, Chen et al. (2015) and Chi and Lange (2015) provided

Algorithm 1 Convex Co-Clustering (CoCo) Estimation Algorithm
  Initialize λ^(0)
  for m = 0, 1, . . . do
    u^(m+1) = x − A^T λ^(m)    ▷ Gradient Step
    for d = 1, . . .
, D do
      for l ∈ E_d do
        λ^(m+1)_{d,l} = P_{C_{d,l}}(λ^(m)_{d,l} + η A_{d,l} u^(m+1))    ▷ Projection Step
      end for
    end for
  end for    ▷ until convergence

Figure 3: Uniform versus non-uniform weights: average Adjusted Rand Index as the number of observations n increases (n = 16³ to 56³). Here n = n₀³ refers to a tensor of size n₀ × n₀ × n₀.

empirical evidence that convex clustering with uniform weights struggled to produce exact sparsity in the pairwise differences of smooth estimates when there was not a strong separation between groups. Indeed, similar phenomena were observed in earlier work on the related clustered lasso (She, 2010). Several related works (She, 2010; Hocking et al., 2011; Chen et al., 2015; Chi and Lange, 2015) recommend the weight assignment strategy described below. In addition, the use of sparse weights can also lead to non-trivial improvements in both computational time and clustering performance (Chi and Lange, 2015; Chi et al., 2017). To illustrate the practical value of non-uniform weights, we compare CoCo's ability to recover co-clusters, using both uniform and non-uniform weights, as the size of a 3-way tensor increases when there are two clusters per mode with balanced cluster sizes along each mode. We assess the quality of the recovered clustering using the Adjusted Rand Index (ARI). The ARI (Hubert and Arabie, 1985) varies between −1 and 1: a value of 1 indicates a perfect match between two clustering assignments, a value close to zero indicates that the two assignments match about as well as would be expected if both were randomly generated, and negative values indicate less agreement than expected from random partitions. Figure 3 shows a comparison between the non-uniform weights described in Section 6.2 and uniform weights. Each plotted point in Figure 3 is the average ARI over 100 replicates.
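Stepping back to Algorithm 1, its updates can be sketched for the one-mode special case (ordinary vector convex clustering) with a fully connected graph and unit weights. This is our own minimal Python illustration, not the paper's Matlab/FASTA implementation, and the function name is hypothetical; it also lets us check the large-γ limit of Proposition 6, where the estimate collapses to the grand mean.

```python
import numpy as np
from itertools import combinations

def coco_one_mode(x, gamma, n_iter=500):
    """Projected gradient on the dual problem (8) for the one-mode
    special case of CoCo (fully connected graph, unit weights w_l = 1).
    x : (n, p) array of n points in R^p; returns u = x - A^T lambda."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    edges = list(combinations(range(n), 2))
    lam = np.zeros((len(edges), x.shape[1]))  # one dual variable per edge
    eta = 1.0 / n      # step 1/||A||^2: lambda_max(A^T A) = n for K_n

    def primal(lam):
        u = x.copy()                          # u = x - A^T lambda
        for l, (i, j) in enumerate(edges):
            u[i] -= lam[l]
            u[j] += lam[l]
        return u

    for _ in range(n_iter):
        u = primal(lam)                       # gradient step uses A u
        for l, (i, j) in enumerate(edges):
            z = lam[l] + eta * (u[i] - u[j])
            nrm = np.linalg.norm(z)
            if nrm > gamma:                   # project onto {||z|| <= gamma}
                z *= gamma / nrm
            lam[l] = z
    return primal(lam)
```

For γ = 0 no fusion occurs and the estimate equals the data; for γ large all points fuse to the grand mean, mirroring the endpoints of the CoCo solution path.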
For CoCo with non-uniform weights, the smoothing parameter γ is chosen with the data-driven extended BIC method detailed in Section 7.1. In contrast, for CoCo with uniform weights, γ is chosen as the value that produces the estimator minimizing the true but unknown MSE. We see that while using uniform weights in CoCo leads to recovering the co-clusters exactly once a sufficient number of samples has been acquired, using non-uniform weights enables CoCo to recover the co-clusters exactly with notably fewer samples. The results of this experiment are especially remarkable because CoCo using non-uniform weights and a data-adaptive choice of γ outperformed CoCo using uniform weights and an ideally chosen oracle value of γ. As in the case of convex clustering, using non-uniform weights can lead to significantly better performance than uniform weights in practice. We give some explanation for why this is expected in Section 6.3, but leave it to future work to develop theory proving this performance improvement. Nonetheless, based on this observation, we employ non-uniform weights in CoCo for the empirical studies presented later in the paper.

6.1 Basic Procedure for Specifying Weights

We first describe our basic two-step procedure for constructing weights before elaborating on the final refinements used in our numerical experiments.

Step 1: We first calculate pre-weights $\tilde{w}_{d,ij}$ between the ith and jth mode-d subarrays as
$$\tilde{w}_{d,ij} = \iota^k_{\{i,j\}} \exp\left(-\tau_d \|X_{(d),i:} - X_{(d),j:}\|_F^2\right). \qquad (9)$$
The first factor on the right hand side of equation (9), $\iota^k_{\{i,j\}}$, is an indicator that equals 1 if the jth slice is among the ith slice's k-nearest neighbors (or vice versa) and 0 otherwise. The purpose of this term is to control the sparsity of the weights. The corresponding tuning parameter k influences the connectivity of the mode-d similarity graph. One can explore different levels of granularity in the clustering by varying k (Chen et al., 2015).
As a default, one can use the smallest k such that the similarity graph is still connected. Note that it is not necessary to calculate the exact k-nearest neighbors, which scales quadratically in the number of fibers in the mode; a fast approximation to the k-nearest neighbors is sufficient for the purpose of inducing sparsity in the weights. Chi and Lange (2015) provided two reasons for using k-nearest-neighbor weights. First, we wish to prioritize fusions between pairs of subarrays that are most similar; the subarrays that are most dissimilar should be the last pair of subarrays to fuse as the smoothing parameter γ increases. Second, we wish to use a sparse similarity graph, as the computational and storage complexity of the estimation algorithm is proportional to the number of non-zero edges in the similarity graphs (Appendix E). Using k-nearest-neighbor weights accomplishes both goals. The second factor on the right hand side of equation (9) is a Gaussian kernel, which takes larger values for pairs of mode-d subarrays that are more similar to each other. Chi and Steinerberger (2019) give a detailed theoretical justification for using weights like the Gaussian kernel weights in the context of convex clustering. For space considerations, we refer readers interested in these technical details to their work and give only a brief intuitive rationale for employing the Gaussian kernel here. Intuitively, the weights should be inversely proportional to the distance between the ith and jth mode-d subarrays (Chen et al., 2015; Chi et al., 2017). The inverse of the nonnegative parameter $\tau_d$ is a measure of scale. In practice, we can set it to be the median Euclidean distance between pairs of mode-d subarrays that are k-nearest neighbors of each other. A value of $\tau_d = 0$ corresponds to uniform weights.
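Step 1 can be sketched as follows. This is our own NumPy illustration (the helper name is hypothetical), and as a simplifying assumption it applies the median heuristic to the *squared* distances among k-nearest-neighbor pairs, since the kernel in (9) is evaluated on squared Frobenius distances.

```python
import numpy as np

def mode_preweights(X, d, k=2):
    """Pre-weights in the spirit of (9) for mode d: a symmetric
    k-nearest-neighbor indicator times a Gaussian kernel on squared
    distances between mode-d slices.  1/tau_d is set to the median
    squared k-NN distance (our variant of the median heuristic)."""
    nd = X.shape[d]
    rows = np.moveaxis(X, d, 0).reshape(nd, -1)      # mode-d unfolding
    D2 = ((rows[:, None, :] - rows[None, :, :]) ** 2).sum(-1)
    order = np.argsort(D2, axis=1)
    knn = np.zeros_like(D2, dtype=bool)
    for i in range(nd):
        knn[i, order[i, 1:k + 1]] = True             # skip self at position 0
    knn |= knn.T                                     # i in kNN(j) or vice versa
    tau = 1.0 / np.median(D2[knn])                   # inverse scale tau_d
    return np.where(knn, np.exp(-tau * D2), 0.0)
```

The returned matrix is symmetric, sparse off the k-NN graph, and bounded by 1, the ideal pre-weight value for subarrays in the same partition.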
Note that, with minor modification, we can make the inverse scale parameter pair-dependent, as described in Zelnik-Manor and Perona (2005).

Step 2: To obtain the mode-d weights $w_{d,ij}$, we normalize the mode-d pre-weights $\tilde{w}_{d,ij}$ to sum to $\sqrt{n_d/n}$. The normalization step puts the penalty terms $R_d(U)$ on the same scale and ensures that clustering along any single mode will not dominate the entire co-clustering as γ increases.

6.2 Improving Weights via the Tucker Decomposition

In our preliminary experiments, we found that substituting a low-rank approximation of X, namely a Tucker decomposition $\widetilde{X}$, in place of X in (9) led to a marked improvement in co-clustering performance. To understand the boost in performance, suppose that $X = U^* + E$, with $U^*$ having a checkerbox structure and the entries of E iid $N(0, \sigma^2)$ for simplicity. Further suppose that the ith and jth mode-d subarrays of $U^*$ belong to the same partition and $\iota^k_{\{i,j\}} = 1$. Then
$$\tilde{w}_{d,ij} = \exp\left(-\tau_d \|E^{(d)}_{ij}\|_F^2\right) = \exp\left(-2\tau_d \sigma^2 Z_{d,ij}\right),$$
where $Z_{d,ij} = \|E^{(d)}_{ij}\|_F^2/(2\sigma^2)$ is distributed as a $\chi^2$ random variable with $n/n_d$ degrees of freedom and $E^{(d)}_{ij}$ denotes the difference between the ith and jth mode-d subarrays of E. If we were able to perfectly denoise the tensor X, so that σ = 0, then the pre-weight $\tilde{w}_{d,ij}$ would attain its maximal value of 1, the ideal value since we have assumed the ith and jth mode-d subarrays belong to the same partition. Thus, if we can reduce σ², namely denoise the observed tensor X, we can approach the ideal pre-weight values. Note that we are more concerned with approaching the ideal pre-weight values for pairs of subarrays that belong to the same partition than for pairs in different partitions, as the Gaussian kernel weights decay very rapidly. The Tucker decomposition is effective at reducing σ² when $U^*$ has a checkerbox pattern, as a checkerbox pattern is a low-rank tensor that can be effectively approximated with the Tucker decomposition. Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition.
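A Tucker approximation of the kind used here can be sketched with a truncated higher-order SVD (HOSVD). This is a simplification on our part (the paper computes Tucker decompositions in Matlab and selects ranks as described in Appendix F); a noiseless checkerbox tensor with $k_d$ clusters per mode has multilinear rank at most $(k_1, \ldots, k_D)$, so the projection below recovers it exactly.

```python
import numpy as np

def unfold(X, d):
    """Mode-d unfolding: slice i along mode d becomes row i."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1)

def hosvd_approx(X, ranks):
    """Rank-(r_1, ..., r_D) Tucker approximation via truncated HOSVD:
    project each mode onto the span of the top-r_d left singular
    vectors of the mode-d unfolding (an illustrative stand-in for the
    Tucker decompositions used in the paper)."""
    factors = []
    for d, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, d), full_matrices=False)
        factors.append(U[:, :r])
    approx = X
    for d, U in enumerate(factors):
        # multiply mode d by the projector U U^T
        approx = np.moveaxis(np.tensordot(U @ U.T, approx, axes=(1, d)), 0, d)
    return approx
```

Plugging `hosvd_approx(X, ranks)` in place of X when computing the pre-weights (9) is the denoising substitution described above.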
In our simulation studies described in Section 8, we use two different methods for choosing the rank as a robustness check, to ensure that our CoCo estimator's performance does not depend crucially on the rank selection method. Details on these two methods can be found in Appendix F. While we found the Tucker decomposition to work well in practice, we suspect that other methods of denoising the tensor may work just as well or could possibly be more effective. We leave it to future work to explore alternatives to the Tucker decomposition.

6.3 Weights and Folded-Concave Penalties

We conclude our discussion of weights by highlighting how they provide a connection between convex clustering and other penalized regression-based clustering methods that use folded-concave penalties (Pan et al., 2013; Xiang et al., 2013; Zhu et al., 2013; Marchetti and Zhou, 2014; Wu et al., 2016a). Suppose we seek to minimize the objective
$$f_\gamma(u) = \frac{1}{2}\|x - u\|_2^2 + \gamma \sum_{d=1}^{D}\sum_{(i,j) \in E_d} \phi_d\left(\|A_{d,ij} u\|_2\right), \qquad (10)$$
where each $\phi_d : [0, \infty) \to [0, \infty)$ has the following properties: (i) $\phi_d$ is concave and differentiable on $(0, \infty)$, (ii) $\phi_d$ vanishes at the origin, and (iii) the directional derivative of $\phi_d$ exists and is positive at the origin. Such $\phi_d$ are collectively referred to as folded-concave penalties; prominent examples include the smoothly clipped absolute deviation (Fan and Li, 2001) and the minimax concave penalty (Zhang, 2010). Since $\phi_d$ is concave and differentiable, for all positive $z$ and $\tilde{z}$,
$$\phi_d(z) \le \phi_d(\tilde{z}) + \phi_d'(\tilde{z})(z - \tilde{z}). \qquad (11)$$
The inequality (11) indicates that the first-order Taylor expansion of a differentiable concave function $\phi_d$ provides a tight global upper bound at the expansion point $\tilde{z}$. Thus, we can construct a function that is a tight upper bound of $f_\gamma(u)$,
$$g_\gamma(u \mid \tilde{u}) = \frac{1}{2}\|x - u\|_2^2 + \gamma \sum_{d=1}^{D}\sum_{(i,j) \in E_d} w_{d,ij} \|A_{d,ij} u\|_2 + c, \qquad (12)$$
where the constant $c$ does not depend on $u$ and the $w_{d,ij}$ are weights that depend on $\tilde{u}$, namely
$$w_{d,ij} = \phi_d'\left(\|A_{d,ij}\,\tilde{u}\|_2\right).$$
(13)

Note that if we take $\tilde{u}$ to be the vectorization of the Tucker approximation of the data, $\mathrm{vec}(\widetilde{X})$, and $\phi_d(z)$ to be the following variation on the error function,
$$\phi_d(z) = \frac{1}{\sqrt{n/n_d}\,\sum_{(i,j) \in E_d} \tilde{w}_{d,ij}} \int_0^{z} e^{-\tau_d \omega^2}\, d\omega,$$
then the function given in (10) coincides with the CoCo objective using the prescribed Tucker-derived Gaussian kernel weights. The function $g_\gamma(u \mid \tilde{u})$ is said to majorize the function $f_\gamma(u)$ at the point $\tilde{u}$ (Lange et al., 2000), and minimizing it corresponds to performing one step of the local linear approximation algorithm (Zou and Li, 2008; Schifano et al., 2010), which is a special case of the majorization-minimization (MM) algorithm (Lange et al., 2000). The corresponding MM algorithm would consist of repeating the following two steps: (i) use a previous CoCo estimate $\widetilde{U}$ to compute weights $w_{d,ij}$ according to (13), and (ii) compute a new CoCo estimate using the new weights. In practice, however, we have found one step to be adequate. Indeed, Zou and Li (2008) showed that the solution of the one-step algorithm is often sufficient in terms of its statistical estimation accuracy.

7. Other Practical Issues

In this section, we address other considerations for using the method in practice, namely how to choose the tuning parameter γ and how to recover the partitions along each mode from the CoCo estimator $\hat{U}$.

7.1 Choosing γ

The first major practical consideration is how to choose γ to produce a final co-clustering result. Since co-clustering is an exploratory method, it may be suitable for a user to manually inspect a sequence of CoCo estimators $\hat{U}_\gamma$ for a range of γ and use domain knowledge tied to a specific application to select a γ that recovers a co-clustering assignment of the desired complexity. Since this approach is time consuming and requires expert knowledge, an automated, data-driven procedure for selecting γ is desirable.
Cross-validation (Stone, 1974; Geisser, 1975) and stability selection (Meinshausen and Bühlmann, 2010) are popular techniques for tuning parameter selection, but since both methods are based on resampling, they are unattractive in the tensor setting due to the computational burden. We instead turn to the extended Bayesian Information Criterion (eBIC) proposed by Chen and Chen (2008, 2012), as it does not rely on resampling and thus is not as computationally costly as cross-validation or stability selection. We define
$$\mathrm{eBIC}(\gamma) = n \log\left(\mathrm{RSS}_\gamma\right) + 2\,\mathrm{df}_\gamma \log(n),$$
where $\mathrm{RSS}_\gamma$ is the residual sum of squares $\|X - \hat{U}_\gamma\|_F^2$ and $\mathrm{df}_\gamma$ is the degrees of freedom for a particular value of γ. We use the number of co-clusters in the CoCo estimator $\hat{U}_\gamma$ as an estimate of $\mathrm{df}_\gamma$, which is consistent with the spirit of degrees of freedom, since each co-cluster mean is an estimated parameter. This criterion balances model fit against model complexity, and similar versions have been commonly employed for tuning parameter selection in tensor data analysis (Zhou et al., 2013; Sun et al., 2017). The eBIC is calculated on a grid of values $S = \{\gamma_1, \gamma_2, \ldots, \gamma_s\}$, and we select the value of γ, denoted $\gamma^\star$, that minimizes the eBIC over S, namely $\gamma^\star = \arg\min_{\gamma \in S} \mathrm{eBIC}(\gamma)$.

7.2 Recovering the Partitions along Each Mode

The second major practical consideration is how to extract the partitions from the CoCo estimator $\hat{U}$. Recall that the ith and jth mode-d subtensors belong to the same partition if $v_{d,ij} = \hat{U}^{(d)}_{ij} = 0$; conversely, they do not belong to the same partition if $v_{d,ij} \ne 0$. Thus, a mode-d partition consists of a maximal set of mode-d subarrays such that $v_{d,ij} = 0$ for every pair i and j in the set. We can automatically identify these maximal sets by extending a simple procedure employed by Chi and Lange (2015) for extracting clusters in the convex clustering problem.
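Returning to the eBIC rule above, the selection step can be sketched as follows (our illustrative helper; the co-cluster count is approximated by the number of distinct values in the estimate, a rough proxy that presumes exactly fused co-cluster means).

```python
import numpy as np

def select_gamma(X, fits, gammas):
    """Pick gamma minimizing eBIC(gamma) = n log(RSS) + 2 df log(n),
    where df is estimated by the number of distinct co-cluster means
    in U_hat.  `fits` lists CoCo estimates U_hat_gamma, one per gamma
    (illustrative sketch, not the paper's implementation)."""
    n = X.size
    scores = []
    for U in fits:
        rss = np.sum((X - U) ** 2)                 # ||X - U_hat||_F^2
        df = len(np.unique(np.round(U, 8)))        # co-cluster count proxy
        scores.append(n * np.log(rss) + 2 * df * np.log(n))
    return gammas[int(np.argmin(scores))]
```

Given a grand-mean fit (df = 1, large RSS) and a nearly exact checkerbox fit (small df, tiny RSS), the criterion favors the latter, trading fit against complexity as described above.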
Identifying the partitions along the dth mode is equivalent to finding the connected components of a graph in which each node corresponds to a subarray along the dth mode and there is an edge between nodes i and j if and only if $v_{d,ij} = 0$. We would like to read off which centroids have fused as the amount of regularization increases, namely determine the partition assignments as a function of γ. Such assignments can be computed in $O(n_d)$ operations using the difference variables $V_d$: we simply apply breadth-first search to identify the connected components of the graph induced by $V_d$, which identifies a node with every mode-d subarray and places an edge between the lth pair of subarrays if and only if $v_l = 0$. Each connected component corresponds to a partition. Note that the graph constructed to determine the partitions is not the same as the graph described in Section 3, with illustrative examples in Figure 2. We emphasize that the recovered partition along each mode does not depend on the ordering of the input data X, since it is based on the pairwise differences along each mode, namely $V_d$ for d = 1, . . . , D. Finally, we note that due to finite precision limitations, the difference variables $v_{d,ij}$ will likely not be exactly 0. In Appendix E.4, we detail a simple and principled procedure for ensuring sparsity in these difference variables.

8. Simulation Studies

To investigate the performance of the CoCo estimator in identifying co-clusters in tensor data, we first explore some simulated examples. We compare our CoCo estimator to a k-means based approach that is representative of the various tensor generalizations of spectral clustering common in the tensor clustering literature (Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Wu et al., 2016b). We refer to this method as CPD+k-means.
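The partition-recovery step just described can be sketched as a breadth-first search over the fusion graph (our own minimal implementation, with a tolerance standing in for the exact-sparsity procedure of Appendix E.4):

```python
from collections import deque

def mode_partitions(nd, edges, v_norms, tol=1e-8):
    """Recover mode-d partitions: nodes i and j are joined when the
    difference variable v_{d,ij} is numerically zero; clusters are the
    connected components, found by breadth-first search.

    edges   : list of (i, j) pairs indexing the difference variables
    v_norms : ||v_{d,ij}||_2 for each edge
    Returns a list of component labels, one per mode-d subarray."""
    adj = [[] for _ in range(nd)]
    for (i, j), nrm in zip(edges, v_norms):
        if nrm <= tol:               # fused pair -> edge in the graph
            adj[i].append(j)
            adj[j].append(i)
    label, clusters = [-1] * nd, 0
    for s in range(nd):
        if label[s] != -1:
            continue
        label[s] = clusters
        q = deque([s])
        while q:                     # BFS over one component
            for nb in adj[q.popleft()]:
                if label[nb] == -1:
                    label[nb] = clusters
                    q.append(nb)
        clusters += 1
    return label
```

Because the labels are derived from pairwise differences only, reordering the subarrays permutes but does not change the recovered partition, as noted above.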
The CPD+k-means method (Papalexakis et al., 2013; Sun and Li, 2019) first performs a rank-R CP decomposition of the D-way tensor X to reduce the dimensionality of the problem, and then independently applies k-means clustering to the rows of each of the D factor matrices from the resulting CP decomposition. The k-means algorithm has also been used to cluster the factor matrices resulting from a Tucker decomposition (Acar et al., 2006; Sun et al., 2006; Kolda and Sun, 2008; Sun et al., 2009; Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Cao et al., 2015; Oh et al., 2017). We also considered this Tucker+k-means method in initial experiments, but its co-clustering performance was inferior to that of CPD+k-means, so we only report co-clustering performance results for CPD+k-means in the comparison experiments that follow. Note, however, that we still use the Tucker decomposition to compute the CoCo weights $w_{d,ij}$ as described in Section 6. Both CoCo and CPD+k-means account for the multiway structure of the data. To assess the importance of accounting for this structure, we also include comparisons with the CoTeC method (Jegelka et al., 2009), which applies k-means clustering along each mode and does not account for the multiway structure of the data. All methods being compared have tuning parameters that need to be set. For the rank of the CP decomposition needed by CPD+k-means, we consider R ∈ {2, 3, 4, 5} and use the tuning procedure in Sun et al. (2017) to select the rank automatically. A CP decomposition is then performed using the chosen rank, and its factor matrices are the input to the k-means algorithm. A well-known drawback of k-means is that the number of clusters k needs to be specified a priori. Several methods for selecting k have been proposed in the literature, and we use the gap statistic developed by Tibshirani et al. (2001) to select an optimal k from the specified possible values.
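All of the comparisons below are scored with the Adjusted Rand Index. A minimal implementation from the contingency table (Hubert and Arabie, 1985) follows; the helper name is ours, and this is a plain NumPy sketch rather than the paper's Matlab scoring code.

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two label vectors: 1 for identical
    partitions, approximately 0 for chance-level agreement, and
    negative when agreement is below chance."""
    a, b = np.asarray(a), np.asarray(b)
    ca, cb = np.unique(a), np.unique(b)
    # contingency table n_xy of co-occurring labels
    table = np.array([[np.sum((a == x) & (b == y)) for y in cb] for x in ca])
    comb2 = lambda m: m * (m - 1) / 2.0          # "m choose 2"
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

The index is invariant to relabeling, so two clusterings that induce the same partition under different label names score exactly 1.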
Since CoCo estimates an entire solution path of mode-clustering results, ranging from $n_d$ clusters to a single cluster along mode d, we consider a rather large set of possible k values to make the methods more comparable. Appendix G gives a more detailed description of the CPD+k-means procedure and the selection of its tuning parameters. CoTeC, which applies k-means clustering along each mode independently, also requires specifying the number of clusters along each mode; as in CPD+k-means, we select this parameter along each mode using the gap statistic. As described in Section 6, we employ a Tucker approximation of the data tensor in constructing the weights $w_{d,ij}$. In computing the Tucker decomposition we used one of two methods for selecting the rank. In the plots within this section, TD1 denotes results where the Tucker rank was chosen using the SCORE algorithm (Yokota et al., 2017), while TD2 denotes results where the rank was chosen using a heuristic. A detailed discussion of these two methods is in Appendix F. The results presented in this section report the average CoCo estimator performance, quantified by the ARI, across 200 simulated replicates. All simulations were performed in Matlab using the Tensor Toolbox (Bader et al., 2015). All of the following plots, except the heatmaps in Figure 13, were made using the open source R package ggplot2 (Wickham, 2009).

Figure 4: Checkerbox Simulation Results: Impact of Noise Level. Two balanced clusters per mode across different levels of homoskedastic noise (σ = 1, . . . , 10) for n1 = n2 = n3 = 60. For each method, the confidence interval is the mean plus or minus one standard error.

8.1 Cubical Tensors, Checkerbox Pattern

For the first and main simulation setting, we study clustering of data in a cubical tensor generated by a basic checkerbox mean model according to Assumption 4.2.
Each entry of the observed data tensor is generated according to the underlying model (2), with independent errors $\epsilon_{i_1 i_2 i_3} \sim N(0, \sigma^2_{r_1 r_2 r_3})$. Unless specified otherwise, there are two true clusters along each mode, for a total of eight underlying co-clusters.

8.1.1 Balanced Cluster Sizes and Homoskedastic Noise

To get an initial feel for how the different co-clustering methods perform at recovering the true underlying checkerbox structure, we first consider a situation where the clusters corresponding to the two classes along each mode are all equally sized, or balanced, and share the same error variance, namely $\sigma_{r_1 r_2 r_3} = \sigma$ for all $r_1, r_2, r_3$.

Figure 5: Timing Results: Balanced Cluster Sizes and Homoskedastic Noise. Two balanced clusters per mode with a fixed level of homoskedastic noise for n1 = n2 = n3 = 20, 30, 60, and 100. Vertical and horizontal axes are on a log scale.

The average co-clustering performance for this setting in a tensor with dimensions n1 = n2 = n3 = 60 is given in Figure 4 for different noise levels. Figure 4 shows that all three methods perform well when the noise level is low (σ = 1). As the noise level increases, however, CPD+k-means experiences an immediate and noticeable drop-off in performance. CoTeC's performance decays even more rapidly, highlighting the importance of accounting for multiway structure. The CoCo estimator, on the other hand, is able to maintain near-perfect performance until the noise level becomes rather high (σ = 8). Figure 5 shows how the run times of CoCo and CPD+k-means vary as the size of a cubical tensor, $n = n_1 n_2 n_3$ with $n_1 = n_2 = n_3$, takes on the values $20^3$, $30^3$, $60^3$, and $100^3$. These run times include all computations needed to fit and select a final model. For CoCo, a sequence of models was fit over a grid of γ parameters, and a final γ was chosen using the eBIC.
For CPD+k-means, a sequence of models was fit over a grid of possible $(k_1, k_2, k_3)$ parameters corresponding to the three factor matrices, and a final triple $(k_1, k_2, k_3)$ was chosen using the gap statistic. Timing comparisons were performed on a 3.2 GHz quad-core Intel Core i5 processor with 8 GB of RAM. The run time for CoCo scales linearly in the size of the data tensor as expected, namely proportionately with $n_1^3$. Nonetheless, as might also be expected, the clustering performance enjoyed by CoCo does not come for free, and the simpler but less reliable CPD+k-means algorithm enjoys better scaling as the tensor size grows. Timing results were similar for the following experiments and are omitted for space considerations.

8.1.2 Imbalanced Cluster Sizes

When comparing clustering methods, one factor of interest is the extent to which the relative sizes of the clusters impact clustering performance. To investigate this, we again use a cubical tensor of size n1 = n2 = n3 = 60 but introduce different levels of cluster size imbalance along each mode, which we quantify via the ratio $n_{d2}/n_d$ of the number of samples in cluster 2 of mode d to the total number of samples along mode d, for d = 1, 2, 3.

Figure 6: Checkerbox Simulation Results: Impact of Cluster Size Imbalance. Two imbalanced clusters per mode (size ratios $n_{d2}/n_d$ from 0.5 down to 0.1) with either (a) low or (b) high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

Figure 6a shows that when the noise level is low, CPD+k-means is unaffected by the imbalance until the size of cluster 2 is less than 30% of the mode's length.
At this point, the performance of CPD+k-means drops off significantly, and it performs as well as a random clustering assignment when the sizes are highly skewed ($n_{d2}/n_d = 0.1$). The CoCo estimator is more or less invariant to the imbalance, and its performance is almost perfect across all levels of cluster size imbalance. Figure 6b shows that the CoCo estimator exhibits a slight deterioration in performance only when the cluster size ratio is 0.1 in the high noise case. In both the low and high noise scenarios, CoTeC performs poorly.

8.1.3 Heteroskedastic Noise

Another factor of interest is how the clustering methods perform when there is heteroskedasticity in the variability of the two classes. Figure 7 displays the co-clustering performance for different degrees of heteroskedasticity, as measured by the standard deviation of class 2 relative to that of class 1, $\sigma_2/\sigma_1$. In the low noise setting, the CoCo estimator is immune to the heteroskedasticity until the noise levels differ by a factor of 4. CPD+k-means, in contrast, is very sensitive to deviations from homoskedasticity, experiencing a decline even when the noise ratio increases from 1 to only 1.5. The CoCo estimator fares worse in the high noise setting and also has a drop in performance with a small deviation from homoskedasticity. Once class 2's standard deviation is more than double the standard deviation of class 1, all three methods are essentially the same as random clustering. This result is not terribly surprising since, in the high noise setting, this would give one class a very high standard deviation of $\sigma_2 = 12$. In both the low and high noise scenarios, CoTeC performs poorly.
[Figure 7: Checkerbox Simulation Results: Impact of Heteroskedasticity. Panels (a) low noise and (b) high noise plot the adjusted Rand index against the noise ratio σ2/σ1 for CoCo TD1, CoCo TD2, CPD+k-means, and CoTeC. Two balanced clusters per mode with either low or high heteroskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ1 = 3 while high noise refers to σ1 = 6.]

8.1.4 Different Clustering Structures

So far, we have considered only a simple situation where there are exactly two true clusters along each mode, for a total of eight triclusters. Another factor of practical importance is how the clustering methods perform when there are more than two clusters per mode, and also when the number of clusters along each mode differs. We investigate both of these settings in this section. As before, the tensor is a perfect cube with n1 = n2 = n3 = 60 observations along each mode and an underlying checkerbox pattern. To gauge the performance, we again focus attention on how the methods perform in the presence of both low and high noise.

[Figure 8: Checkerbox Simulation Results: Impact of Clustering Structure. Adjusted Rand index for CoCo TD1, CoCo TD2, CPD+k-means, and CoTeC with (3, 3, 3) and (2, 3, 4) balanced clusters per mode under low and high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.]

The first situation studied is one in which there are three true clusters along each mode, resulting in a total of 27 triclusters. The left-hand side of the graphs in Figure 8 shows the results from this simulation setting. The graphs show that the CoCo estimator consistently outperforms CPD+k-means and CoTeC in this setting across both noise levels.
The CoCo estimator is able to recover the true co-clusters almost perfectly, while CPD+k-means struggles to handle the increased number of clusters per mode. We also investigated the clustering performance when the number of clusters per mode varies. In this setting, there are two, three, and four clusters along modes one, two, and three, respectively. From the right-hand side of the graphs in Figure 8, we can see that the results are similar to the situation with three clusters per mode. CPD+k-means again performs very poorly across both noise levels, while convex co-clustering is again able to essentially recover the true co-clustering structure. Compared to the setting with three clusters per mode, CPD+k-means performs slightly worse in the face of a more complex clustering structure, while convex co-clustering takes it in stride. These results bode well for convex co-clustering, as the basic clustering structure of only two clusters per mode is unlikely to be observed in practice.

8.2 Rectangular Tensors

Up to this point, to get an initial feel for CoCo's performance, we restricted our attention to cubical tensors with the same number of observations per mode so as to avoid changing too many factors at once. It is unlikely that the data tensor at hand will be a perfect cube, however, so it is important to understand the clustering performance when the methods are applied to rectangular tensors. We now turn to clustering a rectangular tensor with one short mode and two longer modes. Two additional simulations involving rectangular tensors can be found in Appendix H. Figure 9 shows that CoCo performs very well, and better than CPD+k-means and CoTeC, at the lower noise level (σ = 3) but has a sharp decrease in ARI at the higher noise level (σ = 4). The decline is more pronounced for the longer modes (Figure 9b and Figure 9c), as the short mode (Figure 9a) is still able to maintain perfect performance despite the increase in noise.
This is not surprising, since the shorter mode effectively has more samples. Moreover, we see the blessing of dimensionality at work: when the number of samples along the short mode is doubled (n1 = 20, n2 = n3 = 50), the performance along the two longer modes improves drastically in the high noise setting. We finally note that, along the shorter mode, the heuristic for determining the rank of the Tucker decomposition used in calculating the weights performs better than the SCORE algorithm, though along modes 1 and 2 the co-clustering performance is ultimately comparable. This may indicate that the SCORE algorithm struggles to correctly identify the optimal Tucker rank for short modes in the presence of relatively higher noise, while the heuristic is more immune to the noise level as it is based simply on the dimensions of the tensor.

[Figure 9: Checkerbox Simulation Results: Impact of Tensor Shape. Adjusted Rand index for (a) mode 1, (b) mode 2, (c) mode 3, and (d) the triclusters, for CoCo TD1, CoCo TD2, CPD+k-means, and CoTeC with σ ∈ {3, 4} and (n1, n2, n3) ∈ {(10, 50, 50), (20, 50, 50)}. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with one short mode and two longer modes. Average adjusted Rand index plus/minus one standard error for different noise levels and mode lengths.]

8.3 CANDECOMP/PARAFAC Model

In Section 8.1, we saw that the CoCo estimator performs well, and typically better than CPD+k-means, when clustering tensors whose co-clusters have an underlying checkerbox pattern.
To evaluate the performance of our CoCo estimator under model misspecification, we consider a generative model given by the following CP decomposition. We first construct a factor matrix A ∈ R^{80×2} and form the rank-2 CP mean tensor

U = Σ_{i=1}^{2} a_i ∘ a_i ∘ a_i,

where ∘ denotes the outer product and a_i is the ith column of A. We then added varying levels of Gaussian noise to U to generate the observed data tensor. We consider two different types of factor matrices. As shown in Figure 10, one shape consists of two half-moon clusters (Hocking et al., 2011; Chi and Lange, 2015; Tan and Witten, 2015) while the other shape contains a bullseye, similar to the two-circles shape studied by Ng et al. (2002) and Tan and Witten (2015). In either case, the triangles in Figure 10 correspond to the first 40 rows of A, whereas the circles correspond to the second 40 rows of A. Note that this data generating mechanism should favor the CPD+k-means method.

[Figure 10: Factor Matrices for the CP Models: (a) Bullseye, (b) Half Moons.]

Figure 11 shows the simulation results for using the CP model with these two nonconvex shapes generating the data. The discrepancy in performance between the CoCo estimator and the other two methods is quite large. The CoCo estimator almost perfectly identifies the true co-clusters. In contrast, both CPD+k-means and CoTeC perform very poorly, even when the noise variance is small. The poor performance of CPD+k-means and CoTeC is not completely surprising, as others have noted the difficulty that k-means methods have in recovering non-convex clusters (Ng et al., 2002; Hocking et al., 2011; Tan and Witten, 2015). These results give us some assurance that the CoCo estimator is still able to perform well even under some model misspecification, since the true co-clusters do not have a checkerbox pattern.
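The rank-2 construction above is straightforward to reproduce. A minimal sketch follows, in which the random A is merely a stand-in for the half-moon or bullseye factor matrices of Figure 10, and the noise level sigma is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(80, 2))       # stand-in for the 80 x 2 factor matrix

# Rank-2 CP mean tensor U = sum_{i=1}^{2} a_i o a_i o a_i (o = outer product)
U = np.zeros((80, 80, 80))
for i in range(A.shape[1]):
    a = A[:, i]
    U += np.einsum("p,q,r->pqr", a, a, a)

sigma = 0.5                        # illustrative noise level
X = U + sigma * rng.normal(size=U.shape)
```

Each `einsum` call materializes one rank-one term a_i ∘ a_i ∘ a_i, so entry (p, q, r) of U equals the sum over i of A[p, i] A[q, i] A[r, i].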
8.4 Comparison with Convex Biclustering

It is natural to ask how much additional gain there is in using CoCo over convex biclustering (Chi et al., 2017) applied to the matricizations of a data tensor. To answer this question, we compare CoCo to the following strategy for applying convex biclustering to estimate co-clusters. We explain the strategy for a 3-way tensor; the generalization to D-way tensors is straightforward. We first matricize the tensor X along mode-1 to obtain the matrix X(1), apply convex biclustering to X(1), and retain the mode-1 clustering results. Note that the mode-2 and mode-3 fibers have been mixed together through the matricization process. We then repeat this two-step procedure for mode-2 and mode-3. The final co-cluster estimates are obtained by taking the cross-products of the mode-1, mode-2, and mode-3 cluster assignments. We consider two illustrative scenarios to understand the value of preserving the full multiway structure with CoCo: a balanced case and an imbalanced case.

[Figure 11: CP Model Simulation Results. Adjusted Rand index for CoCo TD1, CoCo TD2, CPD+k-means, and CoTeC under the Bullseye and Half Moons data generation. Two balanced clusters per mode with low homoskedastic noise for n1 = n2 = n3 = 40. Bullseye and Half Moons refer to the shape embedded in the factor matrices used to generate the true tensor.]

[Figure 12: A Comparison between CoCo and Convex Biclustering. Average adjusted Rand index plus/minus one standard error versus the noise level σ for CoCo TD1, CoCo TD2, Biclust TD1, and Biclust TD2 in (a) the balanced case and (b) the imbalanced case.]

In the balanced
case, we have a 3-way data tensor X ∈ R^{60×60×60} with two clusters along each mode, where the clusters are of equal size and homoskedastic iid Gaussian noise has been added to all elements of the tensor. This scenario is similar to the one shown in Figure 4. In the imbalanced case, we have a 3-way data tensor X ∈ R^{30×40×80}. There are two clusters along mode-1 of sizes 10 and 20, three clusters along mode-2 of sizes 8, 12, and 20, and four clusters along mode-3 of sizes 5, 10, 20, and 45. Homoskedastic iid Gaussian noise has been added to all elements of the tensor. Finally, we note that the empirical performance of convex biclustering, like that of CoCo, depends on choosing good weights for the rows and columns of the input data matrix (Chi et al., 2017). To create a fair comparison, we construct convex biclustering weights based off of the same TD1 and TD2 denoising procedures used for CoCo, putting the preprocessing for both methods on equal footing. Figure 12a and Figure 12b show the co-clustering performance of CoCo and the convex biclustering method in the balanced and imbalanced cases, respectively. We see that in the balanced case, CoCo's performance is marginally better than that of the convex biclustering method. On the other hand, we see that in the imbalanced case, CoCo's performance degrades more gracefully than that of the convex biclustering method as the noise level increases. The example illustrates that CoCo has better co-cluster recovery when there is more imbalance in the data tensor, that is, when the aspect ratios of the tensor dimensions are more skewed and the number of clusters and the cluster sizes are more heterogeneous. The key formulation difference between CoCo and the convex biclustering method that provides some insight into these two results is that CoCo imposes a finer level of smoothness that respects the multiway structure in the data tensor.
Imposing such a finer level of smoothness imparts greater robustness, in the presence of increasing noise, for recovering the smaller co-clusters in the imbalanced scenario. An added incentive for using CoCo and preserving the multiway structure in the data is that the gains in co-cluster recovery over the convex biclustering method do not come at a greater computational cost. Note that the computational complexity of convex biclustering is O(n), using sparse weights for the row and column similarity graphs. For a D-way tensor, the computational complexity then becomes O(Dn), which is the same as the computational complexity of CoCo applied directly to the D-way tensor. To summarize, in comparison to the convex biclustering method, CoCo (i) does not incur additional computational costs, (ii) can recover the underlying co-clustering structure in imbalanced scenarios, which are more likely to be encountered in practice, and (iii) has the ability to consistently recover an underlying co-clustering structure according to Theorem 9, even with a single tensor sample, which is a typical case in real applications. Since this phenomenon does not exist in vector or matrix-variate cluster analysis, the convex biclustering method lacks this theoretical guarantee.

9. Real Data Application

Having studied the performance of the CoCo estimator in a variety of simulated settings, we now turn to using the CoCo estimator on a real data set. The proprietary data set comes from a major online company and contains the click-through rates for advertisements displayed on the company's webpages from May 19, 2016 through June 15, 2016. The click-through rate is the number of times a user clicks on a specific advertisement divided by the number of times the advertisement was displayed. The data set contains information on 1000 users, 189 advertisements, 19 publishers, and 2 different devices, aggregated across time.
Thus, the data form a fourth-order tensor where each entry corresponds to the click-through rate for the given combination of user, advertisement, publisher, and device. Here a publisher refers to a different webpage within the online company's website, such as the main home page versus a page devoted to either breaking news or sports scores. The two device types correspond to how the user accessed the page, using either a personal computer or a mobile device such as a cell phone or tablet computer. The goal in this real application is to simultaneously cluster users, advertisements, and publishers to improve user behavior targeting and advertising planning. In the click-through rate tensor data, over 99% of the values are missing, since one user has likely seen only a handful of the possible advertisements. If a specific advertisement is never seen by a user, it is considered a missing value. Since the proposed CoCo estimator can only handle complete data, we first preprocess the data by imputing the missing values before any clustering can be done. To impute the missing entries, we use the CP-based tensor completion method of Jain and Oh (2014) and tune its rank via the information criterion proposed by Sun et al. (2017). This tuning method chooses the optimal rank R = 20 from the rank list {1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22}. Finally, the imputed values are truncated to ensure that all values of the tensor are between 0 and 1, since click-through rates are proportions. One mode of the fourth-order tensor has only two observations, and those observations already have a natural grouping (device type). Therefore, for the sake of clustering, we analyze the devices separately. We compare our method with CPD+k-means.
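The truncation step described above is simple to state in code. The tensor below is a synthetic stand-in for the imputed click-through-rate tensor, since the real data are proprietary; the shapes and the N(0.05, 0.1²) values are our illustrative assumptions (a smaller user dimension is used to keep the example light):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical imputed CTR tensor (users x ads x publishers x devices);
# low-rank completion can produce values slightly outside [0, 1]
imputed = rng.normal(loc=0.05, scale=0.1, size=(50, 189, 19, 2))

# click-through rates are proportions, so truncate to the unit interval
ctr = np.clip(imputed, 0.0, 1.0)
```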
Furthermore, the tuning parameter for convex co-clustering is automatically selected using the eBIC (Section 7.1), while the number of clusters in CPD+k-means is chosen via the gap statistic (Tibshirani et al., 2001). We do not include comparisons with CoTeC given its poor performance in the simulation experiments. We first look at the results from clustering the click-through rates for users accessing the advertisements through a personal computer (PC). Table 1 contains the number of clusters identified as well as the sizes of the clusters, while Figure 13a visualizes the advertisement-by-publisher biclusters for a randomly selected user. As is to be expected, the advertisement-by-publisher slices display a checkerboard pattern, which turns into a checkerbox pattern when the slices are meshed together. The clustering results for the users are omitted in this paper to ensure user privacy. However, co-clustering the tensor does not result in the loss of information that would occur if the tensor were converted into a matrix by averaging across users or flattening along one of the modes. Table 1 and Figure 13a show that the CoCo estimator identifies four advertisement clusters, with one cluster being much bigger than the others. The advertisements in this large cluster have click-through rates that are close to the grand average in the data set. One of the small clusters has very low click-through rates, while the other two clusters tend to have much higher click-through rates than the rest of the advertisements. On the other hand, CPD+k-means clusters the advertisements into 57 groups, which is less useful from a practical standpoint. Many of the clusters are similarly sized and contain only a few advertisements, likely due to the inability of CPD+k-means to handle imbalanced cluster sizes, as was observed in the simulation experiments (Section 8.1.2).
In terms of the publishers, the CoCo estimator identifies 3 clusters while CPD+k-means does not find any underlying grouping and simply identifies one big cluster, which again is not terribly useful (Table 1).

Table 1: Advertising Data Clustering Results

           CoCo Estimator                              CPD+k-means
           Advertisements         Publishers           Advertisements  Publishers
  Device   # clusters  sizes      # clusters  sizes    # clusters      # clusters
  PC       4  (156, 22, 8, 3)     3  (4, 3, 12)        57              1
  Mobile   3  (145, 22, 22)       2  (7, 12)           49              13

We next provide some interpretations of the clustering results for the publishers. One way online advertisers can reach more users is by entering agreements with other companies to route traffic to the advertiser's website. For example, Google and Apple have a revenue-sharing agreement in which Google pays Apple a percentage of the revenue generated by searches on iPhones (McGarry, 2016). Similarly, the online company being studied partners with several internet service providers (ISPs) to host the default home pages for the ISPs' customers. It would make sense that these slightly different variants of the online company's main home page would have similar click-through rates, and the CoCo estimator in fact assigned these variants to the same cluster. For users accessing the advertisements through a mobile device, such as a mobile phone or tablet computer, the CoCo estimator results for the advertisements are largely similar to the results for PCs (Table 1 and Figure 13b). There is one large cluster that contains click-through rates similar to the overall average, while the two other equally sized clusters have relatively very low or very high click-through rates, respectively. The underlying click-through rates for the PC data have more variability than the mobile data, which is consistent with the identification of an additional cluster for the PC data.
As before, CPD+k-means finds a large number of advertisement clusters, most of which are roughly the same size, again likely impacted by the imbalance in the cluster sizes. When compared to the personal computer device, one difference is that the cluster with the higher click-through rates for mobile devices is larger and has a higher average click-through rate than the analogous clusters for the personal computer device. This finding is consistent with research by the Pew Research Center, which found that click-through rates for mobile devices are higher than for advertisements viewed on a personal computer or laptop (Mitchell et al., 2012). It is also enlightening to take a closer look at the underlying advertisements clustered across the two devices. All of the advertisements in the high click-through rate cluster for mobile devices are in the average click-through rate cluster for personal computers. In taking a closer look at the ads in these clusters, we find several ads related to online shopping for personal goods, such as jeans, workout clothes, or neckties. It makes sense to shop for these types of goods using a mobile device, such as while at work when it is not appropriate to do so on a work computer. Conversely, all of the advertisements in either of the two higher PC click-through rate clusters are in the large, average click-through rate cluster for mobile devices. There are several finance-related ads in these two PC clusters, such as for mortgages or general investment advice. On the other hand, there are not many online shopping ads in those clusters, with the exception of more expensive technology-related goods that one may want to invest more time in researching before making a purchase. In terms of the publisher clusters on mobile devices, Table 1 shows that the CoCo estimator identifies two clusters of publishers while CPD+k-means identifies 13 small clusters.
Contrary to the advertisement clusters, the publisher clusters across both devices are very similar. In fact, the only difference is that the smaller cluster for the mobile device, which contains seven publishers, is split into two clusters for personal computers.

[Figure 13: Advertisement and Publisher Click-Through Rate Biclusters for a Randomly Selected User: (a) Personal Computers, (b) Mobile. The rows correspond to different advertisements and the columns correspond to different publishers. Darker blue corresponds to higher click-through rates for a given device.]

This can be seen in the click-through rate heatmaps given in Figure 13 by looking at the right part of each heatmap. The publishers in these smaller clusters have higher click-through rates on average than those in the larger cluster. Additionally, five of the seven (71%) publishers in the high click-through rate clusters have stand-alone apps that display ads, while only three of the twelve (25%) publishers in the larger cluster do. For mobile devices, it has been observed that in-app advertisements have higher click-through rates than browser-based ads (Hof, 2014). We conjecture that this is also true for personal computer apps, which is consistent with the clustering results. Thus it again appears that the clusters identified by CoCo also make sense practically.

10. Discussion

In this paper, we formulated and studied the problem of co-clustering of tensors as a convex optimization problem. The resulting CoCo estimator enjoys features in theory and practice that are arguably lacking in existing alternatives, namely statistical consistency, stability guarantees, and an algorithm with polynomial computational complexity.
Through a battery of simulations, we observed that the CoCo estimator can identify co-clustering structures under realistic scenarios such as imbalanced co-cluster sizes, imbalanced numbers of clusters along each mode, heteroskedasticity in the noise distribution associated with each co-cluster, and even some violation of the checkerbox mean tensor assumption. We have leveraged the power of convex relaxation to engineer a computationally tractable co-clustering method that comes with statistical guarantees. These benefits, however, do not come for free. The CoCo estimator incurs costs similar to those incurred by using the lasso as a surrogate for a cardinality constraint or penalty. It is well known that the lasso leads to parameter estimates that are shrunk towards zero. This shrinkage toward zero is the price for simultaneously estimating the support, or locations of the nonzero entries, in a sparse vector as well as the values of the nonzero entries. In the context of convex co-clustering, the CoCo estimator Û is shrunk towards the grand mean tensor X̄, namely the tensor whose entries are all equal to the average over all entries of X. The weights, however, play a critical role in reducing this bias. In fact, the weights can be seen as serving the same role as the weights used in the adaptive lasso (Zou, 2006). There are several possible extensions and open problems that have been left for future work. First, we note that there is a gap between what our theory predicts and what seems possible from our experiments. Specifically, Theorem 9 assumes uniform weights for each mode, yet simulation experiments indicate that the CoCo estimator using Tucker-derived Gaussian kernel weights (9) can significantly outperform the CoCo estimator using uniform weights. One open problem is to derive prediction error bounds that relax the uniform weights assumption.
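The shrinkage-versus-weighting point discussed above can be illustrated with the scalar soft-thresholding operator, the proximal map of the lasso penalty: a uniform penalty pulls every estimate toward zero by the same amount, while adaptive, inverse-magnitude weights in the spirit of Zou (2006) shrink strong signals much less. This is an illustrative analogy of ours, not the CoCo proximal update itself.

```python
import numpy as np

def soft_threshold(x, lam):
    """prox of lam * |.|: every entry is pulled toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([0.2, 1.0, 5.0])      # "estimates" of varying strength
lam = 0.5
uniform = soft_threshold(x, lam)               # every entry shrunk by 0.5
adaptive = soft_threshold(x, lam / np.abs(x))  # big entries barely shrunk
```

With uniform shrinkage the large entry 5.0 loses the full 0.5, whereas the adaptive weight 1/|x| reduces its shrinkage to 0.1, mirroring how well-chosen fusion weights reduce the bias of the CoCo estimator.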
Second, although we have developed automatic methods for constructing the weights that work well empirically, other approaches to constructing the weights are a direction of future research. For example, other tensor approximation methods, such as the use of the ℓ1-norm to make the decomposition more robust to heavy-tailed noise as done by Cao et al. (2015), could possibly improve the quality of the weights. Third, in this paper we have focused on additive noise that is a zero-mean M-concentrated random variable. Real data, however, may not follow such a distribution, motivating co-clustering procedures that can handle outliers. To address potential robustness issues, the CoCo framework could be extended to handle outliers by swapping the sum of squared residuals term in (5) with an analogous Huber loss or Tukey's biweight function. Finally, while our first-order algorithm for co-clustering tensors scales linearly in the size of the data, data tensors inevitably will only increase in size, motivating the need for more scalable algorithms for computing the CoCo estimator. A natural approach would be to adopt an existing distributed version of the proximal methods, such as one of the methods proposed by Combettes and Pesquet (2011), Chen and Ozdaglar (2012), Li et al. (2013), or Eckstein (2017). Another natural approach would be to investigate whether stochastic versions of the recently proposed generalized dual gradient ascent (Ho et al., 2019) could be adapted to compute the CoCo estimator. Additionally, in practice many data tensors that we would like to co-cluster may be very sparse. The first-order algorithm presented here assumes the data tensor is dense. Consequently, an important direction of future work is to investigate alternative optimization algorithms that could leverage the sparsity structure within a data tensor.

Acknowledgments

The authors thank Xu Han for his help with the simulation experiments during the revision of this work.
The authors also thank the action editor and three reviewers for their helpful comments and suggestions, which led to a much improved presentation. Eric Chi acknowledges support from the National Science Foundation (DMS-1752692) and the National Institutes of Health (R01GM135928). Will Wei Sun acknowledges support from the Office of Naval Research (ONR N00014-18-1-2759). Hua Zhou acknowledges support from the National Institutes of Health (R01GM053275 and R01HG006139). Finally, this research collaboration was partially funded by the National Science Foundation under grant DMS-1127914 (the Statistical and Applied Mathematical Sciences Institute). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the National Institutes of Health, or the Office of Naval Research.

Appendix A. Tensor Decompositions

We review two basic tensor decompositions that generalize the singular value decomposition (SVD) of a matrix: (i) the CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang, 1970; Harshman, 1970) and (ii) the Tucker decomposition (Tucker, 1966). Just as the SVD can be used to construct a lower-dimensional approximation to a data matrix, these two decompositions can be used to construct a lower-dimensional approximation to a D-way tensor X ∈ R^{n_1 × n_2 × ··· × n_D}.

The CP decomposition aims to approximate X by a sum of rank-one tensors, namely

X ≈ Σ_{i=1}^{R} a_i^(1) ∘ a_i^(2) ∘ ··· ∘ a_i^(D),

where ∘ represents the outer product and a_i^(d) is the ith column of the dth factor matrix A^(d) ∈ R^{n_d × R}. The positive integer R denotes the rank of the approximation. For sufficiently large R, we can exactly represent X with a CP decomposition.
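The CP approximation above is easy to materialize for small tensors. A minimal sketch that forms the sum of rank-one terms from a list of factor matrices, each of shape (n_d, R) (the function name is ours):

```python
import numpy as np

def cp_reconstruct(factors):
    """Return sum_{i=1}^R a_i^(1) o a_i^(2) o ... o a_i^(D) for factor
    matrices A^(d) of shape (n_d, R), where o is the outer product."""
    R = factors[0].shape[1]
    X = np.zeros(tuple(A.shape[0] for A in factors))
    for i in range(R):
        component = factors[0][:, i]
        for A in factors[1:]:
            # grow the rank-one term one mode at a time
            component = np.multiply.outer(component, A[:, i])
        X += component
    return X
```

As a sanity check, in the 2-way case with D = 2 this reduces to the familiar matrix identity Σ_i a_i b_i^T = A B^T.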
The Tucker decomposition aims to approximate X by a core tensor H ∈ R^{R_1 × R_2 × ··· × R_D} multiplied by factor matrices along each of its modes, namely

X ≈ H ×_1 A^(1) ×_2 A^(2) ×_3 ··· ×_D A^(D) = Σ_{i_1=1}^{R_1} ··· Σ_{i_D=1}^{R_D} h_{i_1 i_2 ··· i_D} a_{i_1}^(1) ∘ a_{i_2}^(2) ∘ ··· ∘ a_{i_D}^(D),

where a_{i_d}^(d) is the i_d-th column of the dth factor matrix A^(d) ∈ R^{n_d × R_d}. Typically the columns of A^(d) are computed to be orthonormal and can be interpreted as principal components or basis vectors for the dth mode. For sufficiently large R_1, ..., R_D, we can exactly represent X with a Tucker decomposition.

Appendix B. Proofs of Smoothness Properties

B.1 Proof of Proposition 4

Without loss of generality, we can absorb γ into the weight matrices. Thus, we seek to show the continuity of Û with respect to (X, W_1, ..., W_D). We use the following compact representation of the weights:

w = (vec(W_1)^T, vec(W_2)^T, ..., vec(W_D)^T)^T ∈ R^{Σ_{d=1}^D C(n_d, 2)}.

We check that the solution Û is continuous in the variable ζ = (x^T, w^T)^T. It is easy to verify that the following function is jointly continuous in U and ζ:

f(U, ζ) = (1/2)‖X − U‖_F^2 + R_w(U).

Suppose to the contrary that there exist an ϵ > 0 and a sequence {ζ^(m)} converging to a limit ζ̄ such that ‖U^(m) − U(ζ̄)‖_F ≥ ϵ for all m, where U^(m) = arg min_U f(U, ζ^(m)). Since f(U, ζ) is strongly convex in U, the minimizer U^(m) exists and is unique. Without loss of generality, we can assume ‖ζ^(m) − ζ̄‖_F ≤ 1. This fact will be used later in proving the boundedness of the sequence U^(m). If U^(m) is a bounded sequence, then we can pass to a convergent subsequence with limit Ũ. Fix an arbitrary point Ū. Note that f(U^(m), ζ^(m)) ≤ f(Ū, ζ^(m)) for all m. Since f is continuous in (U, ζ), taking limits gives us the inequality f(Ũ, ζ̄) ≤ f(Ū, ζ̄). Since Ū was selected arbitrarily, it follows that Ũ = U(ζ̄), which is a contradiction. It only remains for us to show that the sequence U^(m) is bounded. Consider the function

g(U) = sup_{ζ : ‖ζ − ζ̄‖_F ≤ 1} { (1/2)‖X − U‖_F^2 + R_w(U) }.

Note that g is convex, since it is the point-wise supremum of a collection of convex functions.
Since f(U, ζ^(m)) ≤ g(U) and f is strongly convex in U, it follows that g(U) is also strongly convex and therefore has a unique global minimizer U* such that g(U*) < ∞. It also follows that

f(U^(m), ζ^(m)) ≤ f(U*, ζ^(m)) ≤ g(U*)   (14)

for all m. By the reverse triangle inequality it follows that

(1/2)(‖U^(m)‖_F − ‖X^(m)‖_F)^2 ≤ (1/2)‖U^(m) − X^(m)‖_F^2 ≤ f(U^(m), ζ^(m)).   (15)

Combining the inequalities in (14) and (15), we arrive at the conclusion that

(‖U^(m)‖_F − ‖X^(m)‖_F)^2 ≤ 2 g(U*)

for all m. Suppose the sequence U^(m) is unbounded, namely ‖U^(m)‖_F → ∞. Since X^(m) converges to X, the left-hand side must then diverge. Thus, we arrive at a contradiction if U^(m) is unbounded.

B.2 Proof of Proposition 5

First suppose that U_(d) = 1c^T, namely all the mode-d subarrays of U are identical. Recall that Z = U ×_d A if and only if Z_(d) = A U_(d). Therefore, R_d(U) = 0 since Δ_{d,ij} 1c^T = 0 for all (i, j) ∈ E_d. Now suppose that R_d(U) is zero. Take an arbitrary pair (i, j) with i < j. By Assumption 4.1, there exists a path i → k → ··· → l → j along which the weights are positive. Let w̃ denote the smallest weight along this path, namely w̃ = min{w_{d,ik}, ..., w_{d,lj}}. By the triangle inequality,

‖U ×_d Δ_{d,ij}‖_F ≤ ‖U ×_d Δ_{d,ik}‖_F + ··· + ‖U ×_d Δ_{d,lj}‖_F.

We can then conclude that

w̃ ‖U ×_d Δ_{d,ij}‖_F ≤ R_d(U) = 0.

It follows that e_i^T U_(d) = e_j^T U_(d), since w̃ is positive. Since the pair (i, j) is arbitrary, it follows that all the rows of U_(d) are identical, or in other words, U_(d) = 1c^T for some c ∈ R^{n/n_d}.

B.3 Proof of Proposition 6

We will show that there is a γ_max such that for all γ ≥ γ_max, the grand mean tensor X̄ is the unique global minimizer of the primal objective (4). We will certify that X̄ is the solution to the primal problem by showing that the optimal value of a dual problem, which lower bounds the primal, equals F_γ(X̄). Note that the Lagrangian dual given in (28) is a tight lower bound on F_γ(U):

sup_{λ ∈ C_γ} −(1/2)‖A^T λ‖_2^2 + ⟨λ, Ax⟩.
For sufficiently large $\gamma$, the solution to the dual maximization problem coincides with the solution to the unconstrained maximization problem
$$\max_\lambda\; -\tfrac{1}{2}\|A^\mathsf{T}\lambda\|_2^2 + \langle \lambda, Ax \rangle,$$
whose solution is $\hat{\lambda} = (AA^\mathsf{T})^\dagger Ax$. Plugging $\hat{\lambda}$ into the dual objective gives an optimal value of
$$\tfrac{1}{2}\|A^\mathsf{T}(AA^\mathsf{T})^\dagger Ax\|_2^2 = \tfrac{1}{2}\big\|x - [I - A^\mathsf{T}(AA^\mathsf{T})^\dagger A]x\big\|_2^2.$$
Note that $[I - A^\mathsf{T}(AA^\mathsf{T})^\dagger A]$ is the projection onto the orthogonal complement of the column space of $A^\mathsf{T}$, which is equivalent to the null space or kernel of $A$, denoted $\mathrm{Ker}(A)$. We will show below that $\mathrm{Ker}(A)$ is the span of the all-ones vector. Consequently,
$$[I - A^\mathsf{T}(AA^\mathsf{T})^\dagger A]x = \tfrac{1}{n}\langle x, 1\rangle\, 1.$$
Note that the smallest $\gamma$ such that $\hat{\lambda} \in C_\gamma$ is an upper bound on $\gamma_{\max}$.

We now argue that $\mathrm{Ker}(A)$ is the span of $1 \in \mathbb{R}^n$. We rely on the following fact: if $\Phi_d$ is an incidence matrix of a connected graph with $n_d$ vertices, then the rank of $\Phi_d$ is $n_d - 1$ (Deo, 1974, Theorem 7.2). According to Assumption 4.1, the mode-$d$ graphs are connected; it follows that $\Phi_d \in \{-1, 0, 1\}^{|E_d| \times n_d}$ has rank $n_d - 1$ and therefore that $\mathrm{Ker}(\Phi_d)$ has dimension one. Furthermore, since each row of $\Phi_d$ has one $1$ and one $-1$, it follows that $1 \in \mathrm{Ker}(\Phi_d) \subset \mathbb{R}^{n_d}$.

A vector $z \in \mathrm{Ker}(A)$ if and only if $z \in \mathrm{Ker}(A_d)$ for all $d$. Recall that the rank of the Kronecker product $A \otimes B$ is the product of the ranks of the matrices $A$ and $B$. This rank property of Kronecker products implies that the dimension of $\mathrm{Ker}(A_d)$ equals $n_{-d}$. Let
$$b_i = 1_{n_D} \otimes \cdots \otimes 1_{n_{d+1}} \otimes e_i \otimes 1_{n_{d-1}} \otimes \cdots \otimes 1_{n_1},$$
where $1_p \in \mathbb{R}^p$ is the vector of all ones and $e_i \in \mathbb{R}^{n_d}$ is the $i$th standard basis vector. Then the set of vectors $B = \{b_1, b_2, \ldots, b_{n_d}\}$ forms a basis for $\bigcap_{j \neq d} \mathrm{Ker}(A_j)$. Take an arbitrary element of this subspace, namely a vector of the form $1_{n''} \otimes a \otimes 1_{n'}$, where $n'' = \prod_{j=d+1}^D n_j$ and $n' = \prod_{j=1}^{d-1} n_j$. We will show that in order for $1_{n''} \otimes a \otimes 1_{n'}$ to belong to $\mathrm{Ker}(A_d)$, the vector $a$ must be a multiple of $1_{n_d}$. Consider the relevant matrix-vector product:
$$A_d\left(1_{n''} \otimes a \otimes 1_{n'}\right) = 1_{n_D} \otimes \cdots \otimes 1_{n_{d+1}} \otimes \Phi_d a \otimes 1_{n_{d-1}} \otimes \cdots \otimes 1_{n_1}.$$
Therefore, $A_d(1_{n''} \otimes a \otimes 1_{n'}) = 0$ if and only if $\Phi_d a = 0$. But the only way for $\Phi_d a$ to be zero is for $a = c1_{n_d}$ for some $c \in \mathbb{R}$. Thus, $\mathrm{Ker}(A)$ is the span of $1_n$.
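The kernel argument above can be replayed numerically on a small example. In this sketch (mode sizes and path graphs are my own choices), each $A_d$ is the Kronecker product $I \otimes \cdots \otimes \Phi_d \otimes \cdots \otimes I$; stacking them produces a matrix whose kernel is one-dimensional and contains the all-ones vector, exactly as the proof concludes.

```python
# Numerical replay of the kernel argument above (my own small example):
# for connected mode graphs, Ker(A) is spanned by the all-ones vector.
import numpy as np
from functools import reduce

def incidence(edges, n):
    # Oriented incidence matrix: one row e_i^T - e_j^T per edge (i, j).
    Phi = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        Phi[r, i], Phi[r, j] = 1.0, -1.0
    return Phi

dims = [2, 3, 4]
# Path graphs are connected, so each Phi_d has rank n_d - 1.
Phis = [incidence([(i, i + 1) for i in range(n - 1)], n) for n in dims]

def A_d(d):
    # A_d = I_{n_D} (x) ... (x) Phi_d (x) ... (x) I_{n_1}; Kronecker factors
    # run from mode D down to mode 1, matching the vec ordering.
    mats = [np.eye(n) for n in dims]
    mats[d] = Phis[d]
    return reduce(np.kron, list(reversed(mats)))

A = np.vstack([A_d(d) for d in range(len(dims))])
n = int(np.prod(dims))

print(n - np.linalg.matrix_rank(A))        # 1: Ker(A) is one-dimensional
print(np.allclose(A @ np.ones(n), 0.0))    # True: 1_n lies in Ker(A)
```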
B.4 Proof of Proposition 7

Note that $\hat{U}$ is the proximal mapping, evaluated at $\mathcal{X}$, of the closed, convex fusion penalty. Then $\hat{U}$ is firmly nonexpansive in $\mathcal{X}$ (Combettes and Wajs, 2005, Lemma 2.4). Finally, firmly nonexpansive mappings are nonexpansive, which completes the proof.

Appendix C. Proof of Theorem 9

We first prove some auxiliary lemmas before proving our prediction error result.

C.1 Auxiliary Lemmas

The following lemma considers the concentration of a random quadratic form $y^\mathsf{T} B y$ for an $M$-concentrated random vector $y$ and a deterministic matrix $B$ (Vu and Wang, 2015). It can be viewed as a generalization of the standard Hanson–Wright inequality for quadratic forms of independent sub-Gaussian random variables (Hanson and Wright, 1971).

Lemma 11 Let $y \in \mathbb{R}^n$ be an $M$-concentrated random vector; see Definition 8. Then there are constants $C, C' > 0$ such that for any matrix $B \in \mathbb{R}^{n \times n}$,
$$\mathbb{P}\left(|y^\mathsf{T} B y - \mathrm{tr}(B)| > t\right) \leq C \log(n) \exp\left[-C' M^{-2} \min\left(\frac{t^2}{\|B\|_\mathrm{F}^2 \log(n)}, \frac{t}{\|B\|_2}\right)\right].$$

The next lemma studies the properties of the matrix $A_{d,ij}$, defined in (6), in the penalty function. Denote by $S_d$ the matrix constructed by concatenating the $A_{d,ij}$, $i < j$, vertically. That is,
$$S_d = \left[A_{d,12}^\mathsf{T}\; A_{d,13}^\mathsf{T}\; \cdots\; A_{d,n_{d-1}n_d}^\mathsf{T}\right]^\mathsf{T} \in \mathbb{R}^{\left[\binom{n_d}{2} n_{-d}\right] \times n}. \quad (16)$$

Lemma 12 For each $d = 1, \ldots, D$, the rank of the matrix $S_d$ is $(n_d - 1)n_{-d}$. Denote by $\sigma_{\min}(S_d)$ and $\sigma_{\max}(S_d)$ the minimum non-zero singular value and the maximum singular value of $S_d$, respectively. We have $\sigma_{\min}(S_d) = \sigma_{\max}(S_d) = \sqrt{n_d}$.

The proof of Lemma 12 follows from Lemma 1 in Tan and Witten (2015) and is omitted. According to Lemma 12, we can construct a singular value decomposition $S_d = U_d \Lambda_d V_d^\mathsf{T}$, where $U_d \in \mathbb{R}^{\left[\binom{n_d}{2} n_{-d}\right] \times (n_d-1)n_{-d}}$, $\Lambda_d \in \mathbb{R}^{(n_d-1)n_{-d} \times (n_d-1)n_{-d}}$, and $V_d \in \mathbb{R}^{n \times (n_d-1)n_{-d}}$. Denote
$$G_d = U_d \Lambda_d \in \mathbb{R}^{\left[\binom{n_d}{2} n_{-d}\right] \times (n_d-1)n_{-d}}, \quad (17)$$
and its pseudo-inverse by $G_d^\dagger \in \mathbb{R}^{(n_d-1)n_{-d} \times \left[\binom{n_d}{2} n_{-d}\right]}$. The following lemma studies the properties of $G_d$ and $G_d^\dagger$ for each $d = 1, \ldots, D$.

Lemma 13 For each $d = 1, \ldots, D$, the rank of the matrix $G_d$ is $(n_d - 1)n_{-d}$.
The minimum non-zero singular value and the maximum singular value of $G_d$ are $\sigma_{\min}(G_d) = \sigma_{\max}(G_d) = \sqrt{n_d}$. Moreover, $\sigma_{\min}(G_d^\dagger) = \sigma_{\max}(G_d^\dagger) = 1/\sqrt{n_d}$.

Lemma 13 follows directly from the conclusions in Lemma 12.

C.2 Proof of Main Theorem

We first reformulate our optimization problem via a decomposition approach to simplify the theoretical analysis. This strategy was developed in Liu et al. (2013a) and has been successfully applied in Tan and Witten (2015) and Wang et al. (2018). Denote $\gamma_d = \gamma/n_d$. Our convex tensor co-clustering method is equivalent to solving
$$\hat{u} = \arg\min_u\; \tfrac{1}{2}\|x - u\|_2^2 + \sum_{d=1}^D \gamma_d \sum_{(i,j) \in E_d} \|A_{d,ij} u\|_2. \quad (18)$$
According to the definition of $S_d$ in (16), we define the penalty function $R(\cdot)$ such that
$$R(S_d u) = \sum_{(i,j) \in E_d} \|A_{d,ij} u\|_2.$$
According to the singular value decomposition $S_d = U_d \Lambda_d V_d^\mathsf{T}$, there exists a matrix $W_d \in \mathbb{R}^{n \times n_{-d}}$ such that $\tilde{V}_d = [W_d, V_d] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $W_d^\mathsf{T} V_d = 0$. Let $\alpha_d = W_d^\mathsf{T} u \in \mathbb{R}^{n_{-d}}$ and $\beta_d = V_d^\mathsf{T} u \in \mathbb{R}^{(n_d-1)n_{-d}}$. Clearly, we have
$$W_d \alpha_d + V_d \beta_d = W_d W_d^\mathsf{T} u + V_d V_d^\mathsf{T} u = \tilde{V}_d \tilde{V}_d^\mathsf{T} u = u \quad (19)$$
for any $d = 1, \ldots, D$. This fact, together with the definition of $G_d = U_d \Lambda_d$ in (17), implies that solving our convex tensor co-clustering problem in (18) is equivalent to solving
$$\min_{\alpha_d, \beta_d,\, d=1,\ldots,D}\; \sum_{d=1}^D \left[\frac{1}{2D}\|x - (W_d \alpha_d + V_d \beta_d)\|_2^2 + \gamma_d R(G_d \beta_d)\right]. \quad (20)$$
Denote the solution of (20) by $\hat{\alpha}_d, \hat{\beta}_d$, $d = 1, \ldots, D$, which corresponds to the estimator $\hat{u}$ in (18) according to (19). Similarly, we denote the true parameters by $\alpha_d^\star, \beta_d^\star$, corresponding to $u^\star$ defined in Assumption 4.2. Our goal is to derive an upper bound on $\|\hat{u} - u^\star\|_2^2$ via this reparametrization.

Since $\hat{\alpha}_d, \hat{\beta}_d$, $d = 1, \ldots, D$, minimize the objective function in (20), we have
$$\sum_{d=1}^D \left[\frac{1}{2D}\|x - (W_d \hat{\alpha}_d + V_d \hat{\beta}_d)\|_2^2 + \gamma_d R(G_d \hat{\beta}_d)\right] \leq \sum_{d=1}^D \left[\frac{1}{2D}\|x - (W_d \alpha_d^\star + V_d \beta_d^\star)\|_2^2 + \gamma_d R(G_d \beta_d^\star)\right].$$
Note that
$$\|x - \hat{u}\|_2^2 - \|x - u^\star\|_2^2 = \|\hat{u}\|_2^2 - \|u^\star\|_2^2 - 2x^\mathsf{T}(\hat{u} - u^\star) = \|\hat{u} - u^\star\|_2^2 - 2\epsilon^\mathsf{T}(\hat{u} - u^\star),$$
where the last equality is due to the model assumption $x = u^\star + \epsilon$. Therefore, we have
$$\frac{1}{2}\|\hat{u} - u^\star\|_2^2 + \sum_{d=1}^D \gamma_d R(G_d \hat{\beta}_d) \leq \epsilon^\mathsf{T}(\hat{u} - u^\star) + \sum_{d=1}^D \gamma_d R(G_d \beta_d^\star) = \frac{1}{D}\sum_{d=1}^D \underbrace{\epsilon^\mathsf{T}\left[W_d(\hat{\alpha}_d - \alpha_d^\star) + V_d(\hat{\beta}_d - \beta_d^\star)\right]}_{f(\hat{\alpha}_d, \hat{\beta}_d)} + \sum_{d=1}^D \gamma_d R(G_d \beta_d^\star). \quad (21)$$
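The singular-value facts from Lemmas 12 and 13 and the orthogonal split in (19) can be checked numerically in the simplest case $n_{-d} = 1$, where $S_d$ reduces to the stack of all pairwise difference rows $e_i^\mathsf{T} - e_j^\mathsf{T}$. The sketch below (my own reduction, not the paper's code) verifies that the nonzero singular values all equal $\sqrt{n_d}$ and that completing $V_d$ with the normalized all-ones vector reproduces $u = W_d\alpha_d + V_d\beta_d$.

```python
# Check of Lemma 12 and the split (19) in the n_{-d} = 1 case (my own
# reduction): S stacks all pairwise difference rows e_i^T - e_j^T.
import numpy as np
from itertools import combinations

n_d = 5
S = np.zeros((n_d * (n_d - 1) // 2, n_d))
for r, (i, j) in enumerate(combinations(range(n_d), 2)):
    S[r, i], S[r, j] = 1.0, -1.0

_, s, Vt = np.linalg.svd(S, full_matrices=False)
nonzero = s[s > 1e-10]
print(len(nonzero))                            # 4, i.e. rank n_d - 1
print(np.allclose(nonzero, np.sqrt(n_d)))      # True: all equal sqrt(n_d)

# Complete V_d to an orthogonal basis [W_d, V_d] of R^{n_d}: here
# Ker(S) = span(1), so W_d is the normalized all-ones vector.
V = Vt[:len(nonzero)].T
W = np.ones((n_d, 1)) / np.sqrt(n_d)
u = np.random.default_rng(4).standard_normal(n_d)
alpha, beta = W.T @ u, V.T @ u
print(np.allclose(W @ alpha + V @ beta, u))    # True: identity (19)
```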
Next we derive the bound for $f(\hat{\alpha}_d, \hat{\beta}_d)$. Note that the optimization over $\alpha_d$ in (20) has a closed-form solution since the penalty term is independent of $\alpha_d$. In particular, by setting the derivative of $\|x - (W_d \alpha_d + V_d \beta_d)\|_2^2$ with respect to $\alpha_d$ to zero, we obtain $\alpha_d = W_d^\mathsf{T}(x - V_d \beta_d)$. This implies that
$$\hat{\alpha}_d = W_d^\mathsf{T}(x - V_d \hat{\beta}_d) = W_d^\mathsf{T}(W_d \alpha_d^\star + V_d \beta_d^\star + \epsilon - V_d \hat{\beta}_d) = \alpha_d^\star + W_d^\mathsf{T}\epsilon, \quad (22)$$
where the second equality is due to $x = u^\star + \epsilon$ and the last equality is due to the facts that $W_d^\mathsf{T} V_d = 0$ and $W_d^\mathsf{T} W_d = I$. According to (22), we have
$$f(\hat{\alpha}_d, \hat{\beta}_d) = \epsilon^\mathsf{T} W_d W_d^\mathsf{T} \epsilon + \epsilon^\mathsf{T} V_d(\hat{\beta}_d - \beta_d^\star) \leq \underbrace{\left|\epsilon^\mathsf{T} W_d W_d^\mathsf{T} \epsilon\right|}_{(\mathrm{I})} + \underbrace{\left|\epsilon^\mathsf{T} V_d(\hat{\beta}_d - \beta_d^\star)\right|}_{(\mathrm{II})}. \quad (23)$$

Bound (I): We apply the concentration inequality in Lemma 11 to bound (I). It remains to compute $\|W_d W_d^\mathsf{T}\|_\mathrm{F}^2$ and $\|W_d W_d^\mathsf{T}\|_2$. By construction, $W_d W_d^\mathsf{T} \in \mathbb{R}^{n \times n}$ is a projection matrix since $\tilde{V}_d \tilde{V}_d^\mathsf{T} = W_d W_d^\mathsf{T} + V_d V_d^\mathsf{T} = I$. Therefore, the rank of $W_d W_d^\mathsf{T}$ is $n_{-d} = \prod_{j \neq d} n_j$, $\|W_d W_d^\mathsf{T}\|_\mathrm{F}^2 = n_{-d}$, $\|W_d W_d^\mathsf{T}\|_2 = 1$, and $\mathrm{tr}(W_d W_d^\mathsf{T}) = n_{-d}$. Denote $n = \prod_{d=1}^D n_d$. By Lemma 11 and Assumption 4.2, we have
$$\mathbb{P}\left(\epsilon^\mathsf{T} W_d W_d^\mathsf{T} \epsilon \geq t + n_{-d}\right) \leq C \log(n) \exp\left[-C' M^{-2} \min\left(\frac{t^2}{\log(n)\, n_{-d}}, t\right)\right].$$
Setting $t = \sqrt{n_{-d}} \log(n)$, we have
$$\mathbb{P}\left(\epsilon^\mathsf{T} W_d W_d^\mathsf{T} \epsilon \geq \log(n)\sqrt{n_{-d}} + n_{-d}\right) \leq C \exp\left\{\log\log(n) - C' M^{-2} \log(n)\right\}, \quad (24)$$
where the right-hand side converges to zero as the dimension $n = \prod_{d=1}^D n_d \to \infty$. Note that the error $\epsilon$ in Assumption 4.2 is assumed to be an $M$-concentrated random vector. If we assume the stronger condition that $\epsilon$ is a vector with iid sub-Gaussian entries, we can obtain the upper bound $\sqrt{\log(n)\, n_{-d}} + n_{-d}$ according to the Hanson–Wright inequality (Hanson and Wright, 1971). Therefore, in spite of the relaxation in the error assumption, our bound in (24) is larger by only a log term.

Bound (II): By the definitions of $G_d$ in (17) and $G_d^\dagger$, we have $G_d^\dagger G_d = I$. Furthermore, let $G^\dagger_{d,(i,j)}$ refer to the columns of $G_d^\dagger$ that correspond to the index $(i,j)$, and let $G_{d,(i,j)}$ refer to the rows of $G_d$ that correspond to the index $(i,j)$. We have
$$(\mathrm{II}) = \left|\epsilon^\mathsf{T} V_d(\hat{\beta}_d - \beta_d^\star)\right| = \left|\epsilon^\mathsf{T} V_d G_d^\dagger G_d(\hat{\beta}_d - \beta_d^\star)\right| = \Big|\sum_{i<j} \left(\epsilon^\mathsf{T} V_d G^\dagger_{d,(i,j)}\right) G_{d,(i,j)}(\hat{\beta}_d - \beta_d^\star)\Big| \leq \max_{i<j} \big\|(G^\dagger_{d,(i,j)})^\mathsf{T} V_d^\mathsf{T} \epsilon\big\|_2\, \sum_{i<j} \big\|G_{d,(i,j)}(\hat{\beta}_d - \beta_d^\star)\big\|_2.$$
Hence, with probability at least $1 - C_3/n$, we have
$$\max_{i<j} \big\|(G^\dagger_{d,(i,j)})^\mathsf{T} V_d^\mathsf{T} \epsilon\big\|_2 \lesssim \sqrt{n_{-d} \log(n) \log\binom{n_d}{2}}. \quad (25)$$
Plugging the results in (24) and (25) into (23), we obtain that, for each $d = 1, \ldots, D$,
$$f(\hat{\alpha}_d, \hat{\beta}_d) \lesssim \log(n)\sqrt{n_{-d}} + n_{-d} + \sqrt{n_{-d} \log(n) \log\binom{n_d}{2}}\; \sum_{i<j} \big\|G_{d,(i,j)}(\hat{\beta}_d - \beta_d^\star)\big\|_2.$$
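The quadratic-form concentration invoked in Bound (I) can also be illustrated empirically. The Monte Carlo sketch below (sizes, seed, and the Gaussian noise model are my own choices; it illustrates only the sub-Gaussian special case, not the $M$-concentrated generality of Lemma 11) shows $\epsilon^\mathsf{T} W_d W_d^\mathsf{T} \epsilon$ concentrating around $\mathrm{tr}(W_d W_d^\mathsf{T})$ for a rank-$r$ projection.

```python
# Monte Carlo illustration of the concentration behind Bound (I) (my own
# Gaussian special case, not the M-concentrated generality of Lemma 11).
import numpy as np

rng = np.random.default_rng(3)
n, r = 200, 50                       # r plays the role of n_{-d} = tr(W_d W_d^T)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
P = Q[:, :r] @ Q[:, :r].T            # projection: ||P||_2 = 1, ||P||_F^2 = r

# For Gaussian eps, eps^T P eps is chi-squared with r degrees of freedom.
draws = np.array([e @ P @ e for e in rng.standard_normal((2000, n))])
print(abs(draws.mean() - r) < 1.0)             # mean is close to tr(P) = 50
print(draws.std() < 3.0 * np.sqrt(2.0 * r))    # spread is on the sqrt(r) scale
```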