# provable_convex_coclustering_of_tensors__4ecc5e1b.pdf

Journal of Machine Learning Research 21 (2020) 1-58 Submitted 3/18; Revised 2/20; Published 10/20

Provable Convex Co-clustering of Tensors

Eric C. Chi eric chi@ncsu.edu Department of Statistics North Carolina State University Raleigh, NC 27695, USA

Brian R. Gaines brian.gaines@sas.com Advanced Analytics R&D SAS Institute Inc. Cary, NC 27513, USA

Will Wei Sun sun244@purdue.edu Krannert School of Management Purdue University West Lafayette, IN 47907, USA

Hua Zhou huazhou@ucla.edu Department of Biostatistics University of California Los Angeles, CA 90095, USA

Jian Yang jianyang@oath.com Advertising Sciences Yahoo Research Sunnyvale, CA 94089, USA

Editor: Francis Bach

Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientiﬁc and business applications. Moreover, there is a gap between statistical guarantees and computational eﬃciency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (Co Co) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the Co Co estimator, which reveals a surprising blessing of dimensionality phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical ﬁndings are supported by extensive simulated studies. Finally, we apply the Co Co estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising eﬀectiveness.

Keywords: Clustering, Fused lasso, High-dimensional Statistical Learning, Multiway Data, Non-asymptotic Error

c 2020 Eric C. Chi, Brian R. Gaines, Will Wei Sun, Hua Zhou, and Jian Yang.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-155.html.

Chi, Gaines, Sun, Zhou, and Yang

1. Introduction

In this work, we study the problem of ﬁnding structure in multiway data, or tensors, via clustering. Tensors appear frequently in modern scientiﬁc and business applications involving complex heterogeneous data. For example, data in a neurogenomics study of brain development consists of a 3-way array of expression level measurements indexed by gene, space, and time (Liu et al., 2017). Other examples of 3-way data arrays consisting of matrices collected over time include email communications (sender, recipient, time) (Papalexakis et al., 2013), online chatroom communications (user, keyword, time) (Acar et al., 2006), bike rentals (source station, destination station, time) (Guigour es et al., 2015), and internet network traﬃc (source IP, destination IP, time) (Sun et al., 2006). The rise in tensor data has created new challenges in making predictions, such as in recommender systems for example (Zheng et al., 2016; Symeonidis, 2016; Symeonidis and Zioupos, 2016; Frolov and Oseledets, 2017; Bi et al., 2018) as well as inferring latent structure in multiway data (Acar and Yener, 2009; Anandkumar et al., 2014; Cichocki et al., 2015; Sidiropoulos et al., 2017). As tensors become increasingly more common, the need for a reliable co-clustering method grows increasingly more urgent. Prevalent clustering methods, however, mainly focus on vector or matrix-variate data. The goal of vector clustering is to identify subgroups within the vector-variate observations (Ma and Zhong, 2008; Shen and Huang, 2010; Shen et al., 2012; Wang et al., 2013). Biclustering is the extension of clustering to twoway data where both the observations (rows) and the features (columns) of a data matrix are simultaneously grouped together (Hartigan, 1972; Madeira and Oliveira, 2004; Busygin et al., 2008). In spite of their prevalence, these approaches are not directly applicable to the cluster analysis of general-order (general-way) tensors. On the other hand, existing methods for co-clustering general D-way arrays, for D 3, employ one of three strategies: (i) extensions of spectral clustering to tensors (Wu et al., 2016b), (ii) directly clustering the subarrays along each dimension, or way, of the tensor using either k-means or variants on it (Jegelka et al., 2009), and (iii) low rank tensor decompositions (Sun et al., 2009; Papalexakis et al., 2013; Zhao et al., 2016). While all these existing approaches may demonstrate good empirical performance, they have limitations. For instance, the spectral co-clustering method proposed by Wu et al. (2016b) is limited to nonnegative tensors and the Co Te C method proposed by Jegelka et al. (2009), like k-means, requires specifying the number of clusters along each dimension as a tuning parameter. Most importantly, none of the existing methods provide statistical guarantees for recovering an underlying co-clustering structure. There is a conspicuous gap between statistical guarantees and computational eﬃciency for existing tensor clustering solutions due to the nature of the non-convex formulations of the previously mentioned works. In this paper, we propose a Convex Co-clustering (Co Co) procedure that solves a convex formulation of the problem of co-clustering a D-way array for D 3. Our proposed Co Co estimator aﬀords the following advantages over existing tensor co-clustering methods.

(i) Under modest assumptions on the data generating process, the Co Co estimator is guaranteed to recover an underlying co-clustering structure with high probability. In particular, we establish a non-asymptotic error bound for the Co Co estimator, which reveals a surprising blessing of dimensionality phenomenon: As the dimensions of the array increase, the Co Co estimator is still consistent even if the number of

Provable Convex Co-clustering of Tensors

underlying co-clusters grows as a function of the number of elements in the tensor sample. More importantly, an underlying co-clustering structure can be consistently recovered with even a single tensor sample, which is a typical case in real applications. This phenomenon does not exist in vector or matrix-variate cluster analysis.

(ii) The Co Co estimator possesses stability guarantees. In particular, the Co Co estimator is Lipschitz continuous in the data and jointly continuous in the data and its tuning parameter. We emphasize that Lipschitz continuity in the data guarantees that perturbations in the data lead to graceful and commensurate variations in the cluster assignments, and the continuity in the tuning parameter can be leveraged to expedite computation through warm starts.

(iii) The Co Co estimator can be iteratively computed with convergence guarantees via an accelerated ﬁrst order method with storage and per-iteration cost that is linear in the size of the data.

In short, the Co Co estimator comes with (i) statistical guarantees, (ii) practically relevant stability guarantees at all sample sizes, and (iii) an algorithm with polynomial complexity. The theoretical properties of our Co Co estimator are supported by extensive simulation studies. To demonstrate its business impact, we apply the Co Co estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to help advertising planning. Our work is related to, but also clearly distinct from, a number of recent developments in cluster analysis. The ﬁrst related line of research tackles convex clustering (Hocking et al., 2011; Zhu et al., 2014; Chi and Lange, 2015; Chen et al., 2015; Tan and Witten, 2015; Wang et al., 2018; Radchenko and Mukherjee, 2017) and convex biclustering (Chi et al., 2017). These existing methods are not directly applicable to general-order tensors, however. Importantly, our Co Co estimator enjoys a unique blessing of dimensionality phenomenon that has not been established in the aforementioned approaches. Moreover, the Co Co estimator is similar in spirit to a recent series of work approximating a noisy observed array with an array that is smooth with respect to some latent organization associated with each dimension of the array (Gavish and Coifman, 2012; Ankenman, 2014; Mishne et al., 2016; Yair et al., 2017). Our proposed Co Co procedure seeks an approximating array that is smooth with respect to a latent clustering along each dimension of the array. While Co Co shares features with these array approximation techniques, namely the use of data-driven similarity graphs along tensor modes, a key distinction between our Co Co estimator and these methods is that Co Co produces an approximating array that explicitly recovers hard co-clustering assignments. As we will see shortly, focusing our attention in this work on the co-clustering model paves the way to the discovery and explicit characterization of new and interesting fundamental behavior in ﬁnding intrinsic organization within tensors. The rest of the paper is organized as follows. In Section 2, we review standard facts and results about tensors that we will use. In Section 3, we introduce our convex formulation of the co-clustering problem. In Section 4, we establish the stability properties and prediction error bounds of the Co Co estimator. In Section 5, we describe the algorithm used to compute the Co Co estimator. In Section 6, we discuss how to specify weights used in our Co Co estimator, and in Section 7 we give guidance on how to set and select tuning

Chi, Gaines, Sun, Zhou, and Yang

parameters used in the Co Co estimator in practice. In Section 8, we present simulation results. In Section 9, we discuss the results of applying the Co Co estimator to co-cluster a real data tensor from online advertising. In Section 10, we close with a discussion. The Appendix contains a brief review of the two main tensor decompositions that are discussed in this paper, all technical proofs, as well as additional experiments.

2. Preliminaries

2.1 Notation

We adopt the terminology and notation used by Kolda and Bader (2009). We call the number of ways or modes of a tensor its order. Vectors are tensors of order one and denoted by boldface lowercase letters, e.g. a. Matrices are tensors of order two and denoted by boldface capital letters, e.g. A. Tensors of higher-order, namely order three and greater, we denote by boldface Euler script letters, e.g. A. Thus, if A represent a D-way data array of size n1 n2 n D, we say A is a tensor of order D. We denote scalars by lowercase letters, e.g. a. We denote the ith element of a vector a by ai, the ijth element of a matrix A by aij, the ijkth element of a third-order tensor A by aijk, and so on.

We can extract a subarray of a tensor by ﬁxing a subset of its indices. For example, by ﬁxing the ﬁrst index of a matrix to be i, we extract the ith row of the matrix, and by ﬁxing the second index of a matrix to be j, we extract a jth column of the matrix. We use a colon to indicate all elements of a mode. Consequently, we denote the ith row of a matrix A by Ai: and the jth column of a matrix A by A:j. Fibers are the subarrays of a tensor obtained by ﬁxing all but one of its indices. In the case of a matrix, a mode-1 ﬁber is a matrix column and a mode-2 ﬁber is a matrix row. Slices are the two-dimensional subarrays of a tensor obtained by ﬁxing all but two indices. For example, a third-order tensor A has three sets of slices denoted by Ai::, A:j:, and A::k.

2.2 Basic Tensor Operations

It is often convenient to reorder the elements of a D-way array into a matrix or a vector. Reordering a tensor s elements into a matrix is referred to as matricization, while reordering its elements into a vector is referred to as vectorization. There are many ways to reorder a tensor into a matrix or vector. In this paper, we use a canonical mode-d matricization, where the mode-d ﬁbers of a D-way tensor A Rn1 n2 n D become the columns of a matrix A(d) Rnd n d, where n d = Q j =d nj. Recall that the column-major vectorization of a matrix maps a matrix A Rp q to the vector a Rpq by stacking the columns of A on top of each other, namely a = AT :1 AT :2 AT :q T Rpq. In this paper, we take the vectorization of a D-way tensor A, denoted vec(A), to be the column-major vectorization of the mode-1 matriciziation of A, namely vec(A) = vec(A(1)) Rn, where n = Q d nd the total number of elements in A. As a shorthand, when the context leaves no ambiguity, we denote this vectorization of a tensor A by its boldface lowercase version a.

The Frobenius norm of a D-way tensor A Rn1 n2 n D is the natural generalization of the Frobenius norm of a matrix, namely it is the square root of the sum of the squares

Provable Convex Co-clustering of Tensors

of all its elements,

i D=1 a2 i1i2 i D.

The Frobenius norm of a tensor is equivalent to the ℓ2-norm of the vectorization of the tensor, namely A F = a 2. Let A be a tensor in Rn1 n2 n D and B be a matrix in Rm nd. The d-mode (matrix) product of the tensor A with the matrix B, denoted by A d B, is the tensor of size n1 nd 1 m nd+1 n D whose (i1, i2, , id 1, j, id+1, , i D)th element is given by

(A d B)i1...id 1jid+1 i D =

id=1 ai1i2 i Dbjid,

for j {1, . . . , m}. The vectorization of the d-mode product A d B can be expressed as

vec(A d B) = (In D Ind+1 B Ind 1 In1)a, (1)

where Ip is the p-by-p identity matrix and denotes the Kronecker product between two matrices. The identity given in (1) generalizes the well known formula for the column-major vectorization of a product of two matrices, namely vec(BA) = (I B)a.

3. A Convex Formulation of Co-clustering

We ﬁrst consider a convex formulation of co-clustering problem when the data is a 3-way tensor X Rn1 n2 n3 before discussing the natural generalization to D-way tensors. Our basic assumption is that the observed data tensor is a noisy realization of an underlying tensor that exhibits a checkerbox structure modulo some unknown reordering along each of its modes. Speciﬁcally suppose that there are k1, k2, and k3 clusters along modes 1, 2, and 3 respectively. If the (i1, i2, i3)-th entry in X belongs to the cluster deﬁned by the r1th mode-1 group, r2th mode-2 group, and r3th mode-3 group, then we assume that the observed tensor element xi1i2i3 is given by

xi1i2i3 = c r1r2r3 + ϵi1i2i3, (2)

where c r1r2r3 is the mean of the co-cluster deﬁned by the r1th mode-1 partition, r2th mode2 partition, and r3th mode-3 partition, and ϵi1i2i3 are noise terms. We will specify a joint distribution on the noise terms later in Section 4.2 in order to derive prediction bounds. Thus, we model the observed tensor X as the sum of a mean tensor U Rn1 n2 n3, whose elements are expanded from the co-cluster means tensor C Rk1 k2 k3, and a noise tensor E Rn1 n2 n3. We can write this expansion explicitly by introducing a membership matrix Md {0, 1}nd kd for the dth mode, where the ikth element of Md is one if and only if the ith mode-d slice belongs to the kth mode-d cluster for k {1, . . . , kd}. We require that each row of the membership matrix sum to one, namely Md1 = 1, to ensure that each of the mode-d slices belongs to exactly one of the kd mode-d clusters. Then,

U = C 1 M1 2 M2 3 M3.

Chi, Gaines, Sun, Zhou, and Yang

Figure 1: A 3-way tensor with a checkerbox structure

Figure 1 illustrates an underlying mean tensor U after permuting the slices along each of the modes to reveal a checkerbox structure. The co-clustering model in (2) is the 3-way analogue of the checkerboard mean model often employed in biclustering data matrices (Madeira and Oliveira, 2004; Tan and Witten, 2014; Chi et al., 2017). Moreover, the tensor C of co-cluster means corresponds to the tensor of cluster centers in the tensor clustering work by Jegelka et al. (2009). The model is complete and exclusive in that each tensor element is assigned to exactly one co-cluster. This is in contrast to models that allow potentially overlapping co-clusters (Lazzeroni and Owen, 2002; Bergmann et al., 2003; Turner et al., 2005; Huang et al., 2008; Witten et al., 2009; Lee et al., 2010; Sill et al., 2011; Bhar et al., 2015). Estimating the model in (2) consists of ﬁnding (i) the partitions along each mode and (ii) the mean values of each of the k1k2k3 co-clusters. Estimating c r1r2r3, given the mode clustering assignments is trivial. Let G1, G2, and G3 denote the indices of the r1th mode-1, r2th mode-2, and r3th mode-3 groups respectively. If the noise terms ϵi1i2i3 are iid N(0, σ2) for some positive σ2, then the maximum likelihood estimate of c r1r2r3 is simply the sample mean of the entries of X over the indices deﬁned by G1, G2, and G3, namely

ˆc r1r2r3 = 1 |G1||G2|||G3|

i3 G3 xi1i2i3.

Finding the partitions G1, G2, and G3, on the other hand, is a combinatorially hard problem. In recent years, however, many combinatorially hard problems, that initially appear computationally intractable, have been successfully attacked by solving a convex relaxation to the original combinatorial optimization problem. Perhaps the most celebrated convex relaxations is the lasso (Tibshirani, 1996), which simultaneously performs variable selection and parameter estimation for ﬁtting sparse regression models by minimizing a non-smooth convex criterion. In light of the lasso s success, we propose to simultaneously identify partitions along the modes of X and estimate the co-cluster means by minimizing the following convex objective function

Fγ(U) = 1 2 X U 2 F + γ R1(U) + R2(U) + R3(U)

| {z } R(U)

Provable Convex Co-clustering of Tensors

i<j w1,ij Ui:: Uj:: F

i<j w2,ij U:i: U:j: F

i<j w3,ij U::i U::j F.

By seeking the minimizer ˆUγ Rn1 n2 n3 of (3), we have cast co-clustering as a signal approximation problem, modeled as a penalized regression, to estimate the true co-cluster means tensor U . In the following discussion, we drop the dependence of γ in ˆUγ and denote our estimator as ˆU when there is no confusion. The quadratic term in (3) quantiﬁes how well U approximates X, while the regularization term R(U) in (3) penalizes deviations away from a checkerbox pattern. The nonnegative parameter γ tunes the relative emphasis on these two terms. The parameters wd,ij are nonnegative weights whose purpose will be discussed shortly. To appreciate how the regularization term R(U) steers the minimizer of (3) towards a checkerbox pattern, consider the eﬀect of one of the terms Rd(U) in isolation. Speciﬁcally, suppose that R(U) = R1(U). When γ is zero, the minimum of (3) is attained when U = X. Or stated another way, Ui:: = Xi:: for i {1, . . . , n1}. As γ increases, the mode-1 slices Ui:: will shrink towards each other and in fact coalesce due to the non-diﬀerentiability of the Frobenius norm at zero. In other words, as γ gets larger, the pairwise diﬀerences of the mode-1 slices of ˆU will become increasingly sparser. Sparsity in these pairwise diﬀerences leads to a natural partitioning assignment. Two mode-1 slices Xi:: and Xj:: are assigned to the same mode-1 partition if Ui:: = Uj::. Under mild regularity conditions, that we will spell out in Section 4, for suﬃciently large γ, all mode-1 slices ˆU will be identical and therefore belong to a single cluster. Similar behavior holds if R(U) = R2(U) or R(U) = R3(U). When R(U) includes all three terms Rd(U) for d = 1, 2, 3, pairs of mode-1, mode-2, and mode-3 slices are simultaneously shrunk towards each other and coalesce as the parameter γ increases. By coupling clustering along each of the modes simultaneously, our formulation explicitly seeks out a solution with a checkerbox mean structure. Moreover, we will show in Section 4 that the solution ˆU produces an entire solution path of checkerbox co-clustering estimates that varies continuously in γ. The solution path spans a range of models from the least smoothed model, where ˆU is X and each tensor element occupies its own co-cluster, to the most smoothed model, where all the elements of ˆU are identical and all tensor elements belong to a single co-cluster. The nonnegative weights wd,ij ﬁne tune the shrinkage of the slices along the dth mode. For example, if w1,ij > w1,i j , then there will be more pressure for Ui:: and Uj:: to fuse than for Ui :: and Uj :: to fuse as γ increases. Thus, the weight wd,ij quantiﬁes the similarity between the ith and jth mode-d slices. A very large wd,ij indicates that the two slices are very similar, while a very small wd,ij indicates that they are very dissimilar. These pairwise similarities motivate a graphical view of clustering. For the dth mode, deﬁne the set Ed as the edge set of a similarity graph. Each slice is a node in the graph and the set Ed contains an edge (i, j) if and only if wd,ij > 0. Figure 2 shows an example of a mode-1 similarity

Chi, Gaines, Sun, Zhou, and Yang

Figure 2: A graph that summarizes the similarities between pairs of the mode-1 subarrays. Only edges with positive weight are drawn.

graph, which corresponds to a tensor with seven mode-1 slices and positive weights that deﬁne the edge set

E1 = {(1, 2), (2, 3), (4, 5), (4, 6), (6, 7)}.

Given the connectivity of the graph, as γ increases, the slices U1::, U2::, and U3:: will be shrunk towards each other while the slices U4::, U5::, U6:: and U7:: shrunk towards each other. Since wd,ij = 0 for any (i, j) / Ed, we can express the penalty terms for the dth mode as

(i,j) Ed wd,ij Ui:: Uj:: F.

The graph in Figure 2 makes readily apparent that the convex objective in (3) separates over the connected components of the similarity graph for the mode-d slices. Consequently, one can solve for the optimal U component by component. Without loss of generality, we assume that the weights are such that all the similarity graphs are connected. Before leaving this preliminary description of the weights, however, we want to emphasize that in practice weights are set once in a data-adaptive manner and should be considered empirically chosen hyper-parameters rather than tuning parameters. Further discussion of the weights and practical recommendations for specifying them will be discussed in Section 6. Having familiarized ourselves with the convex co-clustering of a 3-way array, we now present the natural extension of (3) for clustering the ﬁbers of a general higher-order tensor X Rn1 n D along all its D modes. Let d,ij = e T i e T j where ei is the ith standard basis vector in Rnd. The objective function of our convex co-clustering for a general higher-order tensor is as follows.

Fγ(U) = 1 2 X U 2 F + γ

(i,j) Ed wd,ij U d d,ij F. (4)

The diﬀerence between the convex triclustering objective (3) and the general convex co-clustering objective (4) is in the penalty terms. Previously in (3) we penalized the diﬀerence between pairs slices whereas in (4) we penalize the diﬀerences between pairs of mode-d subarrays. Note that the function Fγ(U) deﬁned in (4) has a unique global minimizer. This follows immediately from the fact that Fγ(U) is strongly convex. The unique global minimizer of

Provable Convex Co-clustering of Tensors

Fγ(U) is our proposed Co Co estimator, which is denoted by ˆU for the remainder of the paper. At times it will be more convenient to work with vectors rather than tensors. By applying the identity in (1), we can rewrite the objective function in (4) in terms of the vectorizations of U and X as follows

Fγ(u) = 1 2 x u 2 2 + γ

(i,j) Ed wd,ij Ad,iju 2. (5)

where Ad,ij is the n d-by-n matrix

Ad,ij = In D Ind+1 d,ij Ind 1 In1 (6)

where Ind is the nd-by-nd identity matrix. We will refer to the unique global minimizer of (5), ˆu = arg minu Fγ(u), as the vectorized version of our Co Co estimator.

Remark 1 The fusion penalties Rd(U) are a composition of the group lasso (Yuan and Lin, 2006) and the fused lasso (Tibshirani et al., 2005), a special case of the generalized lasso (Tibshirani and Taylor, 2011). When only a single mode is being clustered and only one of the terms Rd(U) is employed, we recover the objective function in the convex clustering problem (Pelckmans et al., 2005; She, 2010; Lindsten et al., 2011; Hocking et al., 2011; Sharpnack et al., 2012; Zhu et al., 2014; Chi and Lange, 2015; Radchenko and Mukherjee, 2017). Most prior work on convex clustering employ an element-wise ℓ1-norm penalty on pairwise diﬀerences, as in the original fused lasso, however, ℓ2-norm and ℓ -norm have also been considered (Hocking et al., 2011; Chi and Lange, 2015). In this paper, we restrict ourselves to the ℓ2-norm for two reasons. First, the ℓ2-norm is rotationally invariant. In general, we are reluctant to adopt a procedure whose co-clustering output may non-trivially change when the coordinate representation of the data along one of its modes is trivially changed. Second, the ℓ2-norm promotes the group-wise shrinkage of pairwise diﬀerences of subarrays along each mode leading to more straightforward partitioning along each mode. Pairwise diﬀerences are either exactly zero or not. When the tensor is a matrix and the rows and columns are being simultaneously clustered, we recover the objective function in the convex biclustering problem (Chi et al., 2017). In general, the fusion penalties Rd(U) shrink solutions to vector valued functions that are piece-wise constant over the mode-d similarity graph deﬁned by the weights wd,ij. Viewed this way, we can see our approach as simultaneously performing the network lasso (Hallac et al., 2015) on D similarity graphs.

Remark 2 The Co Co estimator is invariant to permutations in the data tensor X in the following sense. Suppose ˆU and ˆU are the Co Co estimators when the data tensors are respectively X and X = X 1Π1 2 DΠD where Π1 {0, 1}n1 n1, . . . , ΠD {0, 1}n D n D are permutation matrices, namely ΠT d Πd = I. In words, X can be obtained from X by permuting the subarrays of X along the dth mode according to Πd for d = 1, . . . , D, and X can be recovered from X by permuting along the dth mode according to ΠT d for d = 1, . . . , D. Since U 1 Π1 2 D ΠD F = U F, it follows that

ˆU = ˆU 1 Π1 2 D ΠD and ˆU = ˆU 1 ΠT 1 2 D ΠT D.

Chi, Gaines, Sun, Zhou, and Yang

Permutation invariance is important because it means that the Co Co estimator is essentially unaltered by any reshuﬄing along the modes of the data tensor.

Remark 3 Given the co-clustering structure assumed in (2), one may wonder how much is added by explicitly seeking a co-clustering over clustering along each mode independently. In other words, why not solve D independent convex clustering problems with R(U) = Rd(U)? To provide some intuition on why co-clustering should be preferred over independently clustering each mode, consider the following problem. Imagine trying to cluster row vectors xi R10,000 for i = 1, . . . , 100 drawn from a two-component mixture of Gaussians, namely

xi iid 1 2N(µ, σ2I) + 1

2N(ν, σ2I).

This is a challenging clustering problem due to the disproportionately small number of observations compared to the number of features. If, however, we were told that µj = µ1 and νj = ν1 for j = 1, . . . , 5, 000 and µj = µ2 and νj = ν2 for i = 5, 001, . . . , 10, 000, in other words that the features were clustered into two groups, our fortunes have reversed and we now have an abundance of observations compared to the number of eﬀective features. Even if we lack a clear-cut clustering structure in the features, this example suggests that leveraging similarity structure along the columns can expedite identifying similarity structure along the rows, and vice versa. Indeed, if there is an underlying checkerbox mean tensor we may expect that simultaneously clustering along each mode should make the task of clustering along any one given mode easier. Our prediction error result presented in Section 4.2 in fact supports this suspicion (See Remark 10).

4. Properties

We ﬁrst discuss how the Co Co estimator ˆU behaves as a function of the data tensor X, the tuning parameter γ, and the weights wd,ij. We will then present its statistical properties under mild conditions on the data generating process. We highlight that these properties hold regardless of the algorithm used to minimize (4), as they are intrinsic to its convex formulation. All proofs are given in Appendix B and Appendix C.

4.1 Stability Properties

The Co Co estimator varies smoothly with respect to X, γ, and {wd,ij}. Let Wd = {wd,ij} denote the weights matrix for mode d.

Proposition 4 The minimizer ˆU of (4) is jointly continuous in (X, γ, W1, W2, . . . , WD).

As noted earlier, in practice we will typically ﬁx the weights wd,ij and compute the Co Co estimator over a grid of the penalization parameters γ in order to select a ﬁnal Co Co estimator from among the computed candidate estimators of varying levels of smoothness. Since (4) does not admit a closed form minimizer, we resort to iterative algorithms for computing the Co Co estimator. Continuity of ˆU in γ can be leveraged to expedite computation through warm starts, namely using the solution ˆUγ as the initial guess for iteratively computing ˆUγ where γ is slightly larger or smaller than γ. Due to the continuity of ˆU in γ, small changes

Provable Convex Co-clustering of Tensors

in γ will result in small changes in ˆU. Empirically the use of warm starts can lead to a non-trivial reduction in computation time (Chi and Lange, 2015). From the continuity in γ, we also see that convex co-clustering performs continuous co-clustering just as the lasso (Tibshirani, 1996) performs continuous variable selection. The penalization parameter γ tunes the complexity of the Co Co estimator. Clearly when γ = 0, the Co Co estimator coincides with the data tensor, namely ˆU = X. The key to understanding the Co Co estimator s behavior as γ increases is to recognize that the penalty functions Rd(U) are semi-norms. Under suitable conditions on the weights given in Assumption 4.1 below, Rd(U) vanishes if and only if the mode-d subarrays of U are identical.

Assumption 4.1 For any pair of mode-d subarrays, indexed by i and j with i < j, there exists a sequence of indices i k l j along which the weights, wd,ik, . . . , wd,lj are positive.

Proposition 5 Under Assumption 4.1, Rd(U) = 0 if and only if U(d) = 1c T for some c Rn d.

To give some intuition for Proposition 5, note that the term Rd(U) separates over the connected components of the mode-d similarity graph. Therefore, the term Rd(U) penalizes variation in the mode-d subarrays over the connected components of the mode-d similarity graph. Assumption 4.1, states that the mode-d similarity graph is connected. Thus, the only way for Rd(U) to attain its minimum value and vanish under Assumption 4.1, is if there is no variation in U along its mode-d subarrays. Proposition 5 suggests that if Assumption 4.1 holds for all d = 1, . . . , D then as γ increases the Co Co estimator converges to the solution of the following constrained optimization problem:

min u 1 2 x u 2 F subject to u = c1 for some c R,

the solution to which is just the global mean x, whose entries are all identically the average value of x over all its entries. The next result formalizes our intuition that as γ increases, the Co Co estimator will eventually coincide with x.

Proposition 6 Suppose Assumption 4.1 holds for d = 1, . . . , D, then Fγ(U) is minimized by the grand mean X for γ suﬃciently large.

Thus, as γ increases from 0, the Co Co estimator ˆU traces a continuous solution path that starts from n co-clusters, consisting of ui1 i D = xi1 i D, to a single co-cluster, where ui1 i D = x T1/n for all i1, . . . , i D. For a ﬁxed γ, we can derive an explicit bound on sensitivity of the Co Co estimator to perturbations in the data.

Proposition 7 The minimizer ˆU of (4) is a nonexpansive or 1-Lipschitz function of the data tensor X, namely

Chi, Gaines, Sun, Zhou, and Yang

ˆU(X) ˆU( X) F X X F.

Nonexpansivity of ˆU in X provides an attractive stability result. Since ˆU varies smoothly with the data, small perturbations in the data are guaranteed to not lead to large variability of ˆU, or consequently large variability in the cluster assignments. In a special case of our method, Chi et al. (2017) showed empirically that the co-clustering assignments made by the 2-way version of the Co Co estimator was noticeably less sensitive to perturbations in the data than those made by several existing biclustering algorithms.

4.2 Statistical Properties

We next provide a ﬁnite sample bound for the prediction error of the Co Co estimator. For simplicity, we consider the case where we take uniform weights within a mode in (5), namely wd,ij = wd,i j = 1/nd for all i, j, i , j {1, . . . , nd}. Such uniform weight assumption has also been imposed in the analysis of the vector-version of convex clustering (Tan and Witten, 2015). In order to derive the estimation error of ˆu, we ﬁrst deﬁne an important deﬁnition for the noise and introduce two regularity conditions.

Deﬁnition 8 (Vu and Wang (2015)) We say a random vector y Rn is M-concentrated if there are constants C1, C2 > 0 such that for any convex, 1-Lipschitz function φ : Rn R and any t > 0,

P φ(y) E[φ(y)] t C1 exp C2t2

The M-concentrated random variable is more general than the Gaussian or sub-Gaussian random variables, and it allows dependence in its coordinates. Vu and Wang (2015) provided a few examples of M-concentrated random variables. For instance, if the coordinates of y are iid standard Gaussian, then y is 1-concentrated. If the coordinates of y are independent and M-bounded, then y is M-concentrated. If the coordinates of y come from a random walk with certain mixing properties, then y is M-concentrated for some M.

Assumption 4.2 (Model) We assume the true cluster center C Rk1 k D has a checkerbox structure such that the mode-d subarrays have kd diﬀerent values (number of clusters along the dth mode), and each entry of C is bounded above by a constant C0 > 0. Deﬁne U Rn1 n D as the true parameter expanded based on C , namely

U = C 1 M1 2 M2 3 D MD,

where Md {0, 1}nd kd are binary mode-d cluster membership matrices such that Md1 = 1. Denote u = vec(U ) Rn with n = QD d=1 nd. We assume the samples belonging to the (r1, . . . , r D)-th cluster satisfy

xi1,...,i D = c r1,...,r D + ϵi1,...,i D,

with id {1, . . . , nd} and rd {1, . . . , kd}. Furthermore, we assume ϵ = vec(E) is a M-concentrated random variable deﬁned in (8) with mean zero.

Provable Convex Co-clustering of Tensors

The checkerbox means model in Assumption 4.2 provides the underlying cluster structure of the tensor data. As a special case, Assumption 4.2 with D = 2 reduces to the model assumption underlying convex biclustering (Chi et al., 2017). In contrast to the independent sub-Gaussian condition assumed in vector-version convex clustering (Tan and Witten, 2015), our error condition is much weaker since we allow for non-sub-Gaussian distributions as well as allow for dependence among its coordinates.

Assumption 4.3 (Tuning) The tuning parameter γ satisﬁes

D γ 2c0 log(n) n

for some constant c0 > 1.

Theorem 9 Suppose that Assumption 4.2 and Assumption 4.3 hold. The estimation error of ˆu in (5) with uniform weights satisﬁes,

ˆu u 2 2 1 D

nd + log(n) nnd

j =d kj, (7)

with a high probability, where C = 12c0C2 0 is a positive constant, and kd is the true number of clusters in the dth mode.

Theorem 9 provides a ﬁnite sample error bound for the proposed Co Co tensor estimator. Our theoretical bound allows the number of clusters in each mode to diverge, which reﬂects a typical large-scale clustering scenario in big tensor data. A notable consequence of Theorem 9 is that, when D 3, namely a higher-order tensor with at least 3 modes, the Co Co estimator can achieve estimation consistency along all the D modes even when we only have one tensor sample. Here the sample size refers to the number of available tensor samples. In our tensor clustering problem, we only have access to one tensor sample. This property is uniquely enjoyed by co-clustering of tensor data with D 3, and has not been previously established in the existing literature on vector clustering or biclustering. To see this, when nd are of the same order as n0, and kd are of the same order as k0, a suﬃcient condition for the consistency is that n0 and k0 = o n(D 2)/(D 1) 0 up to a log term. When D = 3, the Co Co estimator is consistent so long as the number of clusters k0 in each mode diverges slightly slower than n0. Remarkably, as we have more modes in the tensor data, this constraint on the rate of divergence of k0 gets weaker. In short, we reap a unique and surprisingly welcome blessing of dimensionality phenomenon in the tensor co-clustering problem.

Remark 10 Next we discuss the connections of our bound (7) with prior results in the literature. An intermediate step in the proof of Theorem 9 indicates that the estimation error in the dth mode is on the order of 1/nd + log(n)/ nnd + log(n) q

nd Q j =d kj/n d. In the clustering along the rows of a data matrix, our rate matches with that established for vector-version convex clustering (Tan and Witten, 2015), up to a log term p

log(n). Such

Chi, Gaines, Sun, Zhou, and Yang

a log term is due to that fact that Tan and Witten (2015) considers the error to be iid sub Gaussian while we consider a general M-concentrated error. In practice, the iid assumption on the noise ϵ = vec(E) could be restrictive. Consequently, our theoretical analysis is built upon a new concentration inequality of quadratic forms recently developed in Vu and Wang (2015). In addition, our rate reveals an interesting theoretical property of the convex biclustering method proposed by Chi et al. (2017). When D = 2, our rate indicates that the estimation error along the row and column of the data matrix is log(n1n2) p

n1k2/n2 and log(n1n2) p

n2k1/n1, respectively. Clearly, both errors can not converge to zero simultaneously. This indicates a disadvantage of matricizing a data tensor for co-clustering.

5. Estimation Algorithm

We next discuss a simple ﬁrst order method for computing the solution to the convex coclustering problem. The proposed algorithm generalizes the variable splitting approach introduced for convex clustering problem described in Chi and Lange (2015) to the Co Co problem. The key observation is that the Lagrangian dual of an equivalent formulation of the convex co-clustering problem is a constrained least squares problem that can be iteratively solved using the classic projected gradient algorithm.

5.1 A Lagrangian Dual of the Co Co Problem

Recall that we seek to minimize the objective function in (5)

Fγ(u) = 1 2 x u 2 2 + γ

l Ed wd,l Ad,lu 2.

Note that we have enumerated the edge indices in Ed to simplify the notation for the following derivation. We perform variable splitting and introduce the dummy variables vd,l = Ad,lu. Let Vd denote the n d |Ed| matrix whose lth column is vd,l. Further denote the vectorization of Vd by vd = vec(Vd) and let v = v T 1 v T 2 v T D T denote the vector obtained by stacking the vectors vd on top of each other. We now solve the equivalent equality constrained minimization

min v,u 1 2 x u 2 2 + γ

l Ed wd,l vd,l 2 subject to vd = Adu,

where Ad = (In D Ind+1 Φd Ind 1 In1) and Φd is the oriented edge-vertex incidence matrix for the dth mode graph, namely

1 If node v is the head of edge l 1 If node v is the tail of edge l 0 otherwise.

We introduce dual variables λd corresponding to the equality constraint vd = Adu. Let Λd denote the n d |Ed| matrix whose lth column is λd,l. Further denote the vectorization

Provable Convex Co-clustering of Tensors

of Λd by λd = vec(Λd) and λ = λT 1 λT 2 λT D T. The Lagrangian dual objective is given by

G(λ) = 1 2 x 2 2 1

2 x ATλ 2 2

l Ed ιCd,l(λd,l),

where A = AT 1 AT 2 AT D T and ιCd,l is the indicator function of the closed convex set Cd,l = {z : z 2 γwd,l}, namely ιCd,l is the function that vanishes on the set of Cd,l and is inﬁnity on the complement of Cd,l. Details on the derivation of the dual objective G(λ) are provided in Appendix D. Maximizing the dual objective G(λ) is equivalent to solving the following constrained least squares problem:

min λ C 1 2 x ATλ 2 2, (8)

where C = {λ : λd,l Cd,l, l Ed, d = 1, . . . , D}. We can recover the primal solution via the relationship:

ˆu = x ATˆλ,

where ˆλ is a solution to the dual problem (8). The dual problem (8) has at least one solution by the Weierstrass extreme value theorem, but the solution may not be unique since AT has a non-trivial kernel. Nonetheless, our Co Co estimator ˆu is still unique since ATˆλ1 = ATˆλ2 for any solutions ˆλ1, ˆλ2 to the problem (8). We numerically solve the constrained least squares problem in (8) with the projected gradient algorithm, which alternates between taking a gradient step and projecting onto the set C. Algorithm 1 provides pseudocode of the projected gradient algorithm, which has several good features. The projected gradient algorithm is guaranteed to converge to a global minimizer of (8). Its per-iteration and storage costs using the weight choices, described in Section 6, are both O(Dn), namely linear in either the number of dimensions D or in the number of elements n. For a modest additional computational and storage cost, we can accelerate the projected gradient method, for example with FISTA (Beck and Teboulle, 2009) or Spa RSA (Wright et al., 2009). In our experiments, we use a version of the latter, namely FASTA (Goldstein et al., 2014, 2015). Additional details on the derivation of the algorithmic updates, convergence guarantees, computational and storage costs, as well as stopping rules can be found in Appendix E.

6. Specifying Non-Uniform Weights

In Section 4.2, we assumed uniform weights wd,ij in the penalty terms Rd(U) to establish a prediction error bound, which revealed a surprising and beneﬁcial blessing of dimensionality phenomenon. Although this simplifying assumption gives clarity and insight into how the co-clustering problem gets easier as the number of modes increases, in practice choosing non-uniform weights can substantially improve the quality of the clustering results. In the context of convex clustering, Chen et al. (2015) and Chi and Lange (2015) provided

Chi, Gaines, Sun, Zhou, and Yang

Algorithm 1 Convex Co-Clustering (Co Co) Estimation Algorithm

Initialize λ(0); for m = 0, 1, . . .

u(m+1) = x ATλ(m) Gradient Step for d = 1, . . . , D do

for l Ed do

λ(m+1) d,l = PCd,l λ(m) d,l + ηAd,lu(m+1) Projection Step end for end for until convergence

163 243 323 403 483 563 Number Observations, n

Adjusted Rand Index

Co Co Non Uniform Weight Co Co Uniform Weight

Figure 3: Uniform versus non-uniform weights: Average Adjusted Rand Index for an increasing size. Here n = n3 0 refers to a tensor of size n0 n0 n0.

empirical evidence that convex clustering with uniform weights struggled to produce exact sparsity in the pairwise diﬀerences of smooth estimates when there was not a strong separation between groups. Indeed, similar phenomena were observed in earlier work on the related clustered lasso (She, 2010). Several related works (She, 2010; Hocking et al., 2011; Chen et al., 2015; Chi and Lange, 2015) recommend a weight assignment strategy described below. In addition, the use of sparse weights can also lead to non-trivial improvements in both computational time and clustering performance (Chi and Lange, 2015; Chi et al., 2017). To illustrate the practical value of non-uniform weights, we compare Co Co s ability to recover co-clusters, using both uniform and non-uniform weights, as the size of a 3-way tensor increases when there are two clusters per mode with balanced cluster sizes along each mode. We assess the quality of the recovered clustering performance using the Adjusted Rand Index (ARI). The ARI (Hubert and Arabie, 1985) varies between -1 and 1, where 1 indicates a perfect match between two clustering assignments whereas a value close to zero indicates the two clustering assignments match about as might be expected if they were both randomly generated. Negative values indicate that there is less agreement between clusterings than expected from random partitions.

Provable Convex Co-clustering of Tensors

Figure 3 shows a comparison between using non-uniform weights that are described in Section 6.2 and uniform weights. Each plotted point in Figure 3 is the average ARI over 100 replicates. For Co Co using non-uniform weights, the smoothing parameter γ is chosen with the data-driven extended BIC method that is detailed in Section 7.1. In contrast, for Co Co using uniform weights, γ is chosen as the value that produces the estimator that minimizes the true but unknown MSE. We see that while using uniform weights in Co Co leads to recovering co-clusters exactly once a suﬃcient number of samples have been acquired, using non-uniform weights enables Co Co to recover the co-clusters exactly with notably fewer samples. The results of this experiment are especially remarkable because Co Co using non-uniform weights and a dataadaptive choice of γ outperformed Co Co using uniform weights and an ideally chosen oracle value of γ. As in the case of convex clustering, using non-uniform weights can lead to signiﬁcantly better performance over using uniform weights in practice. We give some explanation for why this is expected in Section 6.3 but leave it to future work to develop theory proving this performance improvement. Nonetheless based on this observation, we employ non-uniform weights in Co Co for the empirical studies presented later in the paper.

6.1 Basic Procedure for Specifying Weights

We ﬁrst describe our basic two step procedure for constructing weights before elaborating on the ﬁnal reﬁnements used in our numerical experiments.

Step 1: We ﬁrst calculate pre-weights wd,ij between the ith and jth mode-d subarrays as

wd,ij = ιk {i,j} exp τd X(d),i: X(d),j: 2 F . (9)

The ﬁrst factor on the right hand side of equation (9), ιk {i,j}, is an indicator function that equals 1 if the jth slice is among the ith slice s k-nearest neighbors (or vice versa) and 0 othewise. The purpose of this term is to control the sparsity of the weights. The corresponding tuning parameter k inﬂuences the connectivity of the mode-d similarity graph. One can explore diﬀerent levels of granularity in the clustering by varying k (Chen et al., 2015). As a default, one can use the smallest k such that the similarity graph is still connected. Note it is not necessary to calculate the exact k-nearest neighbors, which scales quadratically in the number of ﬁbers in the mode. A fast approximation to the k-nearest neighbors is suﬃcient for the sake of inducing sparsity into the weights. Chi and Lange (2015) provided two reasons for using k-nearest neighbor weights. First, we wish to prioritize fusions between pairs of subarrays that are most similar; the subarrays that are most dissimilar should be the last pair of subarrays to fuse as the smoothing parameter γ increases. Second, we wish to use a sparse similarity graph as the computational and storage complexity of the estimation algorithm is proportional to the number of non-zero edges in the similarity graphs (Appendix E). Using k-nearest-neighbors weights accomplishes both goals. The second factor on the right hand side of equation (9) is the Gaussian kernel, which takes on larger values for pairs of mode-d subarrays that are more similar to each other. Chi and Steinerberger (2019) give a detailed theoretical justiﬁcation for using weights like the Gaussian kernel weights in the context of convex clustering. For space considerations,

Chi, Gaines, Sun, Zhou, and Yang

we refer readers interested in these technical details to their work and give a brief intuitive rationale for the employing the Gaussian kernel here. Intuitively, the weights should be inversely proportional to the distance between the ith and jth mode-d subarrays (Chen et al., 2015; Chi et al., 2017). The inverse of the nonnegative parameter τd is a measure of scale. In practice, we can set it to be the median Euclidean distance between the ith and jth mode-d subarrays that are k-nearest neighbors of each other. A value of τd = 0 corresponds to uniform weights. Note that with minor modiﬁcation, we can make the inverse scale parameter to be pair dependent as described in Zelnik-Manor and Perona (2005).

Step 2: To obtain the mode-d weights wd,ij, we normalize the mode-d pre-weights wd,ij to sum to p

nd/n. The normalization step puts the penalty terms Rd(U) on the same scale and ensures that clustering along any given single mode will not dominate the entire co-clustering as γ increases.

6.2 Improving Weights via the Tucker Decomposition

In our preliminary experiments, we found that substituting a low-rank approximation of X, namely a Tucker decomposition X, in place of X in (9) led to a marked improvement in coclustering performance. To understand the boost in performance suppose that X = U + E with U having a checkerbox structure and the entries of E are iid N(0, σ2) for simplicity. Further suppose that the ith and jth mode-d subarrays of U belong to the same partition and ιk {i,j} = 1. Then

wd,ij = exp τd E d ij 2 F = exp 2τdσ2Zd,ij ,

where Z = E d ij 2 F 2σ2 is distributed as a χ2 random variable with nd degrees of freedom. If we were able to perfectly denoise the tensor X so that σ = 0, then the pre-weight wd,ij would be set to its maximal value of 1, the ideal value for wd,ij since we have assumed the ith and jth mode-d subarrays belong to the same partition. Thus, if we can reduce σ2, namely denoise the observed tensor X, we can approach the ideal value of pre-weights. Note that we are more focused with approaching the ideal pre-weight values for pairs of subarrays that belong to the same partition and not concerned with pairs of subarrays in diﬀerent partitions as the Gaussian kernel weights decay very rapidly. The Tucker decomposition is eﬀective at reducing σ2 when U has a checkerbox pattern as the checkerbox pattern is a low-rank tensor that can be eﬀectively approximated with the Tucker decomposition.

Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. In our simulation studies described in Section 8, we use two diﬀerent methods for choosing the rank as a robustness check to ensure our Co Co estimator s performance does not crucially depend on the rank selection method. Details on these two methods can be found in Appendix F. While we found the Tucker decomposition to work well in practice, we suspect that other methods of denoising the tensor may work just as well or could possibly be more eﬀective. We leave it to future work to explore alternatives to the Tucker decomposition.

Provable Convex Co-clustering of Tensors

6.3 Weights and Folded-Concave Penalties

We conclude our discussion on weights by highlighting how they provide a connection between convex clustering and other penalized regression-based clustering methods that use folded-concave penalties (Pan et al., 2013; Xiang et al., 2013; Zhu et al., 2013; Marchetti and Zhou, 2014; Wu et al., 2016a). Suppose we seek to minimize the objective

fγ(u) = 1 2 x u 2 2 + γ

(i,j) Ed ϕd ( Ad,iju 2) , (10)

where each ϕd : [0, ) 7 [0, ) has the following properties: (i) ϕd is concave and diﬀerentiable on (0, ), (ii) ϕd vanishes at the origin, and (iii) the directional derivative of ϕd exists and is positive at the origin. Such ϕd is collectively referred to as a folded-concave penalty; prominent examples of such function include the smoothly clipped absolute deviation (Fan and Li, 2001) or minimax concave penalty (Zhang, 2010). Since ϕd is concave and diﬀerentiable, for all positive z and z

ϕd(z) ϕd( z) + ϕ d( z)(z z). (11)

The inequality (11) indicates that the ﬁrst order Taylor expansion of a diﬀerentiable concave function ϕd provides a tight global upper bound at the expansion point z. Thus, we can construct a function that is a tight upper bound of the function fγ(u)

gγ(u | u) = 1 2 x u 2 2 + γ

(i,j) Ed wd,ij Ad,iju 2 + c, (12)

where the constant c does not depend on u and wd,ij are weights that depend on u, namely

wd,ij = ϕ d ( Ad,ij u 2) . (13)

Note that if we take u to be the vectorization of the Tucker approximation of the data, vec( X), and ϕd(z) to be the following variation on the error function

ϕd(z) = 1 n d P (i,j) Ed wd,ij

0 e τdω2dω,

then the function given in (10) coincides with the Co Co objective using the prescribed Tucker derived Gaussian kernel weights. The function gγ(u | u) is said to majorize the function fγ(u) at the point u (Lange et al., 2000) and minimizing it corresponds to performing one-step of the local linearapproximation algorithm (Zou and Li, 2008; Schifano et al., 2010) which is a special case of the majorization-minimization (MM) algorithm (Lange et al., 2000). The corresponding MM algorithm would consist of repeating the following two steps: (i) using a previous Co Co estimate U to compute weights wd,ij according to (13), and (ii) computing a new Co Co estimate using the new weights. In practice, we have found one-step to be adequate, however. Indeed, Zou and Li (2008) showed that the solution to the one-step algorithm was often suﬃcient in terms of its statistical estimation accuracy.

Chi, Gaines, Sun, Zhou, and Yang

7. Other Practical Issues

In this section, we address other considerations for using the method in practice, namely how to choose the tuning parameter γ and how to recover the partitions along each mode from the Co Co estimator ˆU.

7.1 Choosing γ

The ﬁrst major practical consideration is how to choose γ to produce a ﬁnal co-clustering result. Since co-clustering is an exploratory method, it may be suitable for a user to manually inspect a sequence of Co Co estimators ˆUγ for a range of γ and use domain knowledge tied to a speciﬁc application to select γ to recover a co-clustering assignment of a desired complexity. Since this approach is time consuming and requires expert knowledge, an automated, data-driven procedure for selecting γ is desirable. Cross-validation (Stone, 1974; Geisser, 1975) and stability selection (Meinshausen and B uhlmann, 2010) are popular techniques for tuning parameter selection, but since both methods are based on resampling, they are unattractive in the tensor setting due to the computational burden. We turn to the extended Bayesian Information Criterion (e BIC) proposed by Chen and Chen (2008, 2012), as it does not rely on resampling and thus is not as computationally costly as crossvalidation or stability selection.

e BIC(γ) = n log RSSγ

+ 2dfγ log(n),

where RSSγ is the residual sum of squares X ˆUγ 2 F and dfγ is the degrees of freedom for a particular value of γ. We use the number of co-clusters in the Co Co estimator ˆUγ as an estimate of dfγ, which is consistent with the spirit of degrees of freedom since each cocluster mean is an estimated parameter. This criterion balances between model ﬁtting and model complexity, and a similar version has been commonly employed in tuning parameter selection of tensor data analysis (Zhou et al., 2013; Sun et al., 2017). The e BIC is calculated on a grid of values S = {γ1, γ2, . . . γs}, and we select the optimal γ, denoted γ , which corresponds to the smallest value of the e BIC over S, namely

γ = arg min γ S e BIC(γ).

7.2 Recovering the Partitions along Each Mode

The second major practical consideration is how to extract the partitions from the Co Co estimator ˆU. Recall that the ith and jth mode-d subtensors belong to the same partition if vd,ij = U d ij = 0. Conversely, the ith and jth mode-d subtensors do not belong to the same partition if vd,ij = 0. Thus, a mode-d partition consists of the maximal set of mode-d subarrays such that for any pair i and j in this collection vd,ij = 0. We can automatically identify these maximal sets by extending a simple procedure employed by Chi and Lange (2015) for extracting clusters in the convex clustering problem. Identifying partitions along the dth mode is equivalent to ﬁnding connected components of a graph, where each node corresponds to a subarray along the dth mode, and there is an edge between nodes i and j if and only if vd,ij = 0.

Provable Convex Co-clustering of Tensors

We would like to read oﬀwhich centroids have fused as the amount of regularization increases, namely determine partition assignments as a function of γ. Such assignments can be performed in O(nd) operations, using the diﬀerences variable Vd. We simply apply breadth-ﬁrst search to identify the connected components of the following graph induced by the Vd. The graph identiﬁes a node with every data point and places an edge between the lth pair of points if and only if vl = 0. Each connected component corresponds to a partition. Note that the graph constructed to determine partitions is not the same as the graph described in Section 3 with illustrative examples in Figure 2. We emphasize that the recovered partition along each mode does not depend on the ordering of the input data X, since it is based oﬀof the pairwise diﬀerences along each mode, namely Vd for d = 1, . . . , D. Finally, we note that due to ﬁnite precision limitations, the diﬀerence variables vd,ij will likely not be exactly 0. In Appendix E.4, we detail a simple and principled procedure for ensuring sparsity in these diﬀerence variables.

8. Simulation Studies

To investigate the performance of the Co Co estimator in identifying co-clusters in tensor data, we ﬁrst explore some simulated examples. We compare our Co Co estimator to a kmeans based approach that is representative of various tensor generalizations of the spectral clustering method common in the tensor clustering literature (Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Wu et al., 2016b). We refer to this method as CPD+k-means. The CPD+k-means method (Papalexakis et al., 2013; Sun and Li, 2019) ﬁrst performs a rank-R CP decomposition on the D-way tensor X to reduce the dimensionality of the problem, and then independently applies k-means clustering to the rows of each of the D factor matrix from the resulting CP decomposition. The k-means algorithm has also been used to cluster the factor matrices resulting from a Tucker decomposition (Acar et al., 2006; Sun et al., 2006; Kolda and Sun, 2008; Sun et al., 2009; Kutty et al., 2011; Liu et al., 2013b; Zhang et al., 2013; Cao et al., 2015; Oh et al., 2017). We also considered this Tucker+kmeans method in initial experiments, but its co-clustering performance was inferior to that of CPD+k-means so we only report co-clustering performance results for CPD+k-means in the comparison experiments that follow. Note, however, that we still use the Tucker decomposition to compute Co Co weights wd,ij as described Section 6. Both Co Co and CPD+kmeans account for the multiway structure of the data. To assess the importance of accounting for this structure, we also include comparisons with the Co Te C method (Jegelka et al., 2009), which applied k-means clustering along each mode and does not account for the multiway structure of the data. All methods being compared have tuning parameters that need to be set. For the rank of the CP decomposition needed in CPD+k-means, we consider R {2, 3, 4, 5} and use the tuning procedure in Sun et al. (2017) to automatically select the rank. A CP decomposition is then performed using the chosen rank, and those factor matrices are the input into the k-means algorithm. A well known drawback of k-means is that the number of clusters k needs to be speciﬁed a priori. Several methods for selecting k have been proposed in the literature, and we use the gap statistic developed by Tibshirani et al. (2001) to select an optimal k from the speciﬁed possible values. Since Co Co estimates an entire solution path of mode-clustering results, ranging from nd clusters to a single cluster along mode d,

Chi, Gaines, Sun, Zhou, and Yang

1 2 3 4 5 6 7 8 9 10 Noise Level, σ

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

Figure 4: Checkerbox Simulation Results: Impact of Noise Level. Two balanced clusters per mode across diﬀerent levels of homoskedastic noise for n1 = n2 = n3 = 60. For each method, the conﬁdence interval is calculated as the mean value plus/minus one standard error.

we consider a rather large set of possible k values to make the methods more comparable. Appendix G gives a more detailed description of the CPD+k-means procedure and the selection of its tuning parameters. Co Te C, which applies k-means clustering along each mode independently, also requires specifying the number of cluster along each mode. As in CPD+k-means, we also select this parameter along each mode using the gap statistic. As described in Section 6, we employ a Tucker approximation to the data tensor in constructing weights wd,ij. In computing the Tucker decomposition we used one of two methods for selecting the rank. In the plots within this section, TD1 denotes the results where the Tucker rank was chosen using the SCORE algorithm (Yokota et al., 2017), while TD2 denotes results where the rank was chosen using a heuristic. Detailed discussion on these two methods are in Appendix F. The results presented in this section report the average Co Co estimator performance quantiﬁed by the ARI across 200 simulated replicates. All simulations were performed in Matlab using the Tensor Toolbox (Bader et al., 2015). All the following plots, except the heatmaps in Figure 13, were made using the open source R package ggplot2 (Wickham, 2009).

8.1 Cubical Tensors, Checkerbox Pattern

For the ﬁrst and main simulation setting, we study clustering data in a cubical tensor generated by a basic checkerbox mean model according to Assumption 4.2. Each entry in the observed data tensor is generated according to the underlying model (2) with independent errors ϵi1i2i3 N(0, σ2 r1r2r3). Unless speciﬁed otherwise, there are two true clusters along each mode for a total of eight underlying co-clusters.

8.1.1 Balanced Cluster Sizes and Homoskedastic Noise

To get an initial feel for how the diﬀerent co-clustering methods perform at recovering the true underlying checkerbox structure, we ﬁrst consider a situation where the clusters corresponding to the two classes along each mode are all equally-sized, or balanced, and

Provable Convex Co-clustering of Tensors

104 105 106 n = n1n2n3

Co Co CPD + kmeans

Figure 5: Timing Results: Balanced Cluster Size and Homoskedastic Noise. Two balanced clusters per mode with a ﬁxed level of homoskedastic noise for n1 = n2 = n3 = 20, 30, 60, and 100. Vertical and horizontal axes are on a log scale.

share the same error variance, namely σr1r2r3 = σ for all r1, r2, and r3. The average coclustering performance for this setting in a tensor with dimensions n1 = n2 = n3 = 60 are given in Figure 4 for diﬀerent noise levels. Figure 4 shows that all three methods perform well when the noise level is low (σ = 1). As the noise level increases, however, CPD+k-means experiences an immediate and noticeable drop oﬀin performance. Co Te C s performance decays even more rapidly highlighting the importance of accounting for multiway structure. The Co Co estimator, on the other hand, is able to maintain near-perfect performance until the noise level becomes rather high (σ = 8). Figure 5 shows how the run times of Co Co and CPD+k-means vary as the size of a cubic tensor, n = n1n2n3 with n1 = n2 = n3 takes on the values 203, 303, 603, and 1003. These run times include all computations needed to ﬁt and select a ﬁnal model. For Co Co, a sequence of models were ﬁt over a grid of γ parameters, and a ﬁnal γ parameter was chosen using the e BIC. For CPD+k-means, a sequence of models were ﬁt over a grid of possible (k1, k2, k3) parameters corresponding to the 3 factor matrices, and a ﬁnal triple of (k1, k2, k3) parameters were chosen using the gap statistic. Timing comparisons were performed on a 3.2 GHz quad-core Intel Core i5 processor and 8 GB of RAM. The run time for Co Co scales linearly in the size of the data tensor as expected, namely proportionately with n3 1. Nonetheless, as also might be expected, the clustering performance enjoyed by Co Co does not come for free, and the simpler but less reliable CPD+k-means algorithm enjoys a better scaling as the tensor size grows. Timing results were similar for the following experiments and are omitted for space considerations.

8.1.2 Imbalanced Cluster Sizes

When comparing clustering methods, one factor of interest is the extent to which the relative sizes of the clusters impact clustering performance. To investigate this, we again use a cubical tensor of size n1 = n2 = n3 = 60 but introduce diﬀerent levels of cluster size imbalance along each mode, which we quantify via the ratio of the number of samples in cluster 2 of mode d and the total number of samples along mode d, for d = 1, 2, 3. Figure 6a

Chi, Gaines, Sun, Zhou, and Yang

0.5 0.4 0.3 0.2 0.1 Size Ratio, nd2 nd

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(a) Low Noise

0.5 0.4 0.3 0.2 0.1 Size Ratio, nd2 nd

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(b) High Noise

Figure 6: Checkerbox Simulation Results: Impact of Cluster Size Imbalance. Two imbalanced clusters per mode with either low or high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

shows that when the noise level is low, CPD+k-means is unaﬀected by the imbalance until the size of cluster 2 is less than 30% of the mode s length. At this point, the performance of CPD+k-means drops oﬀsigniﬁcantly and it performs as well as a random clustering assignment when the sizes are highly skewed (nd2/nd = 0.1). The Co Co estimator is more or less invariant to the imbalance, and its performance is almost perfect across all levels of cluster size imbalance. Figure 6b shows that the Co Co estimator exhibits a slight deterioration in performance only when the cluster size ratio is 0.1 in the high noise case. In both low and high noise scenarios, Co Te C performs poorly.

8.1.3 Heteroskedastic Noise

Another factor of interest is how the clustering methods perform when there is heteroskedasticity in the variability of the two classes. Figure 7 displays the co-clustering performance for diﬀerent degrees of heteroskedasticity, as measured by the standard deviation for class 2 relative to class 1 s standard deviation, σ2/σ1. In the low noise setting, the Co Co estimator is immune to the heteroskedasticity until the noise levels diﬀer by a factor of 4. CPD+kmeans in contrast is very sensitive to a deviation from homoskedasticty, experiencing a decline even when the noise ratio increases from 1 to only 1.5. The Co Co estimator fares worse in the high noise setting and also has a drop in performance with a small deviation from homoskedasticty. Once class 2 s standard deviation is more than double the standard deviation for class 1, all three methods are essentially the same as random clustering. This result is not terribly surprising since, in the high noise setting, this would result in one class having a very high standard deviation of σ2 = 12. In both low and high noise scenarios, Co Te C performs poorly.

Provable Convex Co-clustering of Tensors

1 1.5 2 2.5 3 4 Noise Ratio, σ2 σ1

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(a) Low Noise

1 1.5 2 2.5 3 4 Noise Ratio, σ2 σ1

G GG G GGGG 0.00

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(b) High Noise

Figure 7: Checkerbox Simulation Results: Impact of Heteroskedasticity. Two balanced clusters per mode with either low or high heteroskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ1 = 3 while high noise refers to σ1 = 6.

8.1.4 Different Clustering Structures

So far, we have considered only a simple situation where there are exactly two true clusters along each mode, for a total of eight triclusters. Another factor of practical importance is how the clustering methods perform when there are more than two clusters per mode, and also when the number of clusters along each mode diﬀers. We investigate both of these settings in this section. As before, the tensor is a perfect cube with n1 = n2 = n3 = 60 observations along each mode and an underlying checkerbox pattern. To gauge the performance, we again focus the attention on how the methods perform in the presence of both low and high noise.

Low High Low High (3, 3, 3) (2, 3, 4) Noise: Clusters:

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Tec

Figure 8: Checkerbox Simulation Results: Impact of Clustering Structure. Diﬀerent balanced clusters per mode with either low or high homoskedastic noise for n1 = n2 = n3 = 60. Low noise corresponds to σ = 3 while high noise refers to σ = 6.

Chi, Gaines, Sun, Zhou, and Yang

The ﬁrst situation studied is one in which there are three true clusters along each mode, resulting in a total of 27 triclusters. The left hand side of the graphs in Figure 8 show the results from this simulation setting. The graphs show that Co Co estimator consistently outperforms CPD+k-means and Co Te C in this setting across both noise levels. The Co Co estimator is able to recover the true co-clusters almost perfectly, while CPD+k-means struggles to handle the increased number of clusters per mode. We also investigated the clustering performance when the number of clusters per mode varies. In this setting, there are two, three, and four clusters along modes one, two, and three, respectively. From the right hand side of the graphs in Figure 8, we can see that the results are similar to the situation with three clusters per mode. CPD+k-means again performs very poorly across both noise levels, while convex co-clustering is again able to essentially recover the true co-clustering structure. Compared to the setting with three clusters per mode, CPD+k-means performs slightly worse in the face of a more complex clustering structure, while convex co-clustering is able to handle it in stride. These results bode well for convex co-clustering as the basic clustering structure of only two clusters per mode is unlikely to be observed in practice.

8.2 Rectangular Tensors

Up to this point, to get an initial feel for Co Co s performance, we restricted our attention to cubical tensors with the same number of observations per mode so as to avoid changing too many factors at once. It is unlikely that the data tensor at hand will be a perfect cube, however, so it is important to understand the clustering performance when the methods are applied to rectangular tensors. Now we turn to cluster a rectangular tensor with one short mode and two longer modes. Two additional simulations involving rectangular tensors can be found in Appendix H. Figure 9 shows that Co Co performs very well and better than CPD+k-means and Co Te C at the lower noise level (σ = 3) but has a sharp decrease in ARI at the higher noise level (σ = 4). The decline is more pronounced for the longer modes (Figure 9b and Figure 9a) as the short mode (Figure 9a) is still able to maintain perfect performance despite the increase in noise. This is not surprising, since the shorter mode has eﬀectively more samples. Moreover, we see the blessing of dimensionality at work when the number of samples along the short mode are doubled (n1 = 20, n2 = n3 = 50), the performance along the two longer modes improves drastically in the high noise setting. We ﬁnally note that, along the shorter mode, the use of the heuristic in determining the rank of the Tucker decomposition for calculating the weights performs better than the SCORE algorithm method along modes 1 and 2, though ultimately the co-clustering performance is comparable. This may indicate that the SCORE algorithm struggles to correctly identify the optimal Tucker rank for short modes in the presence of relatively higher noise, while the heuristic is more immune to the noise level as it is based simply on the dimensions of the tensor.

8.3 CANDECOMP/PARAFAC Model

In Section 8.1, we saw that the Co Co estimator performs well and typically better than CPD+k-means when clustering tensors whose co-clusters have an underlying checkerbox

Provable Convex Co-clustering of Tensors

3 4 3 4 (10, 50, 50) (20, 50, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(a) Adjusted Rand Index, Mode 1

3 4 3 4 (10, 50, 50) (20, 50, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(b) Adjusted Rand Index, Mode 2

3 4 3 4 (10, 50, 50) (20, 50, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(c) Adjusted Rand Index, Mode 3

3 4 3 4 (10, 50, 50) (20, 50, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(d) Adjusted Rand Index, Triclusters

Figure 9: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with one short mode and two longer modes. Average adjusted rand index plus/minus one standard error for diﬀerent noise levels and mode lengths.

pattern. To evaluate the performance of our Co Co estimator under model misspeciﬁcation, we consider the generative model as the following CP decomposition model. We ﬁrst construct the factor matrix A R80 2 and construct the following rank-2 CP means tensor

i=1 ai ai ai,

where denotes the outer product. We then added varying levels of Gaussian noise to the U to generate the observed data tensor. We consider two diﬀerent types of factor matrices. As shown in Figure 10, one shape consists of two half-moon clusters (Hocking et al., 2011; Chi and Lange, 2015; Tan and Witten, 2015) while the other shape contains a bullseye, similar to the two-circles shape studied by Ng et al. (2002) and Tan and Witten (2015). In either case, the triangles in Figure 10 correspond to the ﬁrst 40 rows of A, whereas the

Chi, Gaines, Sun, Zhou, and Yang

(a) Bullseye

G G G G G G G G G G G G GGGGGGGGG G G G G G G G G G G G G G G G G G G G

(b) Half Moons

Figure 10: Factor Matrices for the CP Models.

circles correspond to the second 40 rows of A. Note that this data generating mechanism should favor the CPD+k-means method. Figure 11 shows the simulation results for using the CP model with these two nonconvex shapes generating the data. The discrepancy in performance between the Co Co estimator and the other two methods is quite large. The Co Co estimator almost perfectly identiﬁes the true co-clusters. In contrast, both CPD+k-means and Co Te C perform very poorly, even when the noise variance is small. The poor performance of CPD+k-means and Co Te C are not completely surprising as other have noted the diﬃculty that k-means methods have in recovering non-convex clusters (Ng et al., 2002; Hocking et al., 2011; Tan and Witten, 2015). These results give us some assurances that the Co Co estimator is able to still perform well even under some model misspeciﬁcation since the true co-clusters do not have a checkerbox pattern.

8.4 Comparison with Convex Biclustering

It is natural to ask how much additional gain there is in using Co Co over convex biclustering (Chi et al., 2017) on the matricizations of a data tensor. To answer this question, we compare Co Co to the following strategy for applying convex biclustering to estimate coclusters. We explain the strategy for a 3-way tensor; the generalization to D-way tensors is straightforward. We ﬁrst matricize the tensor X along mode-1 to obtain the matrix X(1), apply convex biclustering on X(1), and retain the mode-1 clustering results. Note that the mode-2 and mode-3 ﬁbers have been mixed together through the matricization process. We then repeat the two-step procedure for mode-2 and mode-3. The ﬁnal cocluster estimates are obtained by taking the cross-products of the mode-1, mode-2, and mode-3 cluster assignments. We consider two illustrative scenarios to understand the value of preserving the full multiway structure with Co Co: a balanced case and imbalanced case. In the balanced

Provable Convex Co-clustering of Tensors

Bullseye Half Moons Data Generation

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

Figure 11: CP Model Simulation Results. Two balanced clusters per mode with low homoskedastic noise for n1 = n2 = n3 = 40. Bullseye and Half Moons refer to the shape embedded in the factor matrices used to generate the true tensor.

1 2 3 4 5 6 7 8 9 Noise Level, σ

GGGG GGGG G G G GGGG GGG G

Adjusted Rand Index

Co Co TD1 Co Co TD2 Biclust TD1 Biclust TD2

(a) Balanced

1 2 3 4 5 6 7 8 9 Noise Level, σ

GG GG GG GG GG

Adjusted Rand Index

Co Co TD1 Co Co TD2 Biclust TD1 Biclust TD2

(b) Imbalanced

Figure 12: A Comparison between Co Co and Convex Biclustering Average Adjusted Rand Index plus/minus one standard error for diﬀerent noise levels.

case, we have a 3-way data tensor X R60 60 60 with two clusters along each mode, where clusters are of equal size and homoskedastic iid Gaussian noise has been added to all elements of the tensor. This scenario is similar to the one shown in Figure 4. In the imbalanced case, we have a 3-way data tensor X R30 40 80. There are two clusters along mode-1 of sizes 10 and 20, three clusters along mode-2 of sizes 8, 12, and 20, and four clusters along mode-3 of sizes 5, 10, 20, and 45. Homoskedastic iid Gaussian noise has been added to all elements of the tensor. Finally, we note that the empirical performance of convex biclustering, like that of Co Co s, depends on choosing good weights for the rows and columns of the input data matrix (Chi et al., 2017). To create a fair comparison, we construct convex biclustering weights based oﬀof the same TD1 and TD2 denoising procedure used for Co Co, putting the preprocessing for both methods on equal footing.

Chi, Gaines, Sun, Zhou, and Yang

Figure 12a and Figure 12b show the co-clustering performance of Co Co and the convex biclustering method in the balanced and imbalanced cases respectively. We see that in the balanced case, Co Co s performance is marginally better than that of the convex biclustering method. On the other hand, we see that in the imbalanced case, Co Co s performance degrades more gracefully than that of the convex biclustering method as the noise level increases. The example illustrates that Co Co has better co-cluster recovery when there is more imbalance in the data tensor - the aspect ratios of the tensor dimensions are more skewed and the number of clusters and the cluster sizes are more heterogenous.

The key formulation diﬀerence between Co Co and the convex biclustering method that provides some insight into these two results is that Co Co imposes a ﬁner level of smoothness that respects the multi-way structure in the data tensor. Imposing such ﬁner level of smoothness imparts greater robustness in the presence of increasing noise to recovering the smaller co-clusters in the imbalanced scenario. An added incentive for using Co Co and preserving the multiway structure in the data is that the gains in co-cluster recovery over the convex biclustering method do not come at a greater computational cost. Note that the computational complexity of convex biclustering is O(n), using sparse weights for the row and column similarity graphs. For a D-way tensor, the computational complexity then becomes O(Dn), which is the same as the computational complexity of Co Co applied directly on the D-way tensor.

To summarize, in comparison to the convex biclustering method, Co Co (i) does not come at additional computational costs, (ii) can recover underlying co-clustering structure in imbalanced scenarios which are more likely to be encountered in practice, and (iii) has the ability to consistently recover an underlying co-clustering structure according to Theorem 9, with even a single tensor sample, which is a typical case in real applications. Since this phenomenon does not exist in vector or matrix variate cluster analysis, the convex biclustering method lacks this theoretical guarantee.

9. Real Data Application

Having studied the performance of the Co Co estimator in a variety of simulated settings, we now turn to using the Co Co estimator on a real data set. The proprietary data set comes from a major online company and contains the click-through rates for advertisements displayed on the company s webpages from May 19, 2016 through June 15, 2016. The clickthrough rate is the number of times a user clicks on a speciﬁc advertisement divided by the number of times the advertisement was displayed. The data set contains information on 1000 users, 189 advertisements, 19 publishers, and 2 diﬀerent devices, aggregated across time. Thus, the data forms a fourth-order tensor where each entry in the tensor corresponds to the click-through rate for the given combination of user, advertisement, publisher, and device. Here a publisher refers to a diﬀerent webpage within the online company s website, such as the main home page versus a page devoted to either breaking news or sports scores. The two device types correspond to how the user accessed the page, using either a personal computer or a mobile device such as a cell phone or tablet computer. The goal in this real application is to simultaneously cluster users, advertisements, and publishers to improve user behavior targeting and advertising planning.

Provable Convex Co-clustering of Tensors

In the click-through rate tensor data, over 99% of the values are missing since one user likely has seen only a handful of the possible advertisements. If a speciﬁc advertisement is never seen by a user, it is considered as a missing value. Since the proposed Co Co estimator can only handle complete data, we ﬁrst preprocess the data by imputing the missing values before any clustering can be done. To impute the missing entries, we use the CP-based tensor completion method Jain and Oh (2014) and tune its rank via the information criterion proposed by Sun et al. (2017). This tuning method chooses the optimal rank as R = 20 from the rank list {1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22}. Finally, the imputed values are truncated to ensure all the values of the tensor are within 0 and 1 since click-through rates are proportions.

One mode of the fourth-order tensor has only two observations and those observations already have a natural grouping (device type). Therefore, for the sake of clustering we analyze the devices separately. We compare our method with CPD+k-means. Furthermore, the tuning parameter for convex co-clustering is automatically selected using the e BIC (Section 7.1) while the number of clusters in CPD+k-means is chosen via the gap statistic (Tibshirani et al., 2001). We do not include comparisons with Co Te C given its poor performance in the simulation experiments.

We ﬁrst look at the clustering results from clustering the click-through rates for users accessing the advertisements through a personal computer (PC). Table 1 contains the number of clusters identiﬁed as well as the sizes of the clusters, while Figure 13a visualizes the advertisement-by-publisher biclusters for a randomly selected user. As to be expected, the advertisement-by-publisher slices display a checkerbox pattern, which turns into a checkerbox pattern when the slices are meshed together. The clustering results for the users are omitted in this paper to ensure user privacy. However, co-clustering the tensor does not result in the loss of information that would occur if the tensor was converted into a matrix by averaging across users or ﬂattening along one of the modes. Table 1 and Figure 13a show that the Co Co estimator identiﬁes four advertisement clusters, with one cluster being much bigger than the others. The advertisements in this large cluster have click-through rates that are close to the grand average in the data set. One of the small clusters has very low click-through rates, while the other two clusters tend to have much higher click-through rates than the rest of the advertisements. On the other hand, CPD+k-means clusters the advertisements into 57 groups, which is less-useful from a practical standpoint. Many of the clusters are similarly-sized and contain only a few advertisements, likely due to the inability of CPD+k-means to handle imbalanced cluster sizes as was observed in the simulation experiments (Section 8.1.2). In terms of the publishers, the Co Co estimator identiﬁes 3 clusters while CPD+k-means does not ﬁnd any underlying grouping and simply identiﬁes one big cluster, which again is not terribly useful (Table 1). We next provide some interpretations of the obtained clustering results of the publishers. One way online advertisers can reach more users is by entering agreements with other companies to route traﬃc to the advertiser s website. For example, Google and Apple have a revenue-sharing agreement in which Google pays Apple a percentage of the revenue generated by searches on i Phones (Mc Garry, 2016). Similarly, the online company being studied partners with several internet service providers (ISPs) to host the defaut home pages for the ISP s customers. It would make sense that these slightly diﬀerent variants of the online company s main home

Chi, Gaines, Sun, Zhou, and Yang

Co Co Estimator CPD+kmeans

Advertisements Publisher Advertisements Publisher

Device # of clusters Cluster Sizes # of clusters Cluster Sizes # of clusters # of clusters

PC 4 (156, 22, 8, 3) 3 (4, 3, 12) 57 1 Mobile 3 (145, 22, 22) 2 (7, 12) 49 13

Table 1: Advertising Data Clustering Results

page would have similar click-through rates, and the Co Co estimator in fact assigned these variants into the same cluster. For users accessing the advertisements through a mobile device, such as a mobile phone or tablet computer, the Co Co estimator results for the advertisements are largely similar to the results for PCs (Table 1 and Figure 13b). There is one large cluster that contains click-through rates similar to the overall average, while the two other equally-sized clusters have relatively very low or very high click-through rates, respectively. The underlying click-through rates for the PC data have more variability than the mobile data, which is consistent with the identiﬁcation of an additional cluster for the PC data. As before, CPD+k-means ﬁnds a large number of advertisement clusters, most of which are roughly the same size, again likely impacted by the imbalance in the cluster sizes. When compared to the personal computer device, one diﬀerence is that the cluster with the higher click-through rates for mobile devices is larger and has a higher average click-through rate than the similar clusters for the personal computer device. This ﬁnding is consistent with research by the Pew Research Center that found that click-through rates for mobile devices are higher than for advertisements viewed on a personal computer or laptop (Mitchell et al., 2012). It is also enlightening to take a closer look at the underlying advertisements clustered across the two devices. All of the advertisements clustered in the high click-through rate cluster for the mobile devices are in the average click-through rate cluster for personal computers. In taking a closer look at the ads in these clusters, there are several ads related to online shopping for personal goods, such as jeans, workout clothes, or neck ties. It makes sense to shop for these types of goods using a mobile device, such as while at work when it is not appropriate to do so on a work computer. Conversely, all of the advertisements in either of the two higher PC click-through rate clusters are in the large, average click-through rate cluster for the mobile devices. There are several ﬁnancial-related ads in these two PC clusters, such as for mortgages or general investment advice. On the other hand, there are not many online shopping ads in those clusters, with the exception of more expensive technology-related goods that one may want to invest more time in researching before making a purchase. In terms of the publisher clusters on mobile device, Table 1 shows that the Co Co estimator identiﬁes two clusters of publishers while CPD+k-means identiﬁes 13 small clusters. Contrary to the advertisement clusters, the publisher clusters across both devices are very similar. In fact, the only diﬀerence is that the smaller cluster for the mobile device, which contains seven publishers, is split into two clusters for personal computers. This can be

Provable Convex Co-clustering of Tensors

(a) Personal Computers (b) Mobile

Figure 13: Advertisement and Publisher Click-Through Rate Biclusters for a Randomly Selected User. The rows correspond to diﬀerent advertisements and the columns correspond to diﬀerent publishers. Darker blue corresponds to higher click-through rates for a given device.

seen in the click-through rate heatmaps given in Figure 13 in looking at the right part of each heatmap. The publishers in these smaller clusters have higher click-through rates on average than those in the larger cluster. Additionally, ﬁve of the seven (71%) publishers in the high click-through rate clusters have stand-alone apps that display ads, while only three of the twelve (25%) publishers in the larger cluster do. For mobile devices, it has been observed that in-app advertisements have higher click-through rates and browser-based ads (Hof, 2014). We conjecture that this is also true for personal computer apps, which is consistent with the clustering results. Thus it again appears that the clusters identiﬁed by Co Co also make sense practically.

10. Discussion

In this paper, we formulated and studied the problem of co-clustering of tensors as a convex optimization problem. The resulting Co Co estimator enjoys features in theory and practice that are arguably lacking in existing alternatives, namely statistical consistency, stability guarantees, and an algorithm with polynomial computational complexity. Through a battery of simulations, we observed that the Co Co estimator can identify co-clustering structures under realistic scenarios such as imbalanced co-cluster sizes, imbalanced number of clusters along each mode, heteroskedasticity in the noise distribution associated with each co-cluster, and even some violation of the checkerbox mean tensor assumption. We have leveraged the power of the convex relaxation to engineer a computationally tractable co-clustering method that comes with statistical guarantees. These beneﬁts, however, do not come for free. The Co Co estimator incurs similar costs that using the lasso incurs as a surrogate for a cardinality constraint or penalty. It is well known that the lasso leads to parameter estimates that are shrunk towards zero. This shrinkage toward zero

Chi, Gaines, Sun, Zhou, and Yang

is the price for simultaneously estimating the support, or locations of the nonzero entries, in a sparse vector as well as the values of the nonzero entries. In the context of convex co-clustering, the Co Co estimator ˆU is shrunk towards the tensor X, namely the tensor whose entries are all equal to the average over all entries of X. The weights, however, play a critical role in reducing this bias. In fact, the weights can be seen as serving the same role as weights used in the adaptive lasso (Zou, 2006). There are several possible extensions and open problems that have been left for future work. First, we note that there is a gap between what our theory predicts and what seems possible from our experiments. Speciﬁcally, Theorem 9 assumes uniform weights for each mode, yet simulation experiments indicate that the Co Co estimator using Tucker derived Gaussian kernel weights (9) can signiﬁcantly outperform the Co Co estimator using uniform weights. One open problem is to derive prediction error bounds that relax the uniform weights assumption. Second, although we have developed automatic methods for constructing the weights that work well empirically, other approaches to constructing the weights is a direction of future search. For example, other tensor approximation methods, such as the use of the ℓ1-norm to make the decomposition most robust to heavy tail noise as done by Cao et al. (2015), could possibly improve the quality of the weights. Third, in this paper we have focused on additive noise that is a zero-mean M-concentrated random variable. Real data, however, may not follow such a distribution motivating coclustering procedures that can handle outliers. To address potential robustness issues, the Co Co framework could be extended to handle outliers by swapping the sum of squared residuals term in (5) with an analogous Huber loss or Tukey s Biweight function. Finally, while our ﬁrst order algorithm for co-clustering tensors scales linearly in the size of the data, data tensors inevitably will only increase in size motivating the need for more scalable algorithms for computing the Co Co estimator. A natural approach would be to adopt an existing distributed version of the proximal methods, such as one the methods proposed by Combettes and Pesquet (2011), Chen and Ozdaglar (2012), Li et al. (2013), or Eckstein (2017). Another natural approach would be to investigate if stochastic versions of the recently proposed generalized dual gradient ascent (Ho et al., 2019) could be adapted to compute the Co Co estimator. Additionally, in practice many data tensors that we would like to co-cluster may be very sparse. The ﬁrst order algorithm presented here assumes the data tensor is dense. Consequently, an important direction of future work is to investigate alternative optimization algorithms that could leverage the sparsity structure within a data tensor.

Acknowledgments

The authors thank Xu Han for his help with the simulation experiments during the revision of this work. The authors also thank the action editor and three reviewers for their helpful comments and suggestions which led to a much improved presentation. Eric Chi acknowledges support from the National Science Foundation (DMS-1752692) and National Institutes of Health (R01GM135928). Will Wei Sun acknowledges support from the Ofﬁce of Naval Research (ONR N00014-18-1-2759). Hua Zhou acknowledges support from

Provable Convex Co-clustering of Tensors

the National Institutes of Health (R01GM053275 and R01HG006139). Finally, this research collaboration was partially funded by the National Science Foundation under grant DMS-1127914 (the Statistical and Applied Mathematical Sciences Institute). Any opinions, ﬁndings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reﬂect the views of the National Science Foundation, the National Institutes of Health, or the Oﬃce of Naval Research.

Appendix A. Tensor Decompositions

We review two basic tensor decompositions that generalize the singular value decomposition (SVD) of a matrix: (i) the CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang, 1970; Harshman, 1970) and (ii) the Tucker decomposition (Tucker, 1966). Just as the SVD can be used to construct a lower-dimensional approximation to a data matrix, these two decompositions can be used to construct a lower dimensional approximation to a D-way tensor X Rn1 n2 n D The CP decomposition aims to approximate X by a sum of rank-one tensors, namely

i=1 a(1) i a(2) i a(D) i ,

where represents the outer product and a(d) i is the ith column of the dth factor matrix A(d) Rnd R. The positive integer R denotes the rank of the approximation. For suﬃciently large R, we can exactly represent X with a CP decomposition. The Tucker decomposition aims to approximate X by a core tensor H RR1 R2 RD multiplied by factor matrices along each of its modes, namely

X H 1 A(1) 2 A(2) 3 D A(D) =

i D=1 hi1i2 i Da(1) i1 a(2) i2 a(D) i D ,

where a(d) id is the idth column of the dth factor matrix A(d) Rnd Rd. Typically the columns of A(d) are computed to be orthonomal and can be interpreted as principal components or basis vectors for the dth mode. For suﬃciently large R1, . . . , RD, we can exactly represent X with a Tucker decomposition.

Appendix B. Proofs of Smoothness Properties

B.1 Proof of Proposition 4

Without loss of generality, we can absorb γ into the weights matrices. Thus, we seek to show the continuity of ˆU with respect to (X, W1, . . . , WD). We use the following compact representation of the weights

w = (vec(W1)T, vec(W2)T, . . . , vec(WD))T R PD d=1 (nd 2 ).

We check to see if the solution ˆU is continuous in the variable ζ = (x T, w T)T. It is easy to verify that the following function is jointly continuous in U and ζ

f(U, ζ) = 1 2 X U 2 F + R(U, w),

Chi, Gaines, Sun, Zhou, and Yang

i<j wd,ij U d d,ij F

is a convex function of U that is continuous in (U, w). Let

U (ζ) = arg min U f(U, ζ).

Since f(U, ζ) is strongly convex in U, the minimizer U (ζ) exists and is unique. We proceed with a proof by contradiction. Suppose U (ζ) is not continuous at a point ζ. Then there exists an ϵ > 0 and a sequence {ζ(m)} converging to a limit ζ such that U(m) U (ζ) F ϵ for all m where

U(m) = arg min U f(U, ζ(m)).

Since f(U, ζ) is strongly convex in U, the minimizer U(m) exists and is unique. Without loss of generality, we can assume ζ(m) ζ F 1. This fact will be used later in proving the boundedness of the sequence U(m). If U(m) is a bounded sequence, then we can pass to a convergent subsequence with limit U. Fix an arbitrary point U. Note that f(U(m), ζ(m)) f( U, ζ(m)) for all m. Since f is continuous in (U, ζ), taking limits gives us the inequality

f( U, ζ) f( U, ζ).

Since U was selected arbitrarily, it follows that U = U (ζ), which is a contradiction. It only remains for us to show that the sequence U(m) is bounded. Consider the function

g(U) = sup ζ: ζ ζ F 1

1 2 X U 2 F + R w(U).

Note that g is convex, since it is the point-wise supremum of a collection of convex functions. Since f(U, ζ(m)) g(U) and f is strongly convex in U, it follows that g(U) is also strongly convex and therefore has a unique global minimizer U such that g(U ) < . It also follows that

f(U(m), ζ(m)) f(U , ζ(m)) g(U ) (14)

for all m. By the reverse triangle inequality it follows that

U(m) F X(m) F 2 1 2 U(m) X(m) 2 F f(U(m), ζ(m)). (15)

Combining the inequalities in (14) and (15), we arrive at the conclusion that

U(m) F X(m) F 2 g(U ),

for all m. Suppose the sequence U(m) is unbounded, namely U(m) F . But since X(m)

converges to X, the left hand side must diverge. Thus, we arrive at a contradiction if U(m)

is unbounded.

Provable Convex Co-clustering of Tensors

B.2 Proof of Proposition 5

First suppose that U(d) = 1c T, namely all the mode-d subarrays of U are identical. Recall that Z = U d A if and only if Z(d) = AU(d). Therefore, Rd(U) = 0 since d,ij1c T = 0 for all (i, j) Ed. Now suppose that Rd(U) is zero. Take an arbitrary pair (i, j) with i < j. By Assumption 4.1, there exists a path i k l j along which the weights are positive. Let w denote the smallest weight along this path, namely w = min{wd,ik, . . . , wd,lj}. By the triangle inequality

U d d,ij F U d d,ik F + + U d d,lj F.

We can then conclude that

w U d d,ij F Rd(U) = 0.

It follows that e T i U(d) = e T j U(d), since w is positive. Since the pair (i, j) is arbitrary, it follows that all the rows of U(d) are identical or in other words, U(d) = 1c T for some c Rn d.

B.3 Proof of Proposition 6

We will show that there is a γmax such that for all γ γmax, the grand mean tensor X is the unique global minimizer to the primal objective (4). We will certify that X is the solution to the primal problem by showing that the optimal value of a dual problem, which lower bounds the primal, equals Fγ( X). Note that the Lagrangian dual given in (28) is a tight lower bound on Fγ(U).

2 ATλ 2 2 + λ, Ax .

For suﬃciently large γ, the solution to the dual maximization problem coincides with the solution to the unconstrained maximization problem

2 ATλ 2 2 + λ, Ax ,

whose solution is λ = AAT Ax. Plugging λ into the dual objective gives an optimal value of

1 2 AT AAT Ax 2 2 = 1 2 x I AT AAT A x 2 2.

Note that h I AT AAT A i is the projection onto the orthogonal complement of the

column space of AT, which is equivalent to the null space or kernel of A, denoted Ker(A). We will show below that Ker(A) is the span of the all ones vector. Consequently,

I AT AAT A x = 1 n x, 1 1.

Chi, Gaines, Sun, Zhou, and Yang

Note that the smallest γ such that λ Cγ is an upper bound on γmax. We now argue that Ker(A) is the span of 1 Rn. We rely on the following fact: If Φd is an incidence matrix of a connected graph with nd vertices, then the rank of Φd is nd 1 (Deo, 1974, Theorem 7.2). According to Assumption 4.1, the mode-d graphs are connected; it follows that Φd { 1, 0, 1}|Ed| nd has rank nd 1. It follows then that Ker(Φd) has dimension one. Furthermore, since each row of Φd has one 1 and one 1, it follows that 1 Ker(Φd) Rnd. A vector z Ker(A) if and only if z Ker(Ad) for all d. Recall that the rank of the Kronecker product A B is the product of the ranks of the matrices A and B. This rank property of Kronecker products of matrices implies that the dimension of Ker(Ad) equals n d. Let bi = 1n D 1nd+1 ei 1nd 1 1n1 where 1p Rp is the vector of all ones and ei Rnd is the ith standard basis vector. Then that the set of vectors B = {b1, b2, . . . , bnd} forms a basis for Ker(Ad). Take an arbitrary element from Ker(Ad), namely a vector of the form 1n a 1n , where n = QD j=d+1 nj and n = Qd 1 j=1. We will show that in order for 1n a 1d Ker(I Φd), a must be a multiple of 1nd. Consider the relevant matrix-vector product

= (1n D 1nd+1 Φda 1nd 1 1n1).

Therefore, Ad

1n a 1n = 0 if and only if Φda = 0. But the only way for Φda to be

zero is for a = c1nd for some c R. Thus, Ker(A) is the span of 1n.

B.4 Proof of Proposition 7

Note that ˆU is the proximal mapping of the closed, convex function

Then ˆU is ﬁrmly nonexpansive in X (Combettes and Wajs, 2005, Lemma 2.4). Finally, ﬁrmly nonexpansive mappings are nonexpansive, which completes the proof.

Appendix C. Proof of Theorem 9

We ﬁrst prove some auxiliary lemmas before proving our prediction error result.

C.1 Auxiliary Lemmas

The following lemma considers the concentration of a random quadratic form y TBy for a M-concentrated random vector y and a deterministic matrix B (Vu and Wang, 2015). It can be viewed as a generalization of the standard Hanson and Wright inequality for the quadratic forms of independent sub-Gaussian random variables (Hanson and Wright, 1971).

Lemma 11 Let y Rn be a M-concentrated random vector, see Deﬁnition 8. Then there are constants C, C > 0 such that for any matrix B Rn n,

P |y TBy tr(B)| t C log(n) exp C M 2 min t2

B 2 F log(n), t B 2

Provable Convex Co-clustering of Tensors

The next lemma studies the properties of the matrix Ad,ij, deﬁned in (6), in the penalty function. Denote Sd as the matrix constructed by concatenating Ad,ij, i < j vertically. That is,

Sd = AT d,12 AT d,13 . . . AT d,nd 1nd T R[(nd 2 )n d] n. (16)

Lemma 12 For each d = 1, . . . , D, the rank of the matrix Sd is (nd 1)n d. Denote σmin(Sd) and σmax(Sd) as the minimum non-zero singular value and maximum singular value of Sd, respectively. We have σmin(Sd) = σmax(Sd) = nd.

The proof of Lemma 12 follows from Lemma 1 in Tan and Witten (2015) and is omitted. According to Lemma 12, we can construct a singular value decomposition of Sd = UdΛd VT d , where Ud R[(nd 2 )n d] (nd 1)n d, Λd R(nd 1)n d (nd 1)n d, and Vd Rn (nd 1)n d. Denote

Gd = UdΛd R[(nd 2 )n d] (nd 1)n d, (17)

and its pseudo-inverse as G d R(nd 1)n d [(nd 2 )n d]. The following lemma studies the properties of Gd and G d, for each d = 1, . . . , D.

Lemma 13 For each d = 1, . . . , D, the rank of the matrix Gd is (nd 1)n d. The minimal non-zero singular value and maximal singular value of Gd are σmin(Gd) = σmax(Gd) = nd. Moreover, σmin(G d) = σmax(G d) = 1/ nd.

Lemma 13 follows directly from the conclusions in Lemma 12.

C.2 Proof of Main Theorem

We ﬁrst reformulate our optimization problem via a decomposition approach to simplify the theoretical analysis. Such strategy was developed in Liu et al. (2013a) and has been successfully applied in Tan and Witten (2015); Wang et al. (2018). Denote γd = γ/nd. Our convex tensor co-clustering method is equivalent to solving

ˆu = arg min u

1 2 x u 2 2 +

(i,j) Ed Ad,iju 2

According to the deﬁnition of Sd in (16), we deﬁne the penalty function R( ) such that

(i,j) Ed Ad,iju 2.

According to the singular value decomposition of Sd = UdΛd VT d , there exists a matrix Wd Rn n d such that Vd = [Wd, Vd] Rn n is an orthogonal matrix and WT d Vd = 0. Let αd = WT d u Rn d and βd = VT d u Rn. Clearly, we have

Wdαd + Vdβd = Wd WT d u + Vd VT d u = Vd V T d u = u, (19)

Chi, Gaines, Sun, Zhou, and Yang

for any d = 1, . . . , D. This fact together with the deﬁnition of Gd = UdΛd in (17) imply that solving our convex tensor clustering in (18) is equivalent to solving

min αd,βd,d=1,...,D

2D x Wdαd + Vdβd 2 2 + γd R(Gdβd) (20)

Denote the solution of (20) as ˆαd, ˆβd, d = 1, . . . , D, which corresponds to the estimator ˆu in (18) according to (19). Similarly, we denote the true parameters as α d, β d that corresponds to u deﬁned in Assumption 4.2. Our goal is to derive the upper bound of ˆu u 2 2 by above reparametrization. Since ˆαd, ˆβd, d = 1, . . . , D minimizes the objective function in (20), we have

2D x Wd ˆαd + Vdˆβd 2 2 + γd R(Gdˆβd)

2D x Wdα d + Vdβ d 2 2 + γd R(Gdβ d) .

Note that x ˆu 2 2 x u 2 2 = ˆu 2 2 u 2 2 2x T(ˆu u ) = ˆu u 2 2 + 2ϵT(ˆu u ), where the last equality is due to the model assumption x = u + ϵ. Therefore, we have

1 2 ˆu u 2 2 +

d=1 γd R(Gdˆβd) 1 2D

d=1 ϵT(u ˆu) +

d=1 γd R(Gdβ d)

ϵT[Wd(α d ˆαd) + Vd(β d ˆβd)] | {z }

d=1 γd R(Gdβ d). (21)

Next we derive the bound for f(ˆαd, ˆβd). Note that the optimization over αd in (20) has a closed-form since the penalty term is independent of αd. In particular, by setting the derivative of x Wdαd + Vdβd 2 2 with respect to αd to be zero, we obtain that αd = WT d (x Vdβd). This implies that

ˆαd = WT d (x Vdˆβd)

= WT d (Wdα d + Vdβ d + ϵ Vdˆβd) (22)

= α d + WT d ϵ,

where the second equality is due to x = u + ϵ and the last equality is due to the fact that WT d Vd = 0 and WT d Wd = I. According to (22), we have

f(ˆαd, ˆβd) = ϵTWd WT d ϵ + ϵTVd(β d ˆβd)]

ϵTWd WT d ϵ | {z } (I)

+ ϵTVd(β d ˆβd)] | {z } (II)

Provable Convex Co-clustering of Tensors

Bound (I): We apply the concentration inequality in Lemma 11 to bound (I). It remains to compute Wd WT d 2 F and Wd WT d 2. By construction, Wd WT d Rn n is a

projection matrix since Vd V T d = Wd WT d + Vd VT d = I. Therefore, the rank of Wd WT d is Q j =d nj, Wd WT d 2 F = Q j =d nj, Wd WT d 2 = 1, and tr(Wd WT d ) = Q j =d nj. Denote n = QD d=1 nd. By Lemma 11 and Assumption 4.2, we have

P ϵTWd WT d ϵ t + n d C log(n) exp C M 2 min t2

log(n)n d , t .

Setting t = p

n d log(n)2, we have

P ϵTWd WT d ϵ log(n) n d + n d C exp n log log(n) C M 2 log(n) o , (24)

where the right hand side converges to zero as the dimension n = QD d=1 nd . Note that our error ϵ in Assumption 4.2 is assumed to be a M-concentrated random variable. If we assume a stronger condition such that ϵ is a vector with iid sub-Gaussian, we can obtain a upper bound p

log(n)n d + n d according to the Hanson and Wright inequality (Hanson and Wright, 1971). Therefore, in spite of the relaxation in the error assumption, our bound in (24) is only up to a log-term larger. Bound (II): By deﬁnitions of Gd in (17) and G d, we have G d Gd = I. Furthermore, let G d,ij refer to the column of G d that corresponds to the index (i, j), and let Gd,ij refer to the row of Gd that corresponds to the index (i, j). We have

(II) = ϵTVd(β d ˆβd) = ϵTVd G d Gd(β d ˆβd) =

i<j ϵTVd G d,ij Gd,ij(β d ˆβd)

i<j ϵTVd G d,ij 2 Gd,ij(β d ˆβd) 2 max i<j ϵTVd G d,ij 2 | {z } II1

i<j Gd,ij(β d ˆβd) 2

Bound II1: By construction, ϵTVd G d,ij Rn d. We have

ϵTVd G d,ij 2 n d ϵTVd G d,ij ,

max i<j ϵTVd G d,ij 2 n d max i<j ϵTVd G d,ij = n d ϵTVd G d

Let ηj = e T j G d VT d ϵ R, where ej R(nd 2 )n d is the basis vector with the jth entry one and the rest zeros. According to Lemma 13 and the property of Vd which consists of singular vectors, we have σmax(Vd) = 1 and σmax(G d) = 1/ nd. Therefore, we have ηj is a M/ nd-concentrated random variable with mean zero. According to the deﬁnition of concentrated random variable in Deﬁnition 8, we have

P (|ηj| t1) C1 exp C2ndt2 1 M2

Chi, Gaines, Sun, Zhou, and Yang

Therefore, by union bound, we have

P max j |ηj| t1

(n d) exp C2ndt2 1 M2

By setting t1 =

log(n) log nd 2 n d

/nd, we have

log(n) log nd 2

for some constant C3 > 0. Hence with probability at least 1 C3/n, we have

n d log(n) log nd 2

Plugging the results in (24) and (25) into (23), we obtain that, for each d = 1, . . . , D

f(ˆαd, ˆβd) log(n) n d + n d +

n d log(n) log nd 2

i<j Gd,ij(β d ˆβd) 2.

Therefore, Assumption 4.3 on the tuning parameter γd implies that

f(ˆαd, ˆβd) log(n) n d + n d + Dγd

i<j Gd,ij(β d ˆβd) 2,

by noting that log( nd 2 n d) log(n2 dn d) 2 log(n). This combines with the inequality in (21) lead to

1 2 ˆu u 2 2 1 2D

log(n) n d + n d

d=1 γd R(Sdu ). (26)

According to the cluster structure assumption in Assumption 4.2, there are kd clusters along the dth mode of the tensor. Therefore, along each mode the true parameter U only has a few diﬀerent slices. Denote U i as the i-th mode-d subarray. Formally, we have

R(Sdu ) = X

(i,j),i<j,i,j=1,...,nd Ad,iju 2

(i,j),i<j,i,j=1,...,nd U i U j F 4C2 0

j =d kj, (27)

where C0 is a constant upper bound for the entries of U . Combining the inequalities in (26) and (27) with the condition on γd given in Assumption 4.3 implies that

1 2 ˆu u 2 2 1 2D

log(n) n d + n d + 3

2c0 log(n) n

Dividing both sides by n gives to the prediction error bound in (7). This ends the proof of Theorem 9.

Provable Convex Co-clustering of Tensors

Appendix D. Derivation of Lagrangian Dual

Let U d A denote the multiplication of U along mode d by the matrix A. Recall that for a tensor U Rn1 nd and a matrix A RL nd

vec(U d A) = (In D Ind+1 A Ind 1 In1)u,

where u = vec(U) = vec(U(1)), namely the column-major vectorization of the mode-1 matricization of the tensor U. So, Note that Y = U d A is equivalent to Y(d) = AU(d). We rewrite the penalty function Rd as follows.

l Ed wd,l U d d,l F = X

l Ed wd,l vec(U d d,l) 2 = X

l Ed wd,l Ad,lu 2,

where Ad,l = (In D Ind+1 d,l Ind 1 In1). We now write down the Lagrangian:

L(u, v, λ) = 1 2 x u 2 2 +

γwd,l vd,l 2 + λd,l, Ad,lu vd,l

2 x u 2 2 +

d=1 AT d λd, u

λd,l, vd,l γwd,l vd,l 2

2 x u 2 2 + ATλ, u

λd,l, vd,l γwd,l vd,l 2

The Lagrangian dual objective is given by G(λ) by minimizing the Lagrangian L(u, v, λ) over the primal variables u and v, namely

G(λ) = min u,v L(u, v, λ)

2 x u 2 2 + ATλ, u

l Ed max vd,l

λd,l, vd,l γwd,l vd,l 2

= 1 2 x 2 2 1

2 x ATλ 2 2

l Ed ιCd,l(λd,l), (28)

where ιCd,l is the indicator function of the closed convex set Cd,l = {z : z 2 γwd,l}. The last equality in (28) follows from the fact that the Fenchel conjugate of a norm is the indicator function of the unit dual norm ball. Recall that the Fenchel conjugate f of a function f is given by

f (λ) = sup v

λ, v f(v) .

Let B = {λ : λ 2 1} denote the unit ℓ2-norm ball. Since the ℓ2-norm is self dual, we arrive at the identity

ιB(λ) = sup v

Chi, Gaines, Sun, Zhou, and Yang

Appendix E. Projected Gradient Applied to the Lagrangian Dual

Note that the dual problem (8) has the form

minimize g(λ)

subject to λ C, (29)

where g(λ) is a convex and Lipschitz-diﬀerentiable function and the constraint set C is a closed convex set, which implies that every point λ possesses a unique orthogonal projection, PC(λ) = arg minθ C θ λ 2, onto C. When PC(λ) can be computed analytically, a simple and eﬀective iterative algorithm for solving problems like (29) is the projected gradient descent algorithm, a special case of proximal gradient descent algorithm (Combettes and Wajs, 2005; Combettes and Pesquet, 2011). Recall that projected gradient descent alternates between taking a gradient step and projecting onto the set C. Thus, at the mth iteration, we perform the following update

λ(m) = PC λ(m 1) η g(λ) , (30)

where η is a step-length parameter. Applying the update rule in (30) to the dual problem (8), we obtain the following rule for computing the mth iteration

u(m) = x ATλ(m 1)

λ(m) = PC λ(m 1) + ηAu(m) .

Note that, at the mth iteration, the gradient of the least squares objective in (8) is given by Au(m). Thus, we automatically update our Co Co estimator u(m) as part of our gradient calculation. Finally, we note that the projection onto the set C consists of independent projections onto the sets Cd,l that can be carried out in parallel.

E.1 Per-Iteration and Storage Costs

The gradient update is dominated by the matrix-vector multiplications ATλ and Au. Although A is a PD d=1|Ed|n d-by-n matrix it has only 2 PD d=1|Ed|n d non-zero elements. Thus, computing the gradient step requires O(PD d=1|Ed|n d) ﬂops. Projecting onto the set C also requires O(PD d=1|Ed|n d) ﬂops since projecting onto the set Cd,l requires O(n d) ﬂops. Thus, the per-iteration cost is O(PD d=1|Ed|n d) ﬂops. The storage cost is dominated by storing the dual variable λ, which has PD d=1|Ed|n d elements. At ﬁrst glance these storage and per-iteration costs may seem prohibitive, as |Ed| can be as large as O(n2 d) for a fully connected mode-d graph. Shrinking together all combinations of pairs of mode-d subarrays, however, typically produces poor clustering results in comparison to shrinking together mode-d subarrays that are nearest-neighbors as observed in prior work in convex clustering (Chen et al., 2015; Chi and Lange, 2015) and convex biclustering (Chi et al., 2017). Consequently, we employ sparse weights. Speciﬁcally, we keep positive weights between approximately nearest-neighbor mode-d subarrays so that |Ed| is O(nd). By using these sparse weights, the per-iteration and storage costs scale more reasonably as O(Dn), namely linearly in either the number of dimensions D or in the number of elements n. Details on our weights choices are elaborated in Section 6.

Provable Convex Co-clustering of Tensors

E.2 Convergence

The sequence of dual iterates λ(m) is guaranteed to converge to a solution ˆλ of (8) provided that the step-size parameter η is less than twice the reciprocal of the spectral radius of the matrix ATA (Combettes and Wajs, 2005, Theorem 3.4). Consequently, the sequence of primal iterates u(m) is guaranteed to converge to the Co Co estimator ˆu. We note that under the same step-size conditions, convergence of the sequence u(m) can also be guaranteed by observing that the projected gradient algorithm applied to the dual problem (8) is an example of the alternating minimization algorithm (Tseng, 1991, Proposition 2).

E.3 Monitoring Convergence via the Duality Gap

Recall that we can bound the suboptimality of the mth iterate, Fγ(u(m)) Fγ(ˆu), by the duality gap Fγ(u(m)) G(λ(m)), which can be expressed solely in terms of the mth iterate of the primal variable u(m), namely

Fγ(u(m)) G(λ(m)) = u(m) 2 2 x, u(m) + γ

l Ed wd,l Ad,lu(m) 2.

For any optimal dual solution ˆλ, the gap vanishes, namely Fγ(ˆu) = G(ˆλ). Note that computing the duality gap incurs minimal additional cost as u(m) and Ad,lu(m) are already computed as part of the gradient step. In short, including a duality gap computation will not change the O(Dn) per-iteration cost of the projected gradient algorithm. In practice, we can terminate the algorithm once the duality gap falls below some small tolerance.

E.4 Computing Mode-d Diﬀerence Variables

In Section 7.2, we explained how clustering assignments along the dth mode are made using the mode-d diﬀerence variables vd,l = U d d,l. In practice we must deal with the fact that the ˆu recovered by computing x ATˆλ may exhibit a nearly but not exactly checkerbox structure due to limitations in numerical precision. This creates a practical issue as a small but non-zero diﬀerence variable will lead to an incorrect clustering assignment. Addressing this issue, however, is simple. The projected gradient algorithm used to compute Co Co is a natural generalization of the projected gradient algorithm used in Chi and Lange (2015) for convex clustering. Consequently, we can use the obvious adaptation of the procedure for computing the diﬀerences variables in convex clustering. The following brief technical discussion is expanded in more detail in Chi and Lange (2015).

The key fact that we use is that the projected gradient algorithm is equivalent to the alternating minimization algorithm (AMA) applied to the following augmented Lagrangian function

Lη(u, v, λ) = 1

2 x u 2 2 +

h γwd,l vd,l 2 + λd,l, vd,l Ad,lu + η

2 vd,l Ad,lu 2 2 i .

Chi, Gaines, Sun, Zhou, and Yang

The mode-d diﬀerence vector vd,l is determined by the proximal map

vd,l = arg min vd,l

vd,l Ad,lu η 1λd,l 2 2 + γwd,l

= proxσd,l 2 Ad,lu η 1λd,l , (31)

where σd,l = γwd,l/η. Because the proximal mapping can produce mode-d diﬀerence variables that are exactly zero, the procedure for computing vd,l in (31) is immune to the numerical precision issues that hinder the direct computation ˆU d d,l.

Appendix F. Details on Denoising with the Tucker Decomposition for Setting Weights

Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. When applicable, a user can leverage problem-speciﬁc knowledge to select the rank for the decomposition. Nonetheless, the availability of an automatic approach is desirable to handle cases when such knowledge is unavailable. Selecting the rank in a tensor decomposition, however, is an open question (Kolda and Bader, 2009; Yokota et al., 2017). During initial experiments, a few diﬀerent methods for selecting the Tucker decomposition rank from the literature were compared: an L-curve approach that attempts to strike a balance between the decomposition s relative error and compression ratio, as implemented by the mlrankest function in the Tensorlab Matlab toolbox (Vervliet et al., 2016), minimum description length (Rissanen, 1978; Yokota et al., 2017), and the recentlyproposed SCORE algorithm (Yokota et al., 2017). Out of these, the SCORE algorithm produced the best average Co Co estimator performance. The SCORE algorithm itself includes a tuning parameter, ˆρ, and Yokota et al. (2017) suggest setting ˆρ [10 4, 10 2]. We considered ˆρ {10 4, 10 3, 10 2} and found 10 3 to perform the best, which also matches the value used in the experiments by Yokota et al. (2017). We also developed a simple yet eﬀective heuristic for choosing the rank where we set the Tucker rank for the dth mode to be the ﬂoor of nd/2. Two principles motivating the heuristic are that the rank of the decomposition should be both small relative to and also in proportion to the length of the modes. Both the SCORE algorithm and our heuristic were employed in our simulations described in Section 8 as a robustness check to ensure our Co Co estimator s performance does not crucially depend on the choice of the rank. The basic Tucker decomposition computation is accomplished by the higher order SVD (HOSVD) method (De Lathauwer et al., 2000) which computes for each mdoe k the rk leading left singular values of the mode-k matricization and stores them as a factor matrix Uk. The HOSVD then computes the core tensor by contracting the data tensor X k Uk. Thus, the main cost is computing D SVDs. This is an illustrative calculation, however, and more eﬃcient alternatives exist (Vannieuwenhoven et al., 2012; Minster et al., 2020).

Appendix G. CPD+k-means

We describe in greater detail the CPD+k-means method for co-clustering a D-way tensor X Rn1 n D. The method consists of two steps

Provable Convex Co-clustering of Tensors

Step 1. Compute a rank-R CP decomposition

i=1 a(1) i a(2) i a(D) i ,

where represents the outer product and a(d) i is the ith column of the dth factor matrix A(d) Rnd R.

Step 2. For each factor matrices A(d), apply k-means clustering on the nd rows of A(d). Note that the D applications of k-means are done independently for each mode-d factor matrix A(d).

Tuning parameters: There are two sets of tuning parameters: (i) the rank parameter R, used in Step 1, and (ii) the D cluster number parameters for each factor matrix, used in Step 2. To choose the rank parameter R, we create a candidate set of ranks Rcandidate {1, 2, 3, . . .} and select R Rcandidate using the tuning procedure in Sun et al. (2017). We then compute a CP decomposition using the selected rank R and obtain the factor matrices A(d) for d = 1, . . . , D. To choose the D cluster number parameters, we create D candidate sets of cluster numbers Kd candidate {1, 2, 3, . . . , nd} and select k d Kd candidate for d = 1, . . . , D using the gap statistic procedure in Tibshirani et al. (2001). We use the D clustering results from running k-means on the rows of each of the A(d) using k d.

Appendix H. Additional Simulations on Rectangular Tensors

The ﬁrst rectangular tensor is one in which there are two short modes (n1 = n2 = 10) and one relatively longer mode (n3 = 50). Figure 14 presents the clustering results for this tensor shape. At a lower noise level (σ = 2), Co Co performs very well and outperforms CPD+kmeans and Co Te C in terms of both single-mode clustering and co-clustering. When the noise level is bumped up (σ = 3), both methods experience a noticeable drop oﬀin their performance and now perform more similarly. Interestingly, Co Co s single-mode clustering results are better along the two shorter modes (modes 1 and 2), which is not what we expected. This provides some evidence that the performance along a mode depends on both the length of that mode as well as the lengths of the other modes. When the length of the shorter modes are increased slightly (from nd = 10 to nd = 20 for d = 1, 2), Co Co has near-perfect performance while CPD+k-means performs roughly the same as before. Thus, Co Co struggles with this tensor shape only when the short modes are really short (only 10 observations). To further investigate the mode-by-mode performance with rectangular tensors, we also apply the clustering methods to a Goldilocks tensor with mode lengths that are short, medium, and long. This setting was again motivated by the results from the previous two tensor shapes to see how the performance is impacted when the size of a longer mode is increased. The ARI results for this tensor shape are given in Figure 15d, and they are consistent with what was observed previously. When the short mode has only 10 observations, Co Co initially performs very well until the noise reaches a certain level. At this point, its performance for the longer modes declines sharply and actually performs worse

Chi, Gaines, Sun, Zhou, and Yang

2 3 2 3 (10, 10, 50) (20, 20, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(a) Adjusted Rand Index, Mode 1

2 3 2 3 (10, 10, 50) (20, 20, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(b) Adjusted Rand Index, Mode 2

2 3 2 3 (10, 10, 50) (20, 20, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(c) Adjusted Rand Index, Mode 3

2 3 2 3 (10, 10, 50) (20, 20, 50) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(d) Adjusted Rand Index, Co-clusters

Figure 14: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with two short modes and one longer mode. Average adjusted rand index plus/minus one standard error for diﬀerent noise levels and mode lengths.

than CPD+k-means, and this pattern is more pronounced for the longest mode (n3 = 100). The overall co-clustering performance for both methods remains similar, however. As before, Co Co does not experience as much of a decrease when the shortest mode is made slightly longer (n1 = 20), and does noticeably better than CPD+k-means for the most part.

Overall, from clustering these diﬀerent tensor shapes we see that Co Co still generally performs very well and better than CPD+k-means. The main issue it encounters is when at least one mode is very short (nd = 10). Co Co performs very well a lower noise levels but has a sharp decline in performance once the noise reaches a certain level. Unexpectedly, the decline in single-mode performance is worse for the longer modes. However, even when this happens, Co Co s overall co-clustering performance is still comparable to CPD+k-means. Additionally, this pattern is much less striking when the length of the shortest mode is increased slightly.

Provable Convex Co-clustering of Tensors

3 5 3 5 (10, 50, 100) (20, 50, 100) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(a) Adjusted Rand Index, Mode 1

3 5 3 5 (10, 50, 100) (20, 50, 100) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(b) Adjusted Rand Index, Mode 2

3 5 3 5 (10, 50, 100) (20, 50, 100) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(c) Adjusted Rand Index, Mode 3

3 5 3 5 (10, 50, 100) (20, 50, 100) σ = n1, n2, n3 =

Adjusted Rand Index

Co Co TD1 Co Co TD2 CPDkmeans Co Te C

(d) Adjusted Rand Index, Co-clusters

Figure 15: Checkerbox Simulation Results: Impact of Tensor Shape. Two balanced clusters per mode with two levels of homoskedastic noise for a tensor with short, medium, and long mode lengths. Average adjusted rand index plus/minus one standard error for diﬀerent noise levels and mode lengths.

Evrim Acar and B ulent Yener. Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering, 21(1):6 20, 2009.

Evrim Acar, Seyit A. C amtepe, and B ulent Yener. Collective sampling and analysis of high order tensors for chatroom communications. In International Conference on Intelligence and Security Informatics, pages 213 224. Springer, 2006.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773 2832, 2014.

Jerrod I. Ankenman. Geometry and analysis of dual networks on questionnaires, 2014.

Chi, Gaines, Sun, Zhou, and Yang

Brett W. Bader, Tamara G. Kolda, et al. Matlab tensor toolbox version 2.6. Available online, February 2015. URL http://www.sandia.gov/~tgkolda/Tensor Toolbox/.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009.

Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67(3):031902, 2003.

Anirban Bhar, Martin Haubrock, Anirban Mukhopadhyay, and Edgar Wingender. Multiobjective triclustering of time-series transcriptome data reveals key genes of biological processes. BMC Bioinformatics, 16(1):200, 2015.

Xuan Bi, Annie Qu, and Xiaotong Shen. Multilayer tensor factorization with applications to recommender systems. The Annals of Statistics, 46(6B):3308 3333, 2018.

Stanislav Busygin, Oleg Prokopyev, and Panos M. Pardalos. Biclustering in data mining. Computers and Operations Research, 35(9):2964 2987, 2008.

Xiaochun Cao, Xingxing Wei, Yahong Han, and Dongdai Lin. Robust face clustering via tensor decomposition. IEEE Transactions on Cybernetics, 45(11):2546 2557, 2015.

J. Douglas Carroll and Jih-Jie Chang. Analysis of individual diﬀerences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283 319, 1970.

Annie I. Chen and Asuman Ozdaglar. A fast distributed proximal-gradient method. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 601 608. IEEE, 2012.

Gary K. Chen, Eric C. Chi, John Michael O. Ranola, and Kenneth Lange. Convex clustering: An attractive alternative to hierarchical clustering. PLOS Computational Biology, 11(5): e1004228, 2015.

Jiahua Chen and Zehua Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759 771, 2008.

Jiahua Chen and Zehua Chen. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pages 555 574, 2012.

Eric C. Chi and Kenneth Lange. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994 1013, 2015.

Eric C. Chi and Stefan Steinerberger. Recovering trees with convex clustering. SIAM Journal on Mathematics of Data Science, 1(3):383 407, 2019.

Eric C. Chi, Genevera I. Allen, and Richard G. Baraniuk. Convex biclustering. Biometrics, 73(1):10 19, 2017.

Provable Convex Co-clustering of Tensors

Andrzej Cichocki, Danilo Mandic, Lieven De Lathauwer, Guoxu Zhou, Qibin Zhao, Cesar Caiafa, and Huy Anh Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32 (2):145 163, 2015.

Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-point Algorithms for Inverse Problems in Science and Engineering, pages 185 212. Springer, 2011.

Patrick L. Combettes and Val erie R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168 1200, 2005.

Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253 1278, 2000.

Narsingh Deo. Graph Theory with Applications to Engineering and Computer Science. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1974.

Jonathan Eckstein. A simpliﬁed form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block alternating direction method of multipliers. Journal of Optimization Theory and Applications, 173(1):155 182, Apr 2017.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348 1360, 2001.

Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.

Matan Gavish and Ronald R. Coifman. Sampling, denoising and compression of matrices by coherent matrix organization. Applied and Computational Harmonic Analysis, 33(3): 354 369, 2012.

Seymour Geisser. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320 328, 1975.

Tom Goldstein, Christoph Studer, and Richard Baraniuk. A ﬁeld guide to forward-backward splitting with a FASTA implementation. ar Xiv eprint, abs/1411.3406, 2014. URL http: //arxiv.org/abs/1411.3406.

Tom Goldstein, Christoph Studer, and Richard Baraniuk. FASTA: A generalized implementation of forward-backward splitting, January 2015. http://arxiv.org/abs/1501.04979.

Romain Guigour es, Marc Boull e, and Fabrice Rossi. Discovering patterns in time-varying graphs: A triclustering approach. Advances in Data Analysis and Classiﬁcation, pages 1 28, 2015.

Chi, Gaines, Sun, Zhou, and Yang

David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 15, pages 387 396, New York, NY, USA, 2015. ACM.

D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42:1079 1083, 1971.

Richard A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1 84, 1970.

John A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123 129, 1972.

Nhat Ho, Tianyi Lin, and Michael I. Jordan. On structured ﬁltering-clustering: Global error bound and optimal ﬁrst-order algorithms. ar Xiv:1904.07462 [stat.ML], 2019. URL https://arxiv.org/abs/1904.07462.

Toby D. Hocking, Armand Joulin, Francis Bach, and Jean-Philippe Vert. Clusterpath an algorithm for clustering using convex fusion penalties. In Lise Getoor and Tobias Scheﬀer, editors, 28th International Conference on Machine Learning, page 1. ACM, 2011.

Robert Hof. Study: Mobile Ads Actually Do Work - Especially In Apps. Forbes, August 27, 2014, August 2014. Last Accessed July 9, 2017 from https://www.forbes.com/sites/roberthof/2014/08/27/ study-mobile-ads-actually-do-work-especially-in-apps/#27ce654057aa.

Heng Huang, Chris Ding, Dijun Luo, and Tao Li. Simultaneous tensor subspace selection and clustering: The equivalence of high order SVD and k-means clustering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 327 335. ACM, 2008.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classiﬁcation, 2 (1):193 218, 1985.

Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, pages 1431 1439, 2014.

Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee. Approximation algorithms for tensor clustering. In International Conference on Algorithmic Learning Theory, pages 368 383. Springer, 2009.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455 500, 2009.

Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. In 2008 Eighth IEEE International Conference on Data Mining, pages 363 372, 2008.

Provable Convex Co-clustering of Tensors

Sangeetha Kutty, Richi Nayak, and Yuefeng Li. XML documents clustering using a tensor space model. Advances in Knowledge Discovery and Data Mining, pages 488 499, 2011.

Kenneth Lange, David R. Hunter, and Ilsoon Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1 20, 2000.

Laura Lazzeroni and Art Owen. Plaid models for gene expression data. Statistica Sinica, 12:61 86, 2002.

Mihee Lee, Haipeng Shen, Jianhua Z. Huang, and J. S. Marron. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087 1095, 2010.

Mu Li, David G. Andersen, and Alexander Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, volume 3, page 3, 2013.

Fredrik Lindsten, Henrik Ohlsson, and Lennart Ljung. Just relax and come clustering! A convexiﬁcation of k-means clustering. Technical report, Link opings Universitet, 2011. URL http://www.control.isy.liu.se/research/reports/2011/2992.pdf.

Ji Liu, Lei Yuan, and Jieping Ye. Guaranteed sparse recovery under linear transformation. In Sanjoy Dasgupta and David Mc Allester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 91 99. PMLR, 2013a.

Tianqi Liu, Ming Yuan, and Hongyu Zhao. Characterizing spatiotemporal transcriptome of human brain via low rank tensor decomposition. ar Xiv:1702.07449 [stat.ME], 2017. URL https://arxiv.org/abs/1702.07449.

Xinhai Liu, Shuiwang Ji, Wolfgang Gl anzel, and Bart De Moor. Multiview partitioning via tensor methods. IEEE Transactions on Knowledge and Data Engineering, 25(5): 1056 1069, 2013b.

Ping Ma and Wenxuan Zhong. Penalized clustering of large scale functional data with multiple covariates. Journal of the American Statistical Association, 103:625 636, 2008.

Sara C. Madeira and Arlindo L. Oliveira. Biclustering algorithms for biological data analysis: A survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1): 24 45, 2004.

Yuliya Marchetti and Qing Zhou. Solution path clustering with adaptive concave penalty. Electronic Journal of Statistics, 8(1):1569 1603, 2014.

Caitlin Mc Garry. Report: Google is the default i Phone search engine because it paid Apple $1 billion. Macworld, January 22, 2016, January 2016. Last Accessed July 9, 2017 from http://www.macworld.com/article/3025783/iphone-ipad/ report-google-is-the-default-iphone-search-engine-because-it-paid-apple-1-billion. html.

Nicolai Meinshausen and Peter B uhlmann. Stability Selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417 473, 2010.

Chi, Gaines, Sun, Zhou, and Yang

Rachel Minster, Arvind K. Saibaba, and Misha E. Kilmer. Randomized algorithms for low-rank tensor decompositions in the Tucker format. SIAM Journal on Mathematics of Data Science, 2(1):189 215, 2020.

Gal Mishne, Ronen Talmon, Ron Meir, Jackie Schiller, Maria Lavzin, Uri Dubin, and Ronald R. Coifman. Hierarchical coupled-geometry analysis for neuronal structure and activity pattern discovery. IEEE Journal of Selected Topics in Signal Processing, 10(7): 1238 1253, 2016.

Amy Mitchell, Tom Rosenstiel, Laura Houston Santhanam, and Leah Christian. Future of mobile news. Project for Excellence in Journalism (PEJ) Understanding News in the Information Age, 2012.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849 856, 2002.

Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu. S-HOT: Scalable High-Order Tucker Decomposition. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 761 770. ACM, 2017.

Wei Pan, Xiaotong Shen, and Binghui Liu. Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. Journal of Machine Learning Research, 14:1865 1889, 2013.

Evangelos E. Papalexakis, Nicholas D. Sidiropoulos, and Rasmus Bro. From K-Means to Higher-Way Co-Clustering: Multilinear Decomposition With Sparse Latent Factors. IEEE Transactions on Signal Processing, 61(2):493 506, 2013.

Kristiaan Pelckmans, Jos De Brabanter, Johan A.K. Suykens, and Bart L.R. De Moor. Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering Workshop, 2005.

Peter Radchenko and Gourab Mukherjee. Convex clustering via l1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1527 1546, 2017.

Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465 471, 1978.

Elizabeth D. Schifano, Robert L. Strawderman, and Martin T. Wells. Majorizationminimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258 1299, 2010.

James Sharpnack, Aarti Singh, and Alessandro Rinaldo. Sparsistency of the edge lasso over graphs. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artiﬁcial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1028 1036, 2012.

Yiyuan She. Sparse regression with exact clustering. Electronic Journal of Statistics, 4: 1055 1096, 2010.

Provable Convex Co-clustering of Tensors

Xiaotong Shen and Hsin-Cheng Huang. Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105(490):727 739, 2010.

Xiaotong Shen, Hsin-Cheng Huang, and Wei Pan. Simultaneous supervised clustering and feature selection over a graph. Biometrika, 99:899 914, 2012.

Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551 3582, 2017.

Martin Sill, Sebastian Kaiser, Axel Benner, and Annette Kopp-Schneider. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics, 27(15):2089 2097, 2011.

Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 111 147, 1974.

Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond streams and graphs: dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374 383. ACM, 2006.

Jimeng Sun, Spiros Papadimitriou, Ching-Yung Lin, Nan Cao, Shixia Liu, and Weihong Qian. Multivis: Content-based social network exploration through multi-way visual analysis. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 1064 1075. SIAM, 2009.

Will Wei Sun and Lexin Li. Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894 1907, 2019.

Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng. Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3): 899 916, 2017.

Panagiotis Symeonidis. Matrix and tensor decomposition in recommender systems. In Proceedings of the 10th ACM Conference on Recommender Systems, Rec Sys 16, pages 429 430, New York, NY, USA, 2016. ACM.

Panagiotis Symeonidis and Andreas Zioupos. Matrix and Tensor Factorization Techniques for Recommender Systems. Springer International Publishing, 1 edition, 2016.

Kean Ming Tan and Daniela Witten. Statistical properties of convex clustering. Electronic Journal of Statistics, 9:2324 2347, 2015.

Kean Ming Tan and Daniela M. Witten. Sparse biclustering of transposable data. Journal of Computational and Graphical Statistics, 23(4):985 1008, 2014.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267 288, 1996.

Chi, Gaines, Sun, Zhou, and Yang

Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411 423, 2001.

Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91 108, 2005.

Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. The Annals of Statistics, 39(3):1335 1371, 2011.

Paul Tseng. Applications of a Splitting Algorithm to Decomposition in Convex Programming and Variational Inequalities. SIAM Journal on Control and Optimization, 29(1): 119 138, 1991.

Ledyard R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279 311, 1966.

Heather Turner, Trevor Bailey, and Wojtek Krzanowski. Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics and Data Analysis, 48(2):235 254, 2005.

Nick. Vannieuwenhoven, Raf. Vandebril, and Karl. Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientiﬁc Computing, 34(2):A1027 A1052, 2012.

Nico Vervliet, Otto Debals, Laurent Sorber, Marc Van Barel, and Lieven De Lathauwer. Tensorlab 3.0, Mar. 2016. URL http://www.tensorlab.net. Available online.

Van Vu and Ke Wang. Random weighted projections, random quadratic forms and random eigenvectors. Random Structures and Algorithms Archive, 47(4):792 821, 2015.

Binhuan Wang, Yilong Zhang, Will Wei Sun, and Yixin Fang. Sparse Convex Clustering. Journal of Computational and Graphical Statistics, 27(2):393 403, 2018.

Yuxiang Wang, Huan Xu, and Chenlei Leng. Provable subspace clustering: When LRR meets SSC. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 64 72. Curran Associates, Inc., 2013.

Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN 978-0-387-98140-6. URL http://ggplot2.org.

Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515 534, 2009.

Stephen J. Wright, Robert D. Nowak, and M ario A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479 2493, July 2009.

Provable Convex Co-clustering of Tensors

Chong Wu, Sunghoon Kwon, Xiaotong Shen, and Wei Pan. A new algorithm and theory for penalized regression-based clustering. Journal of Machine Learning Research, 17(188): 1 25, 2016a.

Tao Wu, Austin R. Benson, and David F. Gleich. General tensor spectral co-clustering for higher-order data. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2559 2567. Curran Associates, Inc., 2016b.

Shuo Xiang, Xiaoshen Tong, and Jieping Ye. Eﬃcient sparse group feature selection via nonconvex optimization. In Sanjoy Dasgupta and David Mc Allester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 284 292, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR.

Or Yair, Ronen Talmon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865 E7874, 2017.

Tatsuya Yokota, Namgil Lee, and Andrzej Cichocki. Robust multilinear tensor rank estimation using higher order singular value decomposition and information criteria. IEEE Transactions on Signal Processing, 65(5):1196 1206, 2017.

Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49 67, 2006.

Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1601 1608. MIT Press, 2005.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894 942, 2010.

Zhong-Yuan Zhang, Tao Li, and Chris Ding. Non-negative tri-factor tensor decomposition with applications. Knowledge and Information Systems, 34(2):243 265, 2013.

Hongya Zhao, Debby D. Wang, Long Chen, Xinyu Liu, and Hong Yan. Identifying multidimensional co-clusters in tensors based on hyperplane detection in singular vector spaces. PLOS ONE, 11(9):1 27, 09 2016.

Xiaolin Zheng, Weifeng Ding, Zhen Lin, and Chaochao Chen. Topic tensor factorization for recommender system. Information Sciences, 372(Supplement C):276 293, 2016.

Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108:540 552, 2013.

Changbo Zhu, Huan Xu, Chenlei Leng, and Shuicheng Yan. Convex optimization procedure for clustering: Theoretical revisit. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1619 1627. Curran Associates, Inc., 2014.

Chi, Gaines, Sun, Zhou, and Yang

Yunzhang Zhu, Xiaotong Shen, and Wei Pan. Simultaneous grouping pursuit and feature selection over an undirected graph. Journal of the American Statistical Association, 108 (502):713 725, 2013.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418 1429, 2006.

Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509 1533, 2008.