# Fast Sampling-Based Sketches for Tensors

William Swartworth 1, David Woodruff 1

1Carnegie Mellon University. Correspondence to: William Swartworth, David P. Woodruff. Research of William Swartworth and David P. Woodruff was supported by a Simons Investigator Award, a Google Faculty Award, and NSF Grant No. CCF-2335412. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We introduce a new approach for applying sampling-based sketches to two- and three-mode tensors. We illustrate our technique by constructing sketches for the classical problems of $\ell_0$ sampling and producing $\ell_1$ embeddings. In both settings we achieve sketches that can be applied to a rank-one tensor in $(\mathbb{R}^d)^{\otimes q}$ (for $q = 2, 3$) in time scaling with $d$ rather than $d^2$ or $d^3$. Our main idea is a particular sampling construction based on fast convolution, which allows us to quickly compute sums over sufficiently random subsets of tensor entries.

1. Introduction

In the modern era of enormous data sets, space is often at a premium, and one would like algorithms that either avoid storing all available data, or that compress existing data. A common and widely applied strategy is sketching. Given a vector $x \in \mathbb{R}^n$ consisting of the relevant data, a (linear) sketch of $x$ is given by $Sx$, where $S$ is a linear map down to a dimension much smaller than $n$. Typically the goal is to design $S$ so that some useful statistic of $x$ can be computed from the sketched vector $Sx$, even though most of the information about $x$ has been discarded.

Linear sketches are particularly useful in the context of streaming algorithms, since linear updates to $x$ can be translated to the sketch, simply by sketching the vector of updates and adding it to the previous value of the sketch. Sketches have also found important applications in speeding up linear-algebraic (and related) computations (Woodruff et al., 2014). Here, the idea is to first apply a sketch to reduce the dimensionality of the problem, while approximately preserving a quantity of interest (e.g., a regression solution). Then standard algorithms can be applied to the smaller problem, resulting in improved runtimes.

A lot of work on sketching has focused on efficiently applying sketches to structured data. For example, if the underlying data is sparse, one might hope for a sketch that can be applied in input-sparsity time. A different type of structure that has been widely studied in this context is tensor structure. A line of work has been devoted to developing fast tensor sketches (Pham & Pagh, 2013; Ahle et al., 2020; Ahle & Knudsen, 2019; Meister et al., 2019), which are sketches that can be applied quickly to low-rank tensors. A rank-one tensor in $\mathbb{R}^n \otimes \mathbb{R}^n$, for example (i.e., a rank-one matrix), requires only $O(n)$ parameters to specify. However, a naive sketch might require $O(n^2)$ time to apply if each entry of the tensor must be calculated to form the sketch. The goal with tensor sketches is to do better than expanding into vector form prior to sketching.

Much of the work on sketching tensors has focused on $\ell_2$-norm settings. The most studied example is Johnson-Lindenstrauss (JL) embeddings, which have been studied extensively for tensors. For JL sketches, the ideal sketch is a dense Gaussian sketch. However, such sketches are slow to apply, and so the main interest is in approximating the properties of a Gaussian sketch with a sketch that can be applied faster.
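As a concrete illustration of the linearity property that makes sketches stream-friendly, here is a minimal numpy sketch (our own example, not code from the paper): accumulating the sketches of individual updates yields exactly the sketch of the final vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 64
S = rng.standard_normal((m, n))       # a dense Gaussian sketching matrix

x = np.zeros(n)
sk = np.zeros(m)
for _ in range(1000):                 # stream of point updates (i, delta)
    i = rng.integers(n)
    delta = rng.standard_normal()
    x[i] += delta
    sk += delta * S[:, i]             # sketch the update, add to the running sketch

assert np.allclose(sk, S @ x)         # linearity: sum of sketched updates = sketch of x
```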
We study a completely different type of sampling-based sketch, which is embodied by the $\ell_0$ sampling problem. The goal here is to observe a linear sketch of $x$, and then to (nearly) uniformly sample an element from the support of $x$. A dense Gaussian sketch is completely useless here, as we would like a row of our sketch to single out an element of $\mathrm{supp}(x)$. To allow for this, the standard idea is to take a sketch which performs sparse samples at various scales. We ask:

Are there sampling-based sketches which can be applied quickly to rank-one tensors?

For tensors with two and three modes (which cover many tensors of practical importance), we provide a new approach for constructing sampling-based sketches in the tensor setting. To illustrate our approach, we focus on two fundamental problems: $\ell_0$ sampling and $\ell_1$ embeddings. We recall the setup for these problems.

$\ell_0$ Sampling. For the $\ell_0$ sampling problem, one would like to construct a distribution over sketching matrices $S$ so that observing $Sx$ allows us to return a (nearly) uniformly random element from the support of $x$. We allow constant distortion and failure probability $\delta$: conditioned on not failing (which must occur with probability at least $1 - \delta$), we should have that for all $i$ in the support of $x$, the sampling algorithm outputs $i$ with probability in $[c_1/|\mathrm{supp}(x)|,\ c_2/|\mathrm{supp}(x)|]$.

$\ell_1$ Embeddings. For the $\ell_1$ embedding problem, the goal is to construct a sketch $S$ such that for any $x \in \mathbb{R}^n$ we have $c_1 \|x\|_1 \le \|Sx\|_1 \le c_2 \|x\|_1$ for absolute constants $0 < c_1 < c_2$.

1.1. Our Results

Our main idea is a way to subsample a tensor at a given sampling rate $p$, and then to sum over the resulting sample. Specifically, we introduce a sampling primitive that we call a $p$-sample, which samples each element with probability approximately $p$, and does so in a nearly pairwise-independent manner. As we show, this is sufficient for constructing both $\ell_0$-samplers and $\ell_1$ embeddings. Constructing $p$-samples is straightforward for tensors that are given entrywise: one can simply sample each entry independently with probability $p$. Our key novelty is in showing that it is possible to construct $p$-samples for two- and three-mode tensors which, for rank-one tensors, can be summed over in nearly linear time. We discuss the ideas behind these constructions below.

Given our sampling primitive, we show how to construct $\ell_0$ samplers and $\ell_1$ embeddings. For the $\ell_0$ sampling problem we show the following result:

Theorem 1.1. For $q = 2, 3$ there is a linear sketch of $X \in \mathbb{R}^{n^q}$ with sketching dimension $m = O(\log\frac{1}{\delta}\,\log^2 n\,(\log\log n + \log\frac{1}{\delta}))$, and a sampling algorithm that succeeds with probability $1 - \delta$ and, conditioned on succeeding, outputs an index $i \in \mathrm{supp}(X)$. Conditioned on succeeding, the probability that the algorithm outputs $i \in \mathrm{supp}(X)$ is in $[\frac{0.5}{|\mathrm{supp}(X)|}, \frac{2}{|\mathrm{supp}(X)|}]$. The algorithm also returns the value of $X_i$. Moreover, the entries of the sketching matrix can be taken to be in $\{0, +1, -1\}$. For $q = 2$, our sketch can be applied to rank-one tensors in $O(mn)$ time, and for $q = 3$ it can be applied in $O(mn \log^2 n)$ time.

For $\ell_1$ embeddings, we show the following:

Theorem 1.2. There is an $O(1)$-distortion $\ell_1$ embedding sketch $S : \mathbb{R}^n \to \mathbb{R}^m$ with sketching dimension $m = O(\log^4 n + \log^2(1/\delta)\log n)$ that satisfies $\|Sx\|_1 \le c_2 \|x\|_1$ with constant probability, and satisfies $\|Sx\|_1 \ge c_1 \|x\|_1$ with probability at least $1 - \delta$. Moreover, our sketch can be applied to rank-one tensors in $\mathbb{R}^k \otimes \mathbb{R}^k$ in $O(mk)$ time, and to rank-one tensors in $\mathbb{R}^k \otimes \mathbb{R}^k \otimes \mathbb{R}^k$ in $O(mk \log^2 k)$ time.
Our main novelty here lies in the $p$-sample construction, which is applied in a similar way as for our $\ell_0$-sampler, although the details are more complicated. We therefore defer the proof of this result to the appendix.

Regression. We note that this error guarantee (no contraction with high probability, and no dilation with constant probability) is known to be sufficient to construct algorithms for solving $\ell_1$ regression. In particular, by setting $\delta = \exp(-d)$, a standard net argument, given in (Clarkson & Woodruff, 2014) for example, shows that no contraction holds with high probability over a $d$-dimensional subspace. As long as no dilation holds for the solution vector, it is well known that this yields a dimension reduction for $\ell_1$ regression. Thus our $\ell_1$ sketch can be used to speed up sketching for an $\ell_1$ regression problem of the form $\min_x \|Ax - b\|_1$, where $A$ has $d$ columns, each of which has low-rank structure.

Finally, while we choose to illustrate our approach on the problems of $\ell_0$ sampling and $\ell_1$ embeddings, we believe that this technique could be applied more generally whenever there is a known sketch that is built from random subsampling followed by taking random linear combinations within the sample.

1.2. Techniques

The main idea behind both $\ell_0$-sampling and $\ell_1$-embeddings is to subsample coordinates at various scales in order to isolate some coordinates of $x$. As a warmup, we begin by describing $\ell_0$ sampling. Then we describe how our techniques extend to constructing $\ell_1$ embeddings. We note that our basic approach to constructing $\ell_0$ samplers and $\ell_1$ embeddings is not novel. However, our ability to quickly sketch rank-one tensors is.

$\ell_0$ Sampling. In the general form of $\ell_0$ sampling, we are given a vector (or in our case a tensor) $X$, and the goal is to design a sketch that allows us to sample an entry of $X$ nearly uniformly from the support of $X$. The idea is to take a random sample $S$ of $X$'s entries and then to store a random linear combination of the values in $S$ as an entry of the sketch. If $S$ intersects $\mathrm{supp}(X)$ in a single element, then this allows us to recover a value in $\mathrm{supp}(X)$. If $\mathrm{supp}(X)$ has size $k$, then we would like $S$ to sample each value with probability roughly $1/k$ in order to have a good probability of isolating a single element. Since $k$ is initially unknown to us, the idea is to perform the sampling procedure at each of $\log n$ sampling levels $1, \frac{1}{2}, \frac{1}{4}, \dots, \frac{1}{n}$.

The challenge for us is to design samples $S$ that allow for fast sketching on rank-one tensors, e.g., tensors of the form $x \otimes y$. As we later show, the step of computing a random linear combination of the values in $S$ can be achieved simply by randomly flipping the signs along each mode. The harder part is designing $S$ so that we can sum the resulting tensor over $S$ in roughly $O(n)$ time. In other words, we need $S$ to be such that we can quickly compute $\sum_{(i,j) \in S} x_i y_j$. If $S$ just samples each entry of $X$ i.i.d. with probability $p$, then it is not clear how to do better than computing the sum term by term, which would require $\Omega(pn^2)$ time in our example. The next most natural thing to try is to take $S = S_1 \times S_2$ to be a random rectangle, since rectangles allow the above sum to be computed in $O(n)$ time. To achieve sampling probability $p$, one could take a rectangle of dimension $\sqrt{p}\,d \times \sqrt{p}\,d$, where the subsets of indices $S_1$ and $S_2$ corresponding to each dimension are chosen randomly.
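A rectangle admits $O(d)$ summation because the sum factors across the two modes. A minimal numpy check of this (our own illustration; the variable names are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512
x, y = rng.standard_normal(d), rng.standard_normal(d)

p = 1 / 64                                   # target per-entry sampling probability
side = int(np.sqrt(p) * d)                   # sqrt(p) * d indices per mode
S1 = rng.choice(d, size=side, replace=False)
S2 = rng.choice(d, size=side, replace=False)

# O(d) evaluation: the sum over the rectangle S1 x S2 factors across modes.
fast = x[S1].sum() * y[S2].sum()

# Brute force over the d x d rank-one tensor for comparison.
slow = np.outer(x, y)[np.ix_(S1, S2)].sum()
assert np.isclose(fast, slow)
```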
This construction succeeds in sampling each entry of $X$ with probability $p$; however, it is not sufficient for isolating a single entry of $\mathrm{supp}(X)$ with good probability. For example, $\mathrm{supp}(X)$ could consist of a single row of $X$. If we set $p = 1/d$ then $S$ has a good probability of sampling some entry from $\mathrm{supp}(X)$. However, when this occurs, $S$ nearly always contains more than one such element. Indeed, $S_2$ would typically contain around $\sqrt{d}$ elements of that row, and in this situation it is not possible to sample a single entry from a row of $X$.

One could try to fix this by randomizing the sampling scale on each side of the rectangle for a fixed sampling probability. In the case of $\ell_0$ sampling, this would at least result in worse logarithmic factors in the sketching dimension than what we achieve. In the case of $\ell_1$ embeddings, it is unclear how to obtain a constant-factor embedding with this approach. For instance, in the two-mode case, a given sampling probability $p$ can be realized by $O(\log d)$ choices of scales for the two rectangle dimensions. If a given level set of $X$ has size $1/p$ and is distributed uniformly throughout $X$, then all rectangles with sampling probability $p$ will pick out single elements of that level. On the other hand, if the level is distributed along a single one-dimensional fiber, then it might be the case that only one of the rectangles is good for that level. This suggests that the distortion would likely need to be $\Omega(\log d)$ on some inputs. It does not seem clear how one could arrange for an $O(1)$-distortion $\ell_1$ embedding that handles these situations simultaneously.

Surprisingly, at least for two- and three-mode tensors, it is possible to construct samples $S$ that have better sampling properties than rectangles, but which can still be summed over in nearly linear time. For brevity, we refer to the sampling that we need as a $p$-sample when it samples each index with probability (approximately) $p$. The idea is to design $S$ in such a way that we can employ algorithms for fast convolution using the Fast Fourier Transform. For example, in the two-mode case, one can compute
$$\sum_{i+j \in [0,\, T-1]} x_i y_j$$
for a constant $T$ (chosen to give the appropriate sampling probability) in $O(n \log n)$ time, by first calculating the convolution $x * y$ and then summing over the indices in $[0, T-1]$. This only gives a sum over a fixed set, but by randomly permuting the indices along each mode and choosing $T$ appropriately, this allows us to achieve sampling probabilities down to $p = 1/n$. Smaller sampling probabilities result in samples of size $O(n)$, and so the sum can just be calculated explicitly. (As it turns out, in the two-mode case, the runtime can be improved to $O(n)$ by a simple optimization.)

The three-mode case is somewhat more complicated. A similar trick involving a convolution of three vectors allows us to construct $p$-samples with fast summation down to sampling probability $1/n$, and sampling probabilities below $1/n^2$ allow for direct computation. What about sampling probabilities between $1/n^2$ and $1/n$? Here, we show how to compute sums of the form
$$\sum_{\substack{i+j+k=0 \\ j-i \in [0,\, T-1]}} x_i y_j z_k$$
in $O(n \log^2 T)$ time. The rough approach is to show that we can reduce computing this quantity to $n/T$ instances of multiplication by the top corner ($\{(i,j) : j \ge i\}$) of a roughly $T \times T$ Toeplitz matrix. Each top-corner matrix can be decomposed, up to some zero-padding, into a sum of Toeplitz-like matrices across $O(\log T)$ levels of recursion, where each level admits $O(T \log T)$ total multiplication time.
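A quick numpy sanity check of the two-mode convolution trick above (our own minimal sketch): summing the first $T$ entries of the circular convolution $x * y$ agrees with the brute-force sum over the band.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 256, 8
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Circular convolution via FFT: conv[t] = sum_{i+j = t (mod n)} x_i y_j.
conv = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real
fast = conv[:T].sum()                    # sum over {(i,j) : (i+j) mod n in [0, T-1]}

# Brute force over all n^2 entries of the rank-one tensor x ⊗ y.
I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
band = ((I + J) % n) < T
slow = np.outer(x, y)[band].sum()
assert np.isclose(fast, slow)
```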
$\ell_1$ Embeddings. For $\ell_1$ embeddings, our approach roughly follows the sketch introduced by (Verbin & Zhang, 2012) for producing constant-distortion $\ell_1$ embeddings. The main idea is to think of a vector $x \in \mathbb{R}^n$ as decomposed into approximate level sets, with the size of entries in each level set decreasing exponentially. For instance, when $\|x\|_1 = 1$, the level set $L_i$ could be taken to contain the entries with magnitude in $[q^{i+1}, q^i]$ for some $q < 1$. For each level set, our sketch has an associated level with sampling probability larger than $q^i$, which is designed to capture most of the mass of $L_i$. Since we would like our sampling level to capture the mass of $L_i$ with high probability, we need to oversample the entries of $L_i$ substantially. The usual approach here is to choose a fairly large $q$, and then hash the sampled values into separate buckets so as to minimize cancellations. This is typically done with CountSketch; however, it is not clear how to efficiently compose a CountSketch with our fast sample constructions. Therefore we take a slightly different approach. Instead of hashing each sampling level into $T$ buckets, we simply take $T$ independent samples, each with sampling probability $q/T$. The analysis is quite similar to the analysis using CountSketch, but we retain the ability to quickly sketch rank-one tensors. However, we do pay a price: our sketching time scales linearly with the size of the sketch. In contrast, for vectors, CountSketch constructions allow for sketching in time that scales as $O(n \cdot (\text{number of sampling levels}))$. However, these sketches are not efficient for rank-one tensors, since $n$ would be replaced by $n^2$ or $n^3$.

To avoid too much dilation, the rough idea is that each sampling level picks up roughly the right amount of mass from its corresponding $L_i$ with good probability. Also, large levels have a small number of elements, and so they are unlikely to be picked up by a smaller sampling level. The main issue is in bounding the contribution of the $L_i$'s containing many small values. To avoid this, (Verbin & Zhang, 2012) introduced the idea of hashing each sampling level into buckets with random signs, in order to induce cancellation among the contributions from such $L_i$'s. We employ the same technique here. However, in order to make applying the signs efficient, our approach is to first apply random signs along the modes of the tensor, and then to sum over appropriate $p$-samples in order to compute a "bucket". For a constant number of modes, this turns out to be enough randomness to induce cancellation.

This outline for constructing $\ell_1$ embeddings has appeared in the literature several times. Our main novelty lies in constructing samples that admit fast linear combinations for rank-one tensors.

Finally, one might wonder if our techniques are really necessary. One could hope to design fast sketches using Kronecker-structured sketches of the form $S_1 \otimes S_2$, which are easy to apply to a rank-one tensor $x \otimes y$: one simply computes $S_1 x \otimes S_2 y$. It does not appear that a Kronecker-structured sketch can match our bounds. Indeed, by a lower bound given in (Li et al., 2021), an $O(1)$-distortion $\ell_1$ embedding for a single vector $x \in \mathbb{R}^n$ requires at least $\mathrm{poly}(n)$ sketching dimension for Kronecker-structured sketches. In contrast, our sketch can still be applied to rank-one tensors in near-linear time, yet requires only $\mathrm{poly}\log n$ space to embed $x$.
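The Kronecker identity used above, $(S_1 \otimes S_2)(x \otimes y) = (S_1 x) \otimes (S_2 y)$, is easy to check numerically (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 64, 8
S1, S2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Applying S1 ⊗ S2 to the flattened rank-one tensor x ⊗ y the naive way ...
lhs = np.kron(S1, S2) @ np.kron(x, y)    # O(m^2 n^2) work
# ... agrees with the factored O(mn) computation (S1 x) ⊗ (S2 y).
rhs = np.kron(S1 @ x, S2 @ y)
assert np.allclose(lhs, rhs)
```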
One could also ask whether Khatri-Rao sketches (i.e., sketches in which each row is a rank-one tensor) could be applied to match our sketching dimension. While we do not provide a lower bound against Khatri-Rao measurements, we also do not see how to match our bounds for either $\ell_0$ sampling or $\ell_1$ embeddings. We leave the existence of such a lower bound as an interesting open question.

For larger $q$, we also note that our $\ell_1$ embedding procedure could be applied to triples of modes at a time, giving a tree construction similar to (Ahle et al., 2020). Unfortunately, however, the distortion grows like $2^{\log_3 q}$, and the success probability becomes $c^q$ by a union bound. Whether these parameters can be improved while allowing for fast sketching is an interesting problem.

1.3. Additional Open Questions

An interesting open question is whether a similar set of ideas can be applied to tensors with a larger number of modes. For four modes, the most natural approach would be to attempt to develop a fast summation over (a subset of) a codimension-2 subspace of the tensor entries. For example, we might wish to calculate sums of the form
$$\sum_{(i,j,k,\ell) \in Q} w_i x_j y_k z_\ell$$
where $Q$ is a set of entries satisfying two linear constraints. Unfortunately, we are not aware of a way to calculate such a sum faster than $O(n^2)$ time. It would be interesting to either give a fast summation algorithm, or to find a new technique that gets around this issue.

1.4. Related Work

A substantial literature has been devoted to obtaining fast sketches for tensors. Tensor sketching was initiated in (Pham & Pagh, 2013), where it was shown that CountSketch can be applied quickly to rank-one tensors using the Fast Fourier Transform. More recently, $\ell_2$ embeddings were constructed by (Ahle & Knudsen, 2019) and improved in (Ahle et al., 2020), with applications to sketching polynomial kernels. In earlier work, (Indyk & McGregor, 2008) gives a Kronecker-structured sketch for the $\ell_1$ norm in the context of independence testing, using a variation on the previously known Cauchy sketch (Indyk, 2006). We note that this sketch requires taking a median in order to recover the $\ell_1$ norm, and thus does not give an $\ell_1$ embedding, which may be more suitable for optimization applications. (Indyk, 2006) studies the problem of $\ell_1$ estimation using Cauchy sketches. Later, (Verbin & Zhang, 2012) gave a construction of an oblivious $\ell_1$ embedding; our general approach for constructing $\ell_1$ embeddings is largely based on theirs. This approach was generalized in (Clarkson & Woodruff, 2014) to M-estimators, and in particular was applied to construct $\ell_1$ (and more general) subspace embeddings. These bounds for $\ell_1$ embeddings were recently improved in (Munteanu et al., 2021). (Li et al., 2021) expanded on this line of work and considered independence testing for higher-order tensors. They also give a $\mathrm{poly}(d)$ lower bound for constructing $\ell_1$ embeddings of a single vector in $\mathbb{R}^d$ using Kronecker-structured measurements.

1.5. Notation and Preliminaries

We use the notation $[N]$ to refer to the set $\{1, \dots, N\}$. For two vectors $x$ and $y$, the notation $x \otimes y$ refers to the Kronecker product of $x$ and $y$. We use the notation $x * y$ for the circular convolution of $x$ and $y$; that is, $(x * y)_k = \sum_{i+j=k} x_i y_j$, where the sum $i + j$ is interpreted mod $N$. In such situations it is sometimes convenient to index vectors starting at 0, so we will do this occasionally. However, unless otherwise stated, we index starting at 1.
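For intuition on the median-based Cauchy sketch referenced in the related work above, here is a rough numerical sketch (this is the classical construction from (Indyk, 2006), not this paper's sketch; the parameter choices are our own): each coordinate of $Sx$ is distributed as $\|x\|_1$ times a standard Cauchy variable, so the median of $|Sx|$ estimates $\|x\|_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1000, 400
x = rng.standard_normal(n)

# Each row of S is i.i.d. standard Cauchy, so (Sx)_i ~ ||x||_1 * Cauchy.
S = rng.standard_cauchy((m, n))
est = np.median(np.abs(S @ x))        # median of |Cauchy| is 1, so this ~ ||x||_1
assert abs(est - np.abs(x).sum()) < 0.2 * np.abs(x).sum()
```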
We typically identify a $q$-mode tensor with a vector in $\mathbb{R}^{n^q}$, assuming that the dimension along each tensor mode is $n$. We will typically make this assumption for the sake of convenience. We say that a tensor in $\mathbb{R}^{n^q}$ has rank one if it can be expressed as $x_1 \otimes \dots \otimes x_q$ for some vectors $x_1, \dots, x_q$. We use $c$ throughout to refer to an absolute constant, which might be different between uses (even within an equation). The notation $\tilde{O}(f)$ means $O(f \log^c f)$ for some constant $c$.

$\ell_0$ sampling. $\ell_0$ sampling has received extensive attention. (Cormode & Firmani, 2014) surveys the standard recipe for constructing $\ell_0$-samplers, which we roughly follow here. (Woodruff & Zhang, 2018) considers solving the $\ell_0$-sampling problem in a distributed setting for a product of two matrices held on different servers; however, this differs from our setting, as their communication scheme is not a linear sketch.

2. p-sample Constructions

We first define our main sampling primitive, which we call a $p$-sample. This can be viewed as a slightly weaker version of a pairwise-independent sample.

Definition 2.1. Let $T$ be an arbitrary finite set. We say that a random subset $S$ of $T$ is a $p$-sample for $T$ if
1. For all entries $i \in T$, $p/2 \le \Pr(i \in S) \le p$.
2. For all $j \ne i$ in $T$, $\Pr(j \in S \mid i \in S) \le 2p$.

Definition 2.2. Fix $n$ and $m$, and suppose that $S$ is a subset of $[n]^m = [n] \times \dots \times [n]$. Let $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^n$ be arbitrary vectors. We say that $S$ admits fast summation if there is an algorithm which computes
$$\sum_{(i_1, \dots, i_m) \in S} x^{(1)}_{i_1} x^{(2)}_{i_2} \cdots x^{(m)}_{i_m}$$
in $\tilde{O}(mn)$ time.

To perform one of our constructions, we will need a particular type of Toeplitz-like fast matrix multiplication.

Lemma 2.3. Let $A \in \mathbb{R}^{n \times n}$ be a matrix where $A_{i,j} = A_{i+1, j+2}$ for all $0 \le i < n-1$ and $0 \le j < n-2$. Let $B$ be defined by $B_{i,j} = A_{i,j}$ if $j \ge i$ and $B_{i,j} = 0$ otherwise. Given $v \in \mathbb{R}^n$, the matrix product $Av$ can be computed in $O(n \log n)$ time, and the product $Bv$ can be computed in $O(n \log^2 n)$ time.

Proof. To see that $A$ admits fast multiplication, note that the rows of $A$ coincide with the even rows of a Toeplitz matrix of size $2n \times 2n$. Since multiplication by a Toeplitz matrix can be carried out in $O(n \log n)$ time (Kailath & Sayed, 1999), the same holds for $A$.

To get a fast algorithm for $B$, we decompose it into a sum of matrices, each with the same structure as $A$ up to some zero-padding. We call these matrices $B_1$, $B_2$, and $B_3$. Take $B_1$ to be $B$ but with all entries outside of $\{(i,j) : 1 \le i \le \lfloor n/2 \rfloor,\ \lceil n/2 \rceil \le j \le n\}$ replaced with 0. Visually, the support of $B$ is a right triangle, and the support of $B_1$ corresponds to the largest square inscribed in that triangle. Let $B_2$ and $B_3$ correspond to the two triangles that are left after removing the square. That is, $B_2$ has support contained in $\{(i,j) : i > \lfloor n/2 \rfloor\}$ and $B_3$ has support contained in $\{(i,j) : j < \lceil n/2 \rceil\}$. By construction, $B = B_1 + B_2 + B_3$. Then $B_1$ has the structure of $A$ from the first part of the lemma, so it admits $O(n \log n)$-time multiplication. After removing the zero-padding, both $B_2$ and $B_3$ have the same structure as $B$ but with half the dimension. Thus we have a runtime recurrence of the form $T(n) = 2T(\lceil n/2 \rceil) + O(n \log n)$, which gives an overall runtime of $O(n \log^2 n)$.
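The first part of Lemma 2.3 is easy to check numerically. The following self-contained numpy sketch (our own illustration; the helper name is hypothetical) multiplies such a matrix $A$ by reading off the even rows of a $2n \times 2n$ Toeplitz matrix, using the standard circulant-embedding FFT trick:

```python
import numpy as np

def toeplitz_matvec(col, row, v):
    """O(n log n) product of an n x n Toeplitz matrix (first column col,
    first row row, with col[0] == row[0]) with v, via circulant embedding."""
    n = len(col)
    c = np.concatenate([col, [0.0], row[:0:-1]])   # first column of a 2n circulant
    u = np.concatenate([v, np.zeros(n)])
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(u)).real[:n]

rng = np.random.default_rng(6)
n = 64
f = rng.standard_normal(4 * n)          # A_{i,j} will depend only on 2i - j

# A_{i,j} = f[2i - j + 2n] satisfies A_{i,j} = A_{i+1, j+2}, as in Lemma 2.3.
I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
A = f[2 * I - J + 2 * n]

# The rows of A are the even rows (first n columns) of the 2n x 2n Toeplitz
# matrix M_{r,c} = f[r - c + 2n], so Av can be read off of M @ [v; 0].
col = f[np.arange(2 * n) + 2 * n]       # M_{r,0}
row = f[2 * n - np.arange(2 * n)]       # M_{0,c}
v = rng.standard_normal(n)
Mv = toeplitz_matvec(col, row, np.concatenate([v, np.zeros(n)]))
assert np.allclose(Mv[0::2], A @ v)
```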
Theorem 2.4. For $m = 2, 3$, and for all $n$ and $p$, there exists a $p$-sample for $[n]^m$ that admits fast summation. Additionally:
1. When $m = 2$ the summation runs in $O(n)$ time.
2. When $m = 3$ the summation runs in $O(n \log^2 n)$ time.

Proof. We first show how to construct sets of indices that can be summed over quickly. We will then show how to use these sets to construct $p$-samples with fast summation.

$m = 2$ Constructions. Let $T$ be an arbitrary positive integer, and consider the set $A_T = \{(i,j) \in [n] \times [n] : i + j \in [T]\} \subseteq (\mathbb{Z}/n)^2$. (Note that we are treating all indices as values in $\mathbb{Z}/n$.) We will first show that $A_T$ admits $O(n)$ summation for all $T$. We would like to compute a sum of the form
$$\sum_{i+j \in [T]} x_i y_j = \sum_i x_{-i} \sum_{j :\, j - i \in [T]} y_j = \sum_i x_{-i} \sum_{j \in [1+i,\ T+i]} y_j.$$
Let $S(i)$ denote the value of the inner sum for a fixed $i$. Note that $S(i+1) = S(i) + y_{T+i+1} - y_{i+1}$. Then $S(0)$ can be computed in $O(n)$ time, and each of $S(1), S(2), \dots, S(n-1)$ can be computed in turn, each in $O(1)$ time using the recurrence. This gives an $O(n)$ algorithm for computing the original sum.
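A minimal numpy version of this $O(n)$ sliding-window summation (our own illustration; indices are taken mod $n$ as above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 256, 16
x, y = rng.standard_normal(n), rng.standard_normal(n)

# O(n) evaluation of sum_{(i+j) mod n in {1,...,T}} x_i y_j via a sliding window:
# S(i) = sum of y over the window [1+i, T+i] (mod n), updated in O(1) per step.
S = y[(1 + np.arange(T)) % n].sum()           # S(0)
total = 0.0
for i in range(n):
    total += x[-i % n] * S                    # x_{-i} pairs with the window [1+i, T+i]
    S += y[(T + i + 1) % n] - y[(i + 1) % n]  # slide the window: S(i) -> S(i+1)

# Brute-force check over the full n x n tensor.
I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
band = ((I + J) % n >= 1) & ((I + J) % n <= T)
assert np.isclose(total, np.outer(x, y)[band].sum())
```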
$m = 3$ Constructions. Let $T$ be arbitrary and set $B_T = \{(i,j,k) : i + j + k \in [0, T-1]\}$. Consider a sum of the form
$$\sum_{(i,j,k) \in B_T} x_i y_j z_k = \sum_{t \in [0,\, T-1]} \ \sum_{i+j+k = t} x_i y_j z_k.$$
Each of the inner sums occurs as an entry of the circular convolution $x * y * z$. A circular convolution can be computed in $O(n \log n)$ time using the Fast Fourier Transform, and so the sum over $B_T$ can be computed in $O(n \log n)$ time.

For $m = 3$ we also need a sparser construction. For this, define $C_T = \{(i,j,k) : i + j + k = 0,\ j - i \in [0, T-1]\}$. We are interested in the sum
$$\sum_{(i,j,k) \in C_T} x_i y_j z_k = \sum_{\substack{i+j+k=0 \\ j-i \in [0,\, T-1]}} x_i y_j z_k = \sum_k z_{-k} \sum_{\substack{i+j=k \\ j-i \in [0,\, T-1]}} x_i y_j.$$
As the algorithm for fast summation here is slightly more involved, we defer it to Lemma 2.5.

Constructing $p$-samples. In each case we begin by applying a random function $P_i : [n] \to [n]$ independently along each mode $i$. Then we take the indices that land in $A_T$, $B_T$, or $C_T$. For $A_T$ we take the set of indices $A'_T = \{(i,j) : (P_1(i), P_2(j)) \in A_T\}$. The probability that $(i,j) \in A'_T$ is $T/n$, since $(P_1(i), P_2(j))$ is uniformly random. For $(i,j) \ne (k,\ell)$, the values of $P_1(i) + P_2(j)$ and $P_1(k) + P_2(\ell)$ are independent. To see this, suppose without loss of generality that $j \ne \ell$. Then conditioned on $P_1(i)$, $P_2(j)$, and $P_1(k)$, the value $P_2(\ell)$ is uniform over $[n]$. Therefore the events $(P_1(i), P_2(j)) \in A_T$ and $(P_1(k), P_2(\ell)) \in A_T$ are independent, and so $\Pr((k,\ell) \in A'_T \mid (i,j) \in A'_T) = T/n$.

For $B_T$, precisely the same construction and argument apply. $C_T$ requires a bit more work. This time we choose $P_1, P_2, P_3$ to be random permutations. Define $P$ by $P((i,j,k)) = (P_1(i), P_2(j), P_3(k))$. As before, for any $(i,j,k)$, the triple $(P_1(i), P_2(j), P_3(k))$ is uniformly random, so $(P_1(i), P_2(j), P_3(k)) \in C_T$ with probability $T/n^2$.

Now consider two triples $u_1 = (i_1, j_1, k_1)$ and $u_2 = (i_2, j_2, k_2)$ with $u_1 \ne u_2$. Suppose first that $u_1$ and $u_2$ differ in only a single coordinate. Then since $P_1, P_2, P_3$ are permutations, $\Pr(P(u_1) \in C_T \mid P(u_2) \in C_T) = 0$, since $C_T$ intersects each single-mode fiber in at most one coordinate.

Next we consider the case where $u_1$ and $u_2$ differ in precisely two coordinates. We start with the case where $u_1$ and $u_2$ agree in the third coordinate. Without loss of generality, we assume that $u_1 = (0, 0, 0)$ and $u_2 = (1, 1, 0)$. Then $\Pr(P(u_1) \in C_T \text{ and } P(u_2) \in C_T)$ is the probability that the following events all occur:
1. $P_1(0) + P_2(0) = -P_3(0)$
2. $P_2(0) - P_1(0) \in [0, T-1]$
3. $P_1(1) + P_2(1) = -P_3(0)$
4. $P_2(1) - P_1(1) \in [0, T-1]$

For any fixed value of $P_3(0)$, there are exactly $T$ pairs $(P_1(0), P_2(0))$ that satisfy the first two conditions, and similarly there are $T$ pairs $(P_1(1), P_2(1))$ that satisfy the last two. There are $T(T-1)$ ways to choose the two pairs, so the probability that $P_1$ and $P_2$ give two such pairs is $T(T-1) \cdot \frac{1}{n} \cdot \frac{1}{n-1} \cdot \frac{1}{n} \cdot \frac{1}{n-1}$. This implies that
$$\Pr(P(u_2) \in C_T \mid P(u_1) \in C_T) = \frac{T-1}{(n-1)^2}.$$
A similar calculation gives the same probability when $u_1$ and $u_2$ agree on precisely the first or second coordinate.

Finally, consider the case where $u_1$ and $u_2$ differ on all three coordinates. Then $P(u_1)$ is uniform over all possible triples, and conditioned on $P(u_1)$, $P(u_2)$ is uniform over all triples that differ from $P(u_1)$ in all coordinates. There are $(n-1)^3$ such triples, and at most $Tn$ of these are in $C_T$. So
$$\Pr(P(u_2) \in C_T \mid P(u_1) \in C_T) \le \frac{Tn}{(n-1)^3}.$$

For two modes, we have constructed $p$-samples with $p = T/n$. This is sufficient to give a $p$-sample for all sampling probabilities down to $1/n$. For sampling probabilities smaller than $1/n$, the size of the sample is $O(n)$ with high probability, and so it admits fast summation by direct computation. For three modes, our two constructions give $p$-samples down to sampling probability $1/n^2$. For smaller $p$, we can again compute the desired sum explicitly in $O(n)$ time.

We now verify that $C_T$ indeed admits fast summation.

Lemma 2.5. Let $T$ be a positive integer, and let $x, y, z \in \mathbb{R}^n$. Let $C_T = \{(i,j,k) : i + j + k = 0 \text{ and } j - i \in \{0, \dots, T-1\}\}$, where the arithmetic is mod $n$. There is an algorithm that computes
$$\sum_{(i,j,k) \in C_T} x_i y_j z_k$$
in $O(n \log^2 T)$ time. For convenience, we zero-index the vectors.

Proof. We first rewrite the sum as
$$\sum_{(i,j,k) \in C_T} x_i y_j z_k = \sum_k z_{-k} \sum_{\substack{(i,j):\, i+j=k \\ j-i \in \{0,\dots,T-1\}}} x_i y_j = \sum_k z_{-k} \sum_{i :\, 2i \in \{k-T+1, \dots, k\}} x_i y_{k-i}.$$
Note that $\{i \in \mathbb{Z}/n : 2i \in \{k-T+1, \dots, k\}\}$ is a union of two intervals $I_k$ and $J_k$ in $\mathbb{Z}/n$. We now split the inner sum over $I_k$ and $J_k$. We also split the outer sum over even and odd values of $k$, so that the intervals shift by one with each term. This gives four sums to compute, each of which is similar. For the first sum, we wish to compute $\sum_{k :\, 0 \le 2k \le n-1} z_{-2k} \sum_{i \in I_{2k}} x_i y_{2k-i}$.
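To close, a small numpy check of the $B_T$ fast summation used above (our own illustration): the triple circular convolution computed with the FFT matches a brute-force sum over $B_T$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 48, 5
x, y, z = (rng.standard_normal(n) for _ in range(3))

# conv[t] = sum_{i+j+k = t (mod n)} x_i y_j z_k, via one FFT per vector.
conv = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y) * np.fft.fft(z)).real
fast = conv[:T].sum()                 # sum over B_T = {(i,j,k) : (i+j+k) mod n < T}

# Brute force over all n^3 entries.
slow = sum(x[i] * y[j] * z[k]
           for i in range(n) for j in range(n) for k in range(n)
           if (i + j + k) % n < T)
assert np.isclose(fast, slow)
```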