# Fast Sampling-Based Sketches for Tensors

William Swartworth 1, David Woodruff 1

1Carnegie Mellon University. Correspondence to: William Swartworth, David P. Woodruff. Research of William Swartworth and David P. Woodruff was supported by a Simons Investigator Award, a Google Faculty Award, and NSF Grant No. CCF-2335412. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We introduce a new approach for applying sampling-based sketches to two- and three-mode tensors. We illustrate our technique by constructing sketches for the classical problems of $\ell_0$ sampling and producing $\ell_1$ embeddings. In both settings we achieve sketches that can be applied to a rank-one tensor in $(\mathbb{R}^d)^{\otimes q}$ (for $q = 2, 3$) in time scaling with $d$ rather than $d^2$ or $d^3$. Our main idea is a particular sampling construction based on fast convolution, which allows us to quickly compute sums over sufficiently random subsets of tensor entries.

1. Introduction

In the modern era of enormous data sets, space is often at a premium, and one would like algorithms that either avoid storing all available data, or that compress existing data. A common and widely applied strategy is sketching. Given a vector $x \in \mathbb{R}^n$ consisting of the relevant data, a (linear) sketch of $x$ is given by $Sx$, where $S$ is a linear map down to a dimension much smaller than $n$. Typically the goal is to design $S$ so that some useful statistic of $x$ can be computed from the sketched vector $Sx$, even though most of the information about $x$ has been discarded.

Linear sketches are particularly useful in the context of streaming algorithms, since linear updates to $x$ can be translated to the sketch, simply by sketching the vector of updates and adding it to the previous value of the sketch. Sketches have also found important applications in speeding up linear-algebraic (and related) computations (Woodruff et al., 2014). Here, the idea is to first apply a sketch to reduce the dimensionality of the problem, while approximately preserving a quantity of interest (e.g., a regression solution). Then standard algorithms can be applied to the smaller problem, resulting in improved runtimes.

A lot of work on sketching has focused on efficiently applying sketches to structured data. For example, if the underlying data is sparse, one might hope for a sketch that can be applied in input-sparsity time. A different type of structure that has been widely studied in this context is tensor structure. A line of work has been devoted to developing fast tensor sketches (Pham & Pagh, 2013; Ahle et al., 2020; Ahle & Knudsen, 2019; Meister et al., 2019), which are sketches that can be applied quickly to low-rank tensors. A rank-one tensor in $\mathbb{R}^n \otimes \mathbb{R}^n$, for example (i.e., a rank-one matrix), requires only $O(n)$ parameters to specify. However, a naive sketch might require $O(n^2)$ time to apply if each entry of the tensor must be calculated to form the sketch. The goal with tensor sketches is to do better than expanding into vector form prior to sketching.

Much of the work on sketching tensors has focused on $\ell_2$-norm settings. The most studied example is Johnson-Lindenstrauss (JL) embeddings, which have been studied extensively for tensors. For JL sketches, the ideal sketch is a dense Gaussian sketch. However, such sketches are slow to apply, and so the main interest is in approximating the properties of a Gaussian sketch with a sketch that can be applied faster.
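As a concrete illustration of the linearity property that makes sketches stream-friendly, here is a minimal numpy sketch (our own example, not code from the paper): accumulating the sketches of individual updates yields exactly the sketch of the final vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 64
S = rng.standard_normal((m, n))       # a dense Gaussian sketching matrix

x = np.zeros(n)
sk = np.zeros(m)
for _ in range(1000):                 # stream of point updates (i, delta)
    i = rng.integers(n)
    delta = rng.standard_normal()
    x[i] += delta
    sk += delta * S[:, i]             # sketch the update, add to the running sketch

assert np.allclose(sk, S @ x)         # linearity: sum of sketched updates = sketch of x
```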
We study a completely different type of sampling-based sketch, which is embodied by the $\ell_0$ sampling problem. The goal here is to observe a linear sketch of $x$, and then to (nearly) uniformly sample an element from the support of $x$. A dense Gaussian sketch is completely useless here, as we would like a row of our sketch to single out an element of $\mathrm{supp}(x)$. To allow for this, the standard idea is to take a sketch which performs sparse samples at various scales. We ask:

Are there sampling-based sketches which can be applied quickly to rank-one tensors?

For tensors with two and three modes (which cover many tensors of practical importance), we provide a new approach for constructing sampling-based sketches in the tensor setting. To illustrate our approach, we focus on two fundamental problems: $\ell_0$ sampling and $\ell_1$ embeddings. We recall the setup for these problems.

$\ell_0$ Sampling. For the $\ell_0$ sampling problem, one would like to construct a distribution over sketching matrices $S$ so that observing $Sx$ allows us to return a (nearly) uniformly random element from the support of $x$. We allow constant distortion and failure probability $\delta$: conditioned on not failing (which must occur with probability at least $1 - \delta$), we should have that for all $i$ in the support of $x$, the sampling algorithm outputs $i$ with probability in $[c_1/|\mathrm{supp}(x)|,\ c_2/|\mathrm{supp}(x)|]$.

$\ell_1$ Embeddings. For the $\ell_1$ embedding problem, the goal is to construct a sketch $S$ such that for any $x \in \mathbb{R}^n$ we have $c_1 \|x\|_1 \le \|Sx\|_1 \le c_2 \|x\|_1$ for absolute constants $0 < c_1 < c_2$.

1.1. Our Results

Our main idea is a way to subsample a tensor at a given sampling rate $p$, and then to sum over the resulting sample. Specifically, we introduce a sampling primitive that we call a $p$-sample, which samples each element with probability approximately $p$, and does so in a nearly pairwise-independent manner. As we show, this is sufficient for constructing both $\ell_0$-samplers and $\ell_1$ embeddings. Constructing $p$-samples is straightforward for tensors that are given entrywise: one can simply sample each entry independently with probability $p$. Our key novelty is in showing that it is possible to construct $p$-samples for two- and three-mode tensors which, for rank-one tensors, can be summed over in nearly linear time. We discuss the ideas behind these constructions below.

Given our sampling primitive, we show how to construct $\ell_0$ samplers and $\ell_1$ embeddings. For the $\ell_0$ sampling problem we show the following result:

Theorem 1.1. For $q = 2, 3$ there is a linear sketch of $X \in \mathbb{R}^{n^q}$ with sketching dimension $m = O(\log\frac{1}{\delta}\,\log^2 n\,(\log\log n + \log\frac{1}{\delta}))$, and a sampling algorithm that succeeds with probability $1 - \delta$ and, conditioned on succeeding, outputs an index $i \in \mathrm{supp}(X)$. Conditioned on succeeding, the probability that the algorithm outputs $i \in \mathrm{supp}(X)$ is in $[\frac{0.5}{|\mathrm{supp}(X)|}, \frac{2}{|\mathrm{supp}(X)|}]$. The algorithm also returns the value of $X_i$. Moreover, the entries of the sketching matrix can be taken to be in $\{0, +1, -1\}$. For $q = 2$, our sketch can be applied to rank-one tensors in $O(mn)$ time, and for $q = 3$ it can be applied in $O(mn \log^2 n)$ time.

For $\ell_1$ embeddings, we show the following:

Theorem 1.2. There is an $O(1)$-distortion $\ell_1$ embedding sketch $S : \mathbb{R}^n \to \mathbb{R}^m$ with sketching dimension $m = O(\log^4 n + \log^2(1/\delta)\log n)$ that satisfies $\|Sx\|_1 \le c_2 \|x\|_1$ with constant probability, and satisfies $\|Sx\|_1 \ge c_1 \|x\|_1$ with probability at least $1 - \delta$. Moreover, our sketch can be applied to rank-one tensors in $\mathbb{R}^k \otimes \mathbb{R}^k$ in $O(mk)$ time, and to rank-one tensors in $\mathbb{R}^k \otimes \mathbb{R}^k \otimes \mathbb{R}^k$ in $O(mk \log^2 k)$ time.
Our main novelty here lies in the $p$-sample construction, which is applied in a similar way as for our $\ell_0$-sampler, although the details are more complicated. We therefore defer the proof of this result to the appendix.

Regression. We note that this error guarantee (no contraction with high probability, and no dilation with constant probability) is known to be sufficient to construct algorithms for solving $\ell_1$ regression. In particular, by setting $\delta = \exp(-d)$, a standard net argument, given in (Clarkson & Woodruff, 2014) for example, shows that no contraction holds with high probability over a $d$-dimensional subspace. As long as no dilation holds for the solution vector, it is well known that this yields a dimension reduction for $\ell_1$ regression. Thus our $\ell_1$ sketch can be used to speed up sketching for an $\ell_1$ regression problem of the form $\min_x \|Ax - b\|_1$, where $A$ has $d$ columns, each of which has low-rank structure.

Finally, while we choose to illustrate our approach on the problems of $\ell_0$ sampling and $\ell_1$ embeddings, we believe that this technique could be applied more generally whenever there is a known sketch that is built from random subsampling followed by taking random linear combinations within the sample.

1.2. Techniques

The main idea behind both $\ell_0$-sampling and $\ell_1$-embeddings is to subsample coordinates at various scales in order to isolate some coordinates of $x$. As a warmup, we begin by describing $\ell_0$ sampling. Then we describe how our techniques extend to constructing $\ell_1$ embeddings. We note that our basic approach to constructing $\ell_0$ samplers and $\ell_1$ embeddings is not novel. However, our ability to quickly sketch rank-one tensors is.

$\ell_0$ Sampling. In the general form of $\ell_0$ sampling, we are given a vector (or in our case a tensor) $X$, and the goal is to design a sketch that allows us to sample an entry of $X$ nearly uniformly from the support of $X$. The idea is to take a random sample $S$ of $X$'s entries and then to store a random linear combination of the values in $S$ as an entry of the sketch. If $S$ intersects $\mathrm{supp}(X)$ in a single element, then this allows us to recover a value in $\mathrm{supp}(X)$. If $\mathrm{supp}(X)$ has size $k$, then we would like $S$ to sample each value with probability roughly $1/k$ in order to have a good probability of isolating a single element. Since $k$ is initially unknown to us, the idea is to perform the sampling procedure at each of $\log n$ sampling levels $1, \frac{1}{2}, \frac{1}{4}, \dots, \frac{1}{n}$.

The challenge for us is to design samples $S$ that allow for fast sketching on rank-one tensors, e.g., tensors of the form $x \otimes y$. As we later show, the step of computing a random linear combination of the values in $S$ can be achieved simply by randomly flipping the signs along each mode. The harder part is designing $S$ so that we can sum the resulting tensor over $S$ in roughly $O(n)$ time. In other words, we need $S$ to be such that we can quickly compute $\sum_{(i,j) \in S} x_i y_j$. If $S$ just samples each entry of $X$ i.i.d. with probability $p$, then it is not clear how to do better than computing the sum term by term, which would require $\Omega(pn^2)$ time in our example. The next most natural thing to try is to take $S = S_1 \times S_2$ to be a random rectangle, since rectangles allow the above sum to be computed in $O(n)$ time. To achieve sampling probability $p$, one could take a rectangle of dimension $\sqrt{p}\,d \times \sqrt{p}\,d$, where the subsets of indices $S_1$ and $S_2$ corresponding to each dimension are chosen randomly.
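A rectangle admits $O(d)$ summation because the sum factors across the two modes. A minimal numpy check of this (our own illustration; the variable names are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512
x, y = rng.standard_normal(d), rng.standard_normal(d)

p = 1 / 64                                   # target per-entry sampling probability
side = int(np.sqrt(p) * d)                   # sqrt(p) * d indices per mode
S1 = rng.choice(d, size=side, replace=False)
S2 = rng.choice(d, size=side, replace=False)

# O(d) evaluation: the sum over the rectangle S1 x S2 factors across modes.
fast = x[S1].sum() * y[S2].sum()

# Brute force over the d x d rank-one tensor for comparison.
slow = np.outer(x, y)[np.ix_(S1, S2)].sum()
assert np.isclose(fast, slow)
```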
This construction succeeds in sampling each entry of $X$ with probability $p$; however, it is not sufficient for isolating a single entry of $\mathrm{supp}(X)$ with good probability. For example, $\mathrm{supp}(X)$ could consist of a single row of $X$. If we set $p = 1/d$ then $S$ has a good probability of sampling some entry from $\mathrm{supp}(X)$. However, when this occurs, $S$ nearly always contains more than one such element. Indeed, $S_2$ would typically contain around $\sqrt{d}$ elements of that row, and in this situation it is not possible to sample a single entry from a row of $X$.

One could try to fix this by randomizing the sampling scale on each side of the rectangle for a fixed sampling probability. In the case of $\ell_0$ sampling, this would at least result in worse logarithmic factors in the sketching dimension than what we achieve. In the case of $\ell_1$ embeddings, it is unclear how to obtain a constant-factor embedding with this approach. For instance, in the two-mode case, a given sampling probability $p$ can be realized by $O(\log d)$ choices of scales for the two rectangle dimensions. If a given level set of $X$ has size $1/p$ and is distributed uniformly throughout $X$, then all rectangles with sampling probability $p$ will pick out single elements of that level. On the other hand, if the level is distributed along a single one-dimensional fiber, then it might be the case that only one of the rectangles is good for that level. This suggests that the distortion would likely need to be $\Omega(\log d)$ on some inputs. It does not seem clear how one could arrange for an $O(1)$-distortion $\ell_1$ embedding that handles these situations simultaneously.

Surprisingly, at least for two- and three-mode tensors, it is possible to construct samples $S$ that have better sampling properties than rectangles, but which can still be summed over in nearly linear time. For brevity, we refer to the sampling that we need as a $p$-sample when it samples each index with probability (approximately) $p$. The idea is to design $S$ in such a way that we can employ algorithms for fast convolution using the Fast Fourier Transform. For example, in the two-mode case, one can compute
$$\sum_{i+j \in [0,\, T-1]} x_i y_j$$
for a constant $T$ (chosen to give the appropriate sampling probability) in $O(n \log n)$ time, by first calculating the convolution $x * y$ and then summing over the indices in $[0, T-1]$. This only gives a sum over a fixed set, but by randomly permuting the indices along each mode and choosing $T$ appropriately, this allows us to achieve sampling probabilities down to $p = 1/n$. Smaller sampling probabilities result in samples of size $O(n)$, and so the sum can just be calculated explicitly. (As it turns out, in the two-mode case, the runtime can be improved to $O(n)$ by a simple optimization.)

The three-mode case is somewhat more complicated. A similar trick involving a convolution of three vectors allows us to construct $p$-samples with fast summation down to sampling probability $1/n$, and sampling probabilities below $1/n^2$ allow for direct computation. What about sampling probabilities between $1/n^2$ and $1/n$? Here, we show how to compute sums of the form
$$\sum_{\substack{i+j+k=0 \\ j-i \in [0,\, T-1]}} x_i y_j z_k$$
in $O(n \log^2 T)$ time. The rough approach is to show that we can reduce computing this quantity to $n/T$ instances of multiplication by the top corner ($\{(i,j) : j \ge i\}$) of a roughly $T \times T$ Toeplitz matrix. Each top-corner matrix can be decomposed, up to some zero-padding, into a sum of Toeplitz-like matrices across $O(\log T)$ levels of recursion, where each level admits $O(T \log T)$ total multiplication time.
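A quick numpy sanity check of the two-mode convolution trick above (our own minimal sketch): summing the first $T$ entries of the circular convolution $x * y$ agrees with the brute-force sum over the band.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 256, 8
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Circular convolution via FFT: conv[t] = sum_{i+j = t (mod n)} x_i y_j.
conv = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real
fast = conv[:T].sum()                    # sum over {(i,j) : (i+j) mod n in [0, T-1]}

# Brute force over all n^2 entries of the rank-one tensor x ⊗ y.
I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
band = ((I + J) % n) < T
slow = np.outer(x, y)[band].sum()
assert np.isclose(fast, slow)
```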
$\ell_1$ Embeddings. For $\ell_1$ embeddings, our approach roughly follows the sketch introduced by (Verbin & Zhang, 2012) for producing constant-distortion $\ell_1$ embeddings. The main idea is to think of a vector $x \in \mathbb{R}^n$ as decomposed into approximate level sets, with the size of entries in each level set decreasing exponentially. For instance, when $\|x\|_1 = 1$, the level set $L_i$ could be taken to contain the entries with magnitude in $[q^{i+1}, q^i]$ for some $q < 1$. For each level set, our sketch has an associated level with sampling probability larger than $q^i$, which is designed to capture most of the mass of $L_i$. Since we would like our sampling level to capture the mass of $L_i$ with high probability, we need to oversample the entries of $L_i$ substantially. The usual approach here is to choose a fairly large $q$, and then hash the sampled values into separate buckets so as to minimize cancellations. This is typically done with CountSketch; however, it is not clear how to efficiently compose a CountSketch with our fast sample constructions. Therefore we take a slightly different approach. Instead of hashing each sampling level into $T$ buckets, we simply take $T$ independent samples, each with sampling probability $q/T$. The analysis is quite similar to the analysis using CountSketch, but we retain the ability to quickly sketch rank-one tensors. However, we do pay a price: our sketching time scales linearly with the size of the sketch. In contrast, for vectors, CountSketch constructions allow for sketching in time that scales as $O(n \cdot (\text{number of sampling levels}))$. However, these sketches are not efficient for rank-one tensors, since $n$ would be replaced by $n^2$ or $n^3$.

To avoid too much dilation, the rough idea is that each sampling level picks up roughly the right amount of mass from its corresponding $L_i$ with good probability. Also, large levels have a small number of elements, and so they are unlikely to be picked up by a smaller sampling level. The main issue is in bounding the contribution of the $L_i$'s containing many small values. To avoid this, (Verbin & Zhang, 2012) introduced the idea of hashing each sampling level into buckets with random signs, in order to induce cancellation among the contributions from such $L_i$'s. We employ the same technique here. However, in order to make applying the signs efficient, our approach is to first apply random signs along the modes of the tensor, and then to sum over appropriate $p$-samples in order to compute a "bucket". For a constant number of modes, this turns out to be enough randomness to induce cancellation.

This outline for constructing $\ell_1$ embeddings has appeared in the literature several times. Our main novelty lies in constructing samples that admit fast linear combinations for rank-one tensors.

Finally, one might wonder if our techniques are really necessary. One could hope to design fast sketches using Kronecker-structured sketches of the form $S_1 \otimes S_2$, which are easy to apply to a rank-one tensor $x \otimes y$: one simply computes $S_1 x \otimes S_2 y$. It does not appear that a Kronecker-structured sketch can match our bounds. Indeed, by a lower bound given in (Li et al., 2021), an $O(1)$-distortion $\ell_1$ embedding for a single vector $x \in \mathbb{R}^n$ requires at least $\mathrm{poly}(n)$ sketching dimension for Kronecker-structured sketches. In contrast, our sketch can still be applied to rank-one tensors in near-linear time, yet requires only $\mathrm{poly}\log n$ space to embed $x$.
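The Kronecker identity used above, $(S_1 \otimes S_2)(x \otimes y) = (S_1 x) \otimes (S_2 y)$, is easy to check numerically (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 64, 8
S1, S2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Applying S1 ⊗ S2 to the flattened rank-one tensor x ⊗ y the naive way ...
lhs = np.kron(S1, S2) @ np.kron(x, y)    # O(m^2 n^2) work
# ... agrees with the factored O(mn) computation (S1 x) ⊗ (S2 y).
rhs = np.kron(S1 @ x, S2 @ y)
assert np.allclose(lhs, rhs)
```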
One could also ask whether Khatri-Rao sketches (i.e., sketches in which each row is a rank-one tensor) could be applied to match our sketching dimension. While we do not provide a lower bound against Khatri-Rao measurements, we also do not see how to match our bounds for either $\ell_0$ sampling or $\ell_1$ embeddings. We leave the existence of such a lower bound as an interesting open question.

For larger $q$, we also note that our $\ell_1$ embedding procedure could be applied to triples of modes at a time, giving a tree construction similar to (Ahle et al., 2020). Unfortunately, however, the distortion grows like $2^{\log_3 q}$, and the success probability becomes $c^q$ by a union bound. Whether these parameters can be improved while allowing for fast sketching is an interesting problem.

1.3. Additional Open Questions

An interesting open question is whether a similar set of ideas can be applied to tensors with a larger number of modes. For four modes, the most natural approach would be to attempt to develop a fast summation over (a subset of) a codimension-2 subspace of the tensor entries. For example, we might wish to calculate sums of the form
$$\sum_{(i,j,k,\ell) \in Q} w_i x_j y_k z_\ell$$
where $Q$ is a set of entries satisfying two linear constraints. Unfortunately, we are not aware of a way to calculate such a sum faster than $O(n^2)$ time. It would be interesting to either give a fast summation algorithm, or to find a new technique that gets around this issue.

1.4. Related Work

A substantial literature has been devoted to obtaining fast sketches for tensors. Tensor sketching was initiated in (Pham & Pagh, 2013), where it was shown that CountSketch can be applied quickly to rank-one tensors using the Fast Fourier Transform. More recently, $\ell_2$ embeddings were constructed by (Ahle & Knudsen, 2019) and improved in (Ahle et al., 2020), with applications to sketching polynomial kernels. In earlier work, (Indyk & McGregor, 2008) gives a Kronecker-structured sketch for the $\ell_1$ norm in the context of independence testing, using a variation on the previously known Cauchy sketch (Indyk, 2006). We note that this sketch requires taking a median in order to recover the $\ell_1$ norm, and thus does not give an $\ell_1$ embedding, which may be more suitable for optimization applications. (Indyk, 2006) studies the problem of $\ell_1$ estimation using Cauchy sketches. Later, (Verbin & Zhang, 2012) gave a construction of an oblivious $\ell_1$ embedding; our general approach for constructing $\ell_1$ embeddings is largely based on theirs. This approach was generalized in (Clarkson & Woodruff, 2014) to M-estimators, and in particular was applied to construct $\ell_1$ (and more general) subspace embeddings. These bounds for $\ell_1$ embeddings were recently improved in (Munteanu et al., 2021). (Li et al., 2021) expanded on this line of work and considered independence testing for higher-order tensors. They also give a $\mathrm{poly}(d)$ lower bound for constructing $\ell_1$ embeddings of a single vector in $\mathbb{R}^d$ using Kronecker-structured measurements.

1.5. Notation and Preliminaries

We use the notation $[N]$ to refer to the set $\{1, \dots, N\}$. For two vectors $x$ and $y$, the notation $x \otimes y$ refers to the Kronecker product of $x$ and $y$. We use the notation $x * y$ for the circular convolution of $x$ and $y$; that is, $(x * y)_k = \sum_{i+j=k} x_i y_j$, where the sum $i + j$ is interpreted mod $N$. In such situations it is sometimes convenient to index vectors starting at 0, so we will do this occasionally. However, unless otherwise stated, we index starting at 1.
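For intuition on the median-based Cauchy sketch referenced in the related work above, here is a rough numerical sketch (this is the classical construction from (Indyk, 2006), not this paper's sketch; the parameter choices are our own): each coordinate of $Sx$ is distributed as $\|x\|_1$ times a standard Cauchy variable, so the median of $|Sx|$ estimates $\|x\|_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1000, 400
x = rng.standard_normal(n)

# Each row of S is i.i.d. standard Cauchy, so (Sx)_i ~ ||x||_1 * Cauchy.
S = rng.standard_cauchy((m, n))
est = np.median(np.abs(S @ x))        # median of |Cauchy| is 1, so this ~ ||x||_1
assert abs(est - np.abs(x).sum()) < 0.2 * np.abs(x).sum()
```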
We typically identify a $q$-mode tensor with a vector in $\mathbb{R}^{n^q}$, assuming that the dimension along each tensor mode is $n$. We will typically make this assumption for the sake of convenience. We say that a tensor in $\mathbb{R}^{n^q}$ has rank one if it can be expressed as $x_1 \otimes \dots \otimes x_q$ for some vectors $x_1, \dots, x_q$. We use $c$ throughout to refer to an absolute constant, which might be different between uses (even within an equation). The notation $\tilde{O}(f)$ means $O(f \log^c f)$ for some constant $c$.

$\ell_0$ sampling. $\ell_0$ sampling has received extensive attention. (Cormode & Firmani, 2014) surveys the standard recipe for constructing $\ell_0$-samplers, which we roughly follow here. (Woodruff & Zhang, 2018) considers solving the $\ell_0$-sampling problem in a distributed setting for a product of two matrices held on different servers; however, this differs from our setting, as their communication scheme is not a linear sketch.

2. p-sample Constructions

We first define our main sampling primitive, which we call a $p$-sample. This can be viewed as a slightly weaker version of a pairwise-independent sample.

Definition 2.1. Let $T$ be an arbitrary finite set. We say that a random subset $S$ of $T$ is a $p$-sample for $T$ if
1. For all entries $i \in T$, $p/2 \le \Pr(i \in S) \le p$.
2. For all $j \ne i$ in $T$, $\Pr(j \in S \mid i \in S) \le 2p$.

Definition 2.2. Fix $n$ and $m$, and suppose that $S$ is a subset of $[n]^m = [n] \times \dots \times [n]$. Let $x^{(1)}, \dots, x^{(m)} \in \mathbb{R}^n$ be arbitrary vectors. We say that $S$ admits fast summation if there is an algorithm which computes
$$\sum_{(i_1, \dots, i_m) \in S} x^{(1)}_{i_1} x^{(2)}_{i_2} \cdots x^{(m)}_{i_m}$$
in $\tilde{O}(mn)$ time.

To perform one of our constructions, we will need a particular type of Toeplitz-like fast matrix multiplication.

Lemma 2.3. Let $A \in \mathbb{R}^{n \times n}$ be a matrix where $A_{i,j} = A_{i+1, j+2}$ for all $0 \le i < n-1$ and $0 \le j < n-2$. Let $B$ be defined by $B_{i,j} = A_{i,j}$ if $j \ge i$ and $B_{i,j} = 0$ otherwise. Given $v \in \mathbb{R}^n$, the matrix product $Av$ can be computed in $O(n \log n)$ time, and the product $Bv$ can be computed in $O(n \log^2 n)$ time.

Proof. To see that $A$ admits fast multiplication, note that the rows of $A$ coincide with the even rows of a Toeplitz matrix of size $2n \times 2n$. Since multiplication by a Toeplitz matrix can be carried out in $O(n \log n)$ time (Kailath & Sayed, 1999), the same holds for $A$.

To get a fast algorithm for $B$, we decompose it into a sum of matrices, each with the same structure as $A$ up to some zero-padding. We call these matrices $B_1$, $B_2$, and $B_3$. Take $B_1$ to be $B$ but with all entries outside of $\{(i,j) : 1 \le i \le \lfloor n/2 \rfloor,\ \lceil n/2 \rceil \le j \le n\}$ replaced with 0. Visually, the support of $B$ is a right triangle, and the support of $B_1$ corresponds to the largest square inscribed in that triangle. Let $B_2$ and $B_3$ correspond to the two triangles that are left after removing the square. That is, $B_2$ has support contained in $\{(i,j) : i > \lfloor n/2 \rfloor\}$ and $B_3$ has support contained in $\{(i,j) : j < \lceil n/2 \rceil\}$. By construction, $B = B_1 + B_2 + B_3$. Then $B_1$ has the structure of $A$ from the first part of the lemma, so it admits $O(n \log n)$-time multiplication. After removing the zero-padding, both $B_2$ and $B_3$ have the same structure as $B$ but with half the dimension. Thus we have a runtime recurrence of the form $T(n) = 2T(\lceil n/2 \rceil) + O(n \log n)$, which gives an overall runtime of $O(n \log^2 n)$.
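The first part of Lemma 2.3 is easy to check numerically. The following self-contained numpy sketch (our own illustration; the helper name is hypothetical) multiplies such a matrix $A$ by reading off the even rows of a $2n \times 2n$ Toeplitz matrix, using the standard circulant-embedding FFT trick:

```python
import numpy as np

def toeplitz_matvec(col, row, v):
    """O(n log n) product of an n x n Toeplitz matrix (first column col,
    first row row, with col[0] == row[0]) with v, via circulant embedding."""
    n = len(col)
    c = np.concatenate([col, [0.0], row[:0:-1]])   # first column of a 2n circulant
    u = np.concatenate([v, np.zeros(n)])
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(u)).real[:n]

rng = np.random.default_rng(6)
n = 64
f = rng.standard_normal(4 * n)          # A_{i,j} will depend only on 2i - j

# A_{i,j} = f[2i - j + 2n] satisfies A_{i,j} = A_{i+1, j+2}, as in Lemma 2.3.
I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
A = f[2 * I - J + 2 * n]

# The rows of A are the even rows (first n columns) of the 2n x 2n Toeplitz
# matrix M_{r,c} = f[r - c + 2n], so Av can be read off of M @ [v; 0].
col = f[np.arange(2 * n) + 2 * n]       # M_{r,0}
row = f[2 * n - np.arange(2 * n)]       # M_{0,c}
v = rng.standard_normal(n)
Mv = toeplitz_matvec(col, row, np.concatenate([v, np.zeros(n)]))
assert np.allclose(Mv[0::2], A @ v)
```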
Theorem 2.4. For $m = 2, 3$, and for all $n$ and $p$, there exists a $p$-sample for $[n]^m$ that admits fast summation. Additionally:
1. When $m = 2$ the summation runs in $O(n)$ time.
2. When $m = 3$ the summation runs in $O(n \log^2 n)$ time.

Proof. We first show how to construct sets of indices that can be summed over quickly. We will then show how to use these sets to construct $p$-samples with fast summation.

$m = 2$ Constructions. Let $T$ be an arbitrary positive integer, and consider the set $A_T = \{(i,j) \in [n] \times [n] : i + j \in [T]\} \subseteq (\mathbb{Z}/n)^2$. (Note that we are treating all indices as values in $\mathbb{Z}/n$.) We will first show that $A_T$ admits $O(n)$ summation for all $T$. We would like to compute a sum of the form
$$\sum_{i+j \in [T]} x_i y_j = \sum_i x_{-i} \sum_{j :\, j - i \in [T]} y_j = \sum_i x_{-i} \sum_{j \in [1+i,\ T+i]} y_j.$$
Let $S(i)$ denote the value of the inner sum for a fixed $i$. Note that $S(i+1) = S(i) + y_{T+i+1} - y_{i+1}$. Then $S(0)$ can be computed in $O(n)$ time, and each of $S(1), S(2), \dots, S(n-1)$ can be computed in turn, each in $O(1)$ time using the recurrence. This gives an $O(n)$ algorithm for computing the original sum.
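A minimal numpy version of this $O(n)$ sliding-window summation (our own illustration; indices are taken mod $n$ as above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 256, 16
x, y = rng.standard_normal(n), rng.standard_normal(n)

# O(n) evaluation of sum_{(i+j) mod n in {1,...,T}} x_i y_j via a sliding window:
# S(i) = sum of y over the window [1+i, T+i] (mod n), updated in O(1) per step.
S = y[(1 + np.arange(T)) % n].sum()           # S(0)
total = 0.0
for i in range(n):
    total += x[-i % n] * S                    # x_{-i} pairs with the window [1+i, T+i]
    S += y[(T + i + 1) % n] - y[(i + 1) % n]  # slide the window: S(i) -> S(i+1)

# Brute-force check over the full n x n tensor.
I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
band = ((I + J) % n >= 1) & ((I + J) % n <= T)
assert np.isclose(total, np.outer(x, y)[band].sum())
```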
$m = 3$ Constructions. Let $T$ be arbitrary and set $B_T = \{(i,j,k) : i + j + k \in [0, T-1]\}$. Consider a sum of the form
$$\sum_{(i,j,k) \in B_T} x_i y_j z_k = \sum_{t \in [0,\, T-1]} \ \sum_{i+j+k = t} x_i y_j z_k.$$
Each of the inner sums occurs as an entry of the circular convolution $x * y * z$. A circular convolution can be computed in $O(n \log n)$ time using the Fast Fourier Transform, and so the sum over $B_T$ can be computed in $O(n \log n)$ time.

For $m = 3$ we also need a sparser construction. For this, define $C_T = \{(i,j,k) : i + j + k = 0,\ j - i \in [0, T-1]\}$. We are interested in the sum
$$\sum_{(i,j,k) \in C_T} x_i y_j z_k = \sum_{\substack{i+j+k=0 \\ j-i \in [0,\, T-1]}} x_i y_j z_k = \sum_k z_{-k} \sum_{\substack{i+j=k \\ j-i \in [0,\, T-1]}} x_i y_j.$$
As the algorithm for fast summation here is slightly more involved, we defer it to Lemma 2.5.

Constructing $p$-samples. In each case we begin by applying a random function $P_i : [n] \to [n]$ independently along each mode $i$. Then we take the indices that land in $A_T$, $B_T$, or $C_T$. For $A_T$ we take the set of indices $A'_T = \{(i,j) : (P_1(i), P_2(j)) \in A_T\}$. The probability that $(i,j) \in A'_T$ is $T/n$, since $(P_1(i), P_2(j))$ is uniformly random. For $(i,j) \ne (k,\ell)$, the values of $P_1(i) + P_2(j)$ and $P_1(k) + P_2(\ell)$ are independent. To see this, suppose without loss of generality that $j \ne \ell$. Then conditioned on $P_1(i)$, $P_2(j)$, and $P_1(k)$, the value $P_2(\ell)$ is uniform over $[n]$. Therefore the events $(P_1(i), P_2(j)) \in A_T$ and $(P_1(k), P_2(\ell)) \in A_T$ are independent, and so $\Pr((k,\ell) \in A'_T \mid (i,j) \in A'_T) = T/n$.

For $B_T$, precisely the same construction and argument apply. $C_T$ requires a bit more work. This time we choose $P_1, P_2, P_3$ to be random permutations. Define $P$ by $P((i,j,k)) = (P_1(i), P_2(j), P_3(k))$. As before, for any $(i,j,k)$, the triple $(P_1(i), P_2(j), P_3(k))$ is uniformly random, so $(P_1(i), P_2(j), P_3(k)) \in C_T$ with probability $T/n^2$.

Now consider two triples $u_1 = (i_1, j_1, k_1)$ and $u_2 = (i_2, j_2, k_2)$ with $u_1 \ne u_2$. Suppose first that $u_1$ and $u_2$ differ in only a single coordinate. Then since $P_1, P_2, P_3$ are permutations, $\Pr(P(u_1) \in C_T \mid P(u_2) \in C_T) = 0$, since $C_T$ intersects each single-mode fiber in at most one coordinate.

Next we consider the case where $u_1$ and $u_2$ differ in precisely two coordinates. We start with the case where $u_1$ and $u_2$ agree in the third coordinate. Without loss of generality, we assume that $u_1 = (0, 0, 0)$ and $u_2 = (1, 1, 0)$. Then $\Pr(P(u_1) \in C_T \text{ and } P(u_2) \in C_T)$ is the probability that the following events all occur:
1. $P_1(0) + P_2(0) = -P_3(0)$
2. $P_2(0) - P_1(0) \in [0, T-1]$
3. $P_1(1) + P_2(1) = -P_3(0)$
4. $P_2(1) - P_1(1) \in [0, T-1]$

For any fixed value of $P_3(0)$, there are exactly $T$ pairs $(P_1(0), P_2(0))$ that satisfy the first two conditions, and similarly there are $T$ pairs $(P_1(1), P_2(1))$ that satisfy the last two. There are $T(T-1)$ ways to choose the two pairs, so the probability that $P_1$ and $P_2$ give two such pairs is $T(T-1) \cdot \frac{1}{n} \cdot \frac{1}{n-1} \cdot \frac{1}{n} \cdot \frac{1}{n-1}$. This implies that
$$\Pr(P(u_2) \in C_T \mid P(u_1) \in C_T) = \frac{T-1}{(n-1)^2}.$$
A similar calculation gives the same probability when $u_1$ and $u_2$ agree on precisely the first or second coordinate.

Finally, consider the case where $u_1$ and $u_2$ differ on all three coordinates. Then $P(u_1)$ is uniform over all possible triples, and conditioned on $P(u_1)$, $P(u_2)$ is uniform over all triples that differ from $P(u_1)$ in all coordinates. There are $(n-1)^3$ such triples, and at most $Tn$ of these are in $C_T$. So
$$\Pr(P(u_2) \in C_T \mid P(u_1) \in C_T) \le \frac{Tn}{(n-1)^3}.$$

For two modes, we have constructed $p$-samples with $p = T/n$. This is sufficient to give a $p$-sample for all sampling probabilities down to $1/n$. For sampling probabilities smaller than $1/n$, the size of the sample is $O(n)$ with high probability, and so it admits fast summation by direct computation. For three modes, our two constructions give $p$-samples down to sampling probability $1/n^2$. For smaller $p$, we can again compute the desired sum explicitly in $O(n)$ time.

We now verify that $C_T$ indeed admits fast summation.

Lemma 2.5. Let $T$ be a positive integer, and let $x, y, z \in \mathbb{R}^n$. Let $C_T = \{(i,j,k) : i + j + k = 0 \text{ and } j - i \in \{0, \dots, T-1\}\}$, where the arithmetic is mod $n$. There is an algorithm that computes
$$\sum_{(i,j,k) \in C_T} x_i y_j z_k$$
in $O(n \log^2 T)$ time. For convenience, we zero-index the vectors.

Proof. We first rewrite the sum as
$$\sum_{(i,j,k) \in C_T} x_i y_j z_k = \sum_k z_{-k} \sum_{\substack{(i,j):\, i+j=k \\ j-i \in \{0,\dots,T-1\}}} x_i y_j = \sum_k z_{-k} \sum_{i :\, 2i \in \{k-T+1, \dots, k\}} x_i y_{k-i}.$$
Note that $\{i \in \mathbb{Z}/n : 2i \in \{k-T+1, \dots, k\}\}$ is a union of two intervals $I_k$ and $J_k$ in $\mathbb{Z}/n$. We now split the inner sum over $I_k$ and $J_k$. We also split the outer sum over even and odd values of $k$, so that the intervals shift by one with each term. This gives four sums to compute, each of which is similar. For the first sum, we wish to compute $\sum_{k :\, 0 \le 2k \le n-1} z_{-2k} \sum_{i \in I_{2k}} x_i y_{2k-i}$.
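To close, a small numpy check of the $B_T$ fast summation used above (our own illustration): the triple circular convolution computed with the FFT matches a brute-force sum over $B_T$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 48, 5
x, y, z = (rng.standard_normal(n) for _ in range(3))

# conv[t] = sum_{i+j+k = t (mod n)} x_i y_j z_k, via one FFT per vector.
conv = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y) * np.fft.fft(z)).real
fast = conv[:T].sum()                 # sum over B_T = {(i,j,k) : (i+j+k) mod n < T}

# Brute force over all n^3 entries.
slow = sum(x[i] * y[j] * z[k]
           for i in range(n) for j in range(n) for k in range(n)
           if (i + j + k) % n < T)
assert np.isclose(fast, slow)
```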