# Efficiently Learning Fourier Sparse Set Functions

Andisheh Amrollahi (ETH Zurich, Zurich, Switzerland, amrollaa@ethz.ch), Amir Zandieh (EPFL, Lausanne, Switzerland, amir.zandieh@epfl.ch), Michael Kapralov (EPFL, Lausanne, Switzerland, michael.kapralov@epfl.ch), Andreas Krause (ETH Zurich, Zurich, Switzerland, krausea@ethz.ch)

The first two authors contributed equally. Supported by ERC Starting Grant SUBLINEAR. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract: Learning set functions is a key challenge arising in many domains, ranging from sketching graphs to black-box optimization with discrete parameters. In this paper we consider the problem of efficiently learning set functions that are defined over a ground set of size $n$ and that are sparse (say $k$-sparse) in the Fourier domain. This is a broad class that includes graph and hypergraph cut functions, decision trees, and more. Our central contribution is the first algorithm that allows learning functions whose Fourier support only contains low degree (say degree $d = o(n)$) polynomials using $O(kd\log n)$ sample complexity and runtime $O(kn\log^2 k\log n\log d)$. This implies that sparse graphs with $k$ edges can, for the first time, be learned from $O(k\log n)$ observations of cut values and in time linear in the number of vertices. Our algorithm can also efficiently learn (sums of) decision trees of small depth. The algorithm exploits techniques from the sparse Fourier transform literature and is easily implementable. Lastly, we also develop an efficient robust version of our algorithm and prove $\ell_2/\ell_2$ approximation guarantees without any statistical assumptions on the noise.

1 Introduction

How can we learn the structure of a graph by observing the values of a small number of cuts? Can we learn a decision tree efficiently by observing its evaluation on a few samples? Both of these important applications are instances of the more general problem of learning set functions. Consider a set function which maps subsets of a ground set $V$ of size $n$ to real numbers, $x : 2^V \to \mathbb{R}$. Set functions that arise in applications often exhibit structure, which can be effectively captured in the Fourier (also called Walsh-Hadamard) basis. One commonly studied structure for set functions is Fourier sparsity [2]. A $k$-Fourier-sparse set function contains no more than $k$ nonzero Fourier coefficients. Natural examples of $k$-Fourier-sparse set functions are cut functions of graphs with $k$ edges and evaluations of decision trees of depth $d$ [7, 8, 12]. The cut function of a graph only contains polynomials of degree at most two in the Fourier basis, and more generally the cut function of a hypergraph of degree $d$ only contains polynomials of degree at most $d$ in the Fourier basis [12]. Intuitively, this means that these set functions can be written as sums of terms, where each term depends on at most $d$ elements of the ground set. Likewise, a decision tree of depth $d$ only contains polynomials of degree at most $d$ in the Fourier basis [7, 8]. Learning such functions has recently found applications in neural network hyper-parameter optimization [5]. Therefore, the family of Fourier sparse set functions whose Fourier support only contains low order terms is a natural and important class of functions to consider.

Related work. One approach to learning Fourier sparse functions uses Compressive Sensing (CS) methods [12]. Suppose we know that the Fourier transform $\hat{x}$ of our function is $k$-sparse, i.e.,
$|\mathrm{supp}(\hat{x})| \le k$, and that $\mathrm{supp}(\hat{x}) \subseteq \mathcal{P}$ for some known set $\mathcal{P}$ of size $p$. In [12] it is shown that recovery of $\hat{x}$ is possible (with high probability) by observing the value of $x$ on $O(k\log^4 p)$ subsets chosen independently and uniformly at random. They utilize results from [10, 13], which prove that picking $O(k\log^4 p)$ rows of the Walsh-Hadamard matrix independently and uniformly at random results in a matrix satisfying the Restricted Isometry Property (RIP), which is required for recovery. For the case of graphs, $p = \binom{n}{2} = O(n^2)$, and one can essentially learn the underlying graph with $O(k\log^4 n)$ samples. In fact this result can be further improved, and $O(k\log^2 k\log n)$ samples suffice [4]. Computationally, the CS approach may use matching pursuit, which takes $\Omega(kp)$ time and thus results in a runtime of $\Omega(kn^d)$ for $k$-Fourier-sparse functions of order $d$; this equals $\Omega(kn^2)$ for graphs, where $d = 2$. In [12], proximal methods are used to optimize the Lagrangian form of the $\ell_1$-norm minimization problem. Optimization is performed over $p$ variables, which results in $\Omega(n^2)$ runtime for graphs and $\Omega(n^d)$ time for the general order-$d$ sparse recovery case. Hence, these algorithms scale exponentially with $d$ and have at least quadratic dependence on $n$ even in the simple case of learning graph cut functions.

There is another line of work on this problem in the sparse Fourier transform literature. [11] provides a non-robust version of the sparse Walsh-Hadamard Transform (WHT). This algorithm makes restrictive assumptions on the signal, namely that the $k$ nonzero Fourier coefficients are chosen uniformly at random from the Fourier domain. This is a strong assumption that does not hold for cut functions or decision trees. This work is extended in [4] to a robust sparse WHT called SPRIGHT. In addition to the random uniform support assumption, [4] further presumes that the Fourier coefficients are finite valued and the noise is Gaussian. Furthermore, all existing sparse WHT algorithms are unable to exploit low-degree Fourier structure.

Our results. We build on techniques from the sparse Fourier transform literature [3, 6, 2] and develop an algorithm to compute the Walsh-Hadamard transform (WHT) of a $k$-Fourier-sparse signal whose Fourier support is constrained to low degree frequencies (low degree polynomials). For recovering frequencies of low degree, we utilize ideas related to compressive sensing over finite fields [1]. We show that if the frequencies present in the support of $\hat{x}$ are of low order, then there exists an algorithm that computes the WHT in $O(kn\log^2 k\log n\log d)$ time using $O(kd\log n)$ samples. As opposed to [11], we avoid distributional assumptions on the support by using hashing schemes. Our approach is the first to achieve a sample complexity of $O(kd\log n)$. Moreover, its running time scales linearly in $n$ and has no exponential dependence on $d$. For the important special case of graphs, where $d = 2$, our sample complexity is near-optimal at $O(k\log n)$ and our runtime is $O(kn\log^2 k\log n)$, which is strictly better than CS methods that take at least quadratic time in $n$. This allows us to learn sparse graphs with roughly 800 vertices in about 2 seconds, whereas previous methods [12] were constrained to roughly 100 vertices for similar runtimes. For the case where $\hat{x}$ is not exactly $k$-sparse, we provide novel robust algorithms that recover the $k$ dominant Fourier coefficients with provable $\ell_2/\ell_2$ approximation guarantees, using appropriate hashing schemes and a novel analysis.
We further develop a robust recovery algorithm that uses $O(kd\log n\log(d\log n))$ samples and runs in time $O\big(nk\log^3 k + nk\log^2 k\log n\log(d\log n)\log d\big)$.

2 Problem Statement

Here we define the problem of learning set functions. Consider a set function which maps subsets of a ground set $V = \{1,\dots,n\} = [n]$ of size $n$ to real numbers, $x : 2^V \to \mathbb{R}$. We assume oracle access to this function, that is, we can observe the function value $x(A)$ for any subset $A$ that we desire. The goal is to learn the function, that is, to be able to evaluate it for all subsets $B \subseteq V$. A problem which has received considerable interest is learning cut functions of sparse (in terms of edges) graphs [12]. Given a weighted undirected graph $G = (V, E, w)$, the cut function associated to $G$ is defined as $x(A) = \sum_{s\in A,\ t\in V\setminus A} w(s,t)$ for every $A \subseteq V$.

Note that we can equivalently represent each subset $A \subseteq V$ by a vector $t \in \mathbb{F}_2^n$ which is the indicator of the set $A$. Here $\mathbb{F}_2 = \{0,1\}$ denotes the finite field with 2 elements. Hence the set function can be viewed as $x : \mathbb{F}_2^n \to \mathbb{R}$. We denote the Walsh-Hadamard transform of $x : \mathbb{F}_2^n \to \mathbb{R}$ by $\hat{x} : \mathbb{F}_2^n \to \mathbb{R}$. It is defined as
$$\hat{x}_f = \frac{1}{2^n}\sum_{t\in\mathbb{F}_2^n} x_t\,(-1)^{\langle f,t\rangle},\qquad f\in\mathbb{F}_2^n.$$
The inner product $\langle f,t\rangle$ throughout the paper is performed modulo 2. The Fourier transform of the graph cut function is
$$\hat{x}_f = \begin{cases}\frac{1}{2}\sum_{\{s,t\}\in E} w(s,t) & \text{if } f = (0,\dots,0),\\ -\,w(s,t)/2 & \text{if } f_s = f_t = 1 \text{ and } f_i = 0 \text{ for all } i \ne s,t,\\ 0 & \text{otherwise.}\end{cases}$$
It is clear that the Fourier support of the cut function for a graph $G$ contains only $|E| + 1$ nonzero elements (and hence it is sparse). Furthermore, the nonzero Fourier coefficients correspond to frequencies with Hamming weight at most 2.

One of the classes of set functions that we consider is that of exactly low order Fourier sparse functions. Under this model we address the following problem:

Input: oracle access to $x : \mathbb{F}_2^n \to \mathbb{R}$ such that $\|\hat{x}\|_0 \le k$ and $|f| \le d$ for all $f \in \mathrm{supp}(\hat{x})$
Output: the nonzero coefficients of $\hat{x}$ and their corresponding frequencies (1)

where $|f|$ denotes the Hamming weight of $f$.

We also consider the robust version of problem (1) in which we only have access to noisy measurements of the input set function. We make no assumption about the noise, which can be chosen adversarially. Equivalently, one can think of a general set function whose spectrum is well approximated by a low order sparse function, which we refer to as the head. The head of $\hat{x}$ consists of the top $k$ Fourier coefficients $\hat{x}_f$ whose frequencies have low Hamming weight $|f| \le d$. We refer to the noise spectrum as the tail.

Definition 1 (Head and tail norm). For all integers $n$, $d$, and $k$ we define the head of $\hat{x} : \mathbb{F}_2^n \to \mathbb{R}$ as
$$\hat{x}_{\mathrm{head}} := \arg\min_{\substack{y:\mathbb{F}_2^n\to\mathbb{R},\ \|y\|_0\le k,\\ |j|\le d \text{ for all } j\in\mathrm{supp}(y)}} \|\hat{x} - y\|_2.$$
The tail norm of $\hat{x}$ is defined as $\mathrm{Err}(\hat{x}, k, d) := \|\hat{x} - \hat{x}_{\mathrm{head}}\|_2^2$.

Since the set function to be learned is only approximately in the low order Fourier sparse model, it makes sense to consider the approximate version of problem (1). We use the well known $\ell_2/\ell_2$ approximation guarantee to formally define the robust version of problem (1) as follows:

Input: oracle access to $x : \mathbb{F}_2^n \to \mathbb{R}$
Output: a function $\hat{\chi} : \mathbb{F}_2^n \to \mathbb{R}$ such that $\|\hat{\chi} - \hat{x}\|_2^2 \le (1+\delta)\,\mathrm{Err}(\hat{x}, k, d)$ and $|f| \le d$ for all $f \in \mathrm{supp}(\hat{\chi})$ (2)

Note that no assumptions are made about the function $x$; it can be any general set function.
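To make the Fourier structure of cut functions described above concrete, the following minimal Python sketch (illustrative only; the toy graph, variable names, and brute-force transform are our own, not the paper's code) builds the cut function of a small weighted graph, computes its Walsh-Hadamard transform directly from the definition, and prints the nonzero coefficients, all of which sit at frequencies of Hamming weight at most two.

```python
import itertools
import numpy as np

n = 4                                                   # ground set size |V|
edges = {(0, 1): 2.0, (1, 2): 1.0, (0, 3): 0.5}         # toy weighted graph (illustrative)

def cut_value(t):
    """Cut function x(A) for the subset A encoded by the 0/1 indicator vector t."""
    return sum(w for (s, u), w in edges.items() if t[s] != t[u])

# Evaluate x on all 2^n subsets and take the WHT with the 1/2^n normalization from the text.
subsets = list(itertools.product([0, 1], repeat=n))
x = np.array([cut_value(t) for t in subsets])
for f in subsets:
    coeff = sum(x[i] * (-1) ** (np.dot(f, t) % 2) for i, t in enumerate(subsets)) / 2 ** n
    if abs(coeff) > 1e-9:
        print(f, coeff, "Hamming weight:", sum(f))
# The zero frequency carries half the total edge weight, and each edge {s,u} contributes
# a coefficient -w(s,u)/2 at the frequency whose ones are exactly at positions s and u;
# all other coefficients vanish, so the spectrum is (|E|+1)-sparse and of degree 2.
```

Of course, this brute-force transform takes $2^n$ function evaluations; the point of the paper is to recover exactly such a sparse, low degree spectrum from only $O(kd\log n)$ queries.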
3 Algorithm and Analysis

In this section we present our algorithm and analysis. We use techniques from the sparse FFT literature [3, 6, 2]. Our main technical novelty is a new primitive for estimating a low order frequency, i.e., $|f| \le d$, efficiently using an optimal number of samples $O(d\log n)$, given in Section 3.1. This primitive relies heavily on the fact that a low order frequency is constrained to a set of size roughly $\binom{n}{d}$, as opposed to the whole universe of size $2^n$. We show that problem (1) can be solved quickly and using few samples from the function $x$ by proving the following theorem.

Theorem 2. For any integers $n$, $k$, and $d$, the procedure EXACTSHT solves problem (1) with probability at least $9/10$. Moreover, the runtime of this algorithm is $O(kn\log^2 k\log n\log d)$ and its sample complexity is $O(kd\log n)$.

We also show that problem (2) can be solved efficiently by proving the following theorem in the full version of this paper.

Theorem 3. For any integers $n$, $k$, and $d$, the procedure ROBUSTSHT solves problem (2) with probability at least $9/10$. Moreover, the runtime of this procedure is $O\big(nk\log^3 k + nk\log^2 k\log n\log(d\log n)\log d\big)$ and its sample complexity is $O(kd\log n\log(d\log n))$.

Remark: This theorem shows that for any arbitrary input signal we achieve the $\ell_2/\ell_2$ guarantee using $O(kd\log n\log(d\log n))$ samples. Using the techniques of [9] one can prove that this sample complexity is optimal up to a $\log(d\log n)$ factor. Note that it is impossible to achieve this sample complexity without exploiting the low degree structure of the Fourier support.

3.1 Low order frequency recovery

In this section we provide a novel method for recovering a frequency $f \in \mathbb{F}_2^n$ with bounded Hamming weight $|f| \le d$ from measurements $\langle m_i, f\rangle$, $i \in [s]$, for some $s = O(d\log n)$. The goal of this section is to design a measurement matrix $M \in \mathbb{F}_2^{s\times n}$ with small $s$ such that, for any $f \in \mathbb{F}_2^n$ with $|f| \le d$, the following system of constraints has, with constant probability, a unique solution $j = f$ and admits an efficient solver:
$$\text{find } j \in \mathbb{F}_2^n \text{ such that } Mj = Mf \text{ and } |j| \le d.$$
To design an efficient solver for this problem with optimal $s$, we first need an optimal algorithm for recovering frequencies of weight one, $|f| \le 1$. In this case, we can locate the index of the nonzero coordinate of $f$ optimally via binary search, using $O(\log n)$ measurements and runtime.

Definition 4 (Binary search vectors). For any integer $n$, the ensemble of vectors $\{v^l\}_{l=0}^{\lceil\log_2 n\rceil} \subseteq \mathbb{F}_2^n$ corresponding to binary search on $n$ elements is defined as follows. Let $v^0 = \{1\}^n$ (the all ones vector). For every $l \in \{1,\dots,\lceil\log_2 n\rceil\}$ and every $j \in [n]$, $v^l_j = \left\lfloor \frac{j \bmod 2^l}{2^{l-1}} \right\rfloor$.

Lemma 5. There exists a set of measurements $\{m_i\}_{i=1}^{s}$ with $s = \lceil\log_2 n\rceil + 1$, together with an algorithm, such that for every $f \in \mathbb{F}_2^n$ with $|f| \le 1$ the algorithm recovers $f$ from the measurements $\langle f, m_i\rangle$ in time $O(\log n)$.

To recover a frequency $f$ with Hamming weight $d$, we hash the coordinates of $f$ randomly into $O(d)$ buckets. In expectation, a constant fraction of the nonzero elements of $f$ get isolated in their buckets, and hence the problem reduces to weight-one recovery, which we can solve via binary search as in Lemma 5 in time $O(\log n)$ and with sample complexity $O(\log n)$ (see also the sketch below). We recover a constant fraction of the nonzero indices of $f$, subtract those from $f$, and recurse on the residual. The pseudocode of the recovery procedure is presented in Algorithm 1.
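The following sketch (illustrative Python, with 0-indexed coordinates rather than the paper's 1-indexed $[n]$) implements the binary search vectors of Definition 4 and the weight-one decoder of Lemma 5: the parity $\langle f, v^0\rangle$ reveals whether $f$ is nonzero, and the parities $\langle f, v^l\rangle$ for $l \ge 1$ spell out the binary representation of the index of its single nonzero coordinate.

```python
import math
import numpy as np

def binary_search_vectors(n):
    """Binary search vectors v^0, ..., v^L of Definition 4 (L = ceil(log2 n)).
    v^0 is all ones; for l >= 1, v^l_j is the l-th least significant bit of j (0-indexed j)."""
    L = math.ceil(math.log2(n))
    vs = [np.ones(n, dtype=int)]
    for l in range(1, L + 1):
        vs.append(np.array([(j % 2 ** l) // 2 ** (l - 1) for j in range(n)]))
    return vs

def recover_weight_one(parities, n):
    """Decoder of Lemma 5: recover f with |f| <= 1 from the parities <f, v^l>."""
    if parities[0] == 0:               # <f, v^0> = |f| mod 2, so f must be the zero vector
        return np.zeros(n, dtype=int)
    index = sum(bit << (l - 1) for l, bit in enumerate(parities[1:], start=1))
    f = np.zeros(n, dtype=int)
    f[index] = 1
    return f

n = 37
vs = binary_search_vectors(n)
f = np.zeros(n, dtype=int); f[21] = 1                   # a weight-one frequency
parities = [int(np.dot(f, v)) % 2 for v in vs]          # the ceil(log2 n) + 1 measurements
assert np.array_equal(recover_weight_one(parities, n), f)
```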
Lemma 6. For any integers $n$ and $d$, any power-of-two integer $D \ge 128d$, and any frequency $f \in \mathbb{F}_2^n$ with $|f| \le d$, the procedure RECOVERFREQUENCY given in Algorithm 1 outputs $f$ with probability at least $7/8$, provided we have access to the following:
1. For every $r = 0, 1, \dots, \log_4 D$, a hash function $h_r : [n] \to [D/2^r]$ which is an instance of a pairwise independent hash family.
2. For every $l = 0, 1, \dots, \lceil\log_2 n\rceil$ and every $r = 0, 1, \dots, \log_4 D$, the measurements $\phi^l_r(i) = \sum_{j \in h_r^{-1}(i)} f_j v^l_j$ for every $i \in [D/2^r]$.
Moreover, the runtime of this procedure is $O(D\log D\log n)$ and the number of measurements is $O(D\log n)$.

Algorithm 1 RECOVERFREQUENCY
input: power-of-two integer $D$, hash functions $h_r : [n] \to [D/2^r]$ for every $r \in \{0,1,\dots,\log_4 D\}$, measurement vectors $\phi^l_r \in \mathbb{F}_2^{D/2^r}$ for every $l = 0,1,\dots,\lceil\log_2 n\rceil$ and every $r = 0,1,\dots,\log_4 D$.
output: recovered frequency $f$.
1: $\{v^l\}_{l=0}^{\lceil\log_2 n\rceil} \leftarrow$ binary search vectors on $n$ elements (Definition 4), $T \leftarrow \log_4 D$, $f^{(0)} \leftarrow \{0\}^n$
2: for $r = 0$ to $T$ do
3:   $w \leftarrow \{0\}^n$
4:   for $i = 1$ to $D/2^r$ do
5:     if $\phi^0_r(i) + \sum_{j \in h_r^{-1}(i)} f^{(r)}_j v^0_j = 1$ then
6:       index $\leftarrow \{0\}^{\lceil\log_2 n\rceil}$, a $\lceil\log_2 n\rceil$-bit pointer
7:       for $l = 1$ to $\lceil\log_2 n\rceil$ do
8:         if $\phi^l_r(i) + \sum_{j \in h_r^{-1}(i)} f^{(r)}_j v^l_j = 1$ then
9:           $[\text{index}]_l \leftarrow 1$, set the $l$-th bit of index to 1
10:      $w(\text{index}) \leftarrow 1$, set the coordinate of $w$ positioned at index to 1
11:  $f^{(r+1)} \leftarrow f^{(r)} + w$
12: return $f^{(T+1)}$

Proof. The proof is by induction on the iteration number $r = 0, 1, \dots, T$. We denote by $\mathcal{E}_r$ the event $|f - f^{(r)}| \le \frac{d}{4^r}$, that is, the sparsity goes down by a factor of 4 in every iteration up to the $r$-th iteration. The inductive hypothesis is $\Pr[\mathcal{E}_{r+1} \mid \mathcal{E}_r] \ge 1 - \frac{1}{16\cdot 2^r}$.

Conditioned on $\mathcal{E}_r$ we have that $|f - f^{(r)}| \le \frac{d}{4^r}$. For every $i \in [D/2^r]$ and every $l \in \{0,1,\dots,\lceil\log_2 n\rceil\}$ it follows from the definition of $\phi^l_r$ that
$$\phi^l_r(i) + \sum_{j \in h_r^{-1}(i)} f^{(r)}_j v^l_j = \sum_{j \in h_r^{-1}(i)} \big(f_j - f^{(r)}_j\big)\, v^l_j.$$
Let $S$ denote the support of the vector $f - f^{(r)}$, namely $S = \mathrm{supp}\big(f - f^{(r)}\big)$. By the pairwise independence of the hash function $h_r$, the following holds for every $a \in S$:
$$\Pr\big[h_r(a) \in h_r(S\setminus\{a\})\big] \le \frac{2^r|S|}{D} \le \frac{2^r}{128\cdot 4^r} = \frac{1}{128\cdot 2^r}.$$
This shows that for every $a \in S$, with probability at least $1 - \frac{1}{128\cdot 2^r}$, the bucket $h_r(a)$ contains no other element of $S$. In that case the vector $f - f^{(r)}$ restricted to the elements of the bucket $h_r^{-1}(h_r(a))$ has Hamming weight one, i.e., for every $a \in S$,
$$\Pr\Big[\big|\big(f - f^{(r)}\big)_{h_r^{-1}(h_r(a))}\big| = 1\Big] \ge 1 - \frac{1}{128\cdot 2^r}.$$
If this condition holds, then the index of the nonzero element can be found via binary search as in Lemma 5; the for loop in line 7 of Algorithm 1 implements this. Therefore, by Markov's inequality, with probability at least $1 - \frac{1}{16\cdot 2^r}$ a $1 - 1/8$ fraction of the support elements of $S$ are recovered correctly, and at most a $1/8$ fraction remain unrecovered and possibly result in false positives. Since the algorithm recovers at most one element per bucket, the total number of falsely recovered indices is no more than the number of non-isolated buckets, which is at most $\frac{1}{8}|S|$. Therefore, with probability at least $1 - \frac{1}{16\cdot 2^r}$, the residual at the end of the $r$-th iteration has sparsity at most $\frac{1}{8}|S| + \frac{1}{8}|S| = \frac{1}{4}|S|$, i.e., $\big|f - f^{(r+1)}\big| \le \frac{|S|}{4} \le \frac{d}{4^{r+1}}$. This proves the inductive step.

It follows from the event $\mathcal{E}_T$ for $T = \log_4 D$ that $f^{(T)} = f$, where $f^{(T)}$ is the output of Algorithm 1. The inductive hypothesis together with a union bound implies that
$$\Pr\big[\overline{\mathcal{E}_T}\big] \le \sum_{r=1}^{T} \Pr\big[\overline{\mathcal{E}_r} \mid \mathcal{E}_{r-1}\big] + \Pr\big[\overline{\mathcal{E}_0}\big] \le \sum_{r=0}^{T} \frac{1}{16\cdot 2^r} \le \frac{1}{8}.$$

Runtime: the algorithm has three nested loops, and the total number of iterations of all loops together is $O(D\log n)$. The recovered frequency $f^{(r)}$ always has at most $O(D)$ nonzero entries, therefore the time to calculate $\sum_{j \in h_r^{-1}(i)} f^{(r)}_j v^l_j$ for fixed $r$ and $l$ and all $i \in [D/2^r]$ is $O(D)$. Therefore the total runtime is $O(D\log D\log n)$.

Number of measurements: the number of measurements is the total size of the measurement vectors $\phi^l_r$, which is $O(D\log n)$.
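As a sanity check of the hash-isolate-and-peel structure of Algorithm 1, here is a simulation sketch in Python. It makes two simplifications that are ours, not the paper's: it uses a truly random hash function in place of a pairwise independent family, and it generates the measurements $\phi^l_r$ directly from $f$, which is exactly the access assumed in Lemma 6.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def binary_search_vectors(n):
    L = math.ceil(math.log2(n))
    vs = [np.ones(n, dtype=int)]
    for l in range(1, L + 1):
        vs.append(np.array([(j % 2 ** l) // 2 ** (l - 1) for j in range(n)]))
    return vs

def recover_frequency(f, D):
    """Simulated RECOVERFREQUENCY: recover f (|f| <= d) from the parities of Lemma 6."""
    n = len(f)
    vs = binary_search_vectors(n)
    T = round(math.log(D, 4))
    f_rec = np.zeros(n, dtype=int)
    for r in range(T + 1):
        buckets = D // 2 ** r
        h = rng.integers(buckets, size=n)          # stand-in for a pairwise independent hash
        # Measurements phi^l_r(i) = sum_{j in h^-1(i)} f_j v^l_j (mod 2), exactly as in Lemma 6.
        phi = [np.bincount(h, weights=f * v, minlength=buckets).astype(int) % 2 for v in vs]
        w = np.zeros(n, dtype=int)
        for i in range(buckets):
            mask = (h == i)
            # A bucket is "active" iff the residual f - f_rec hits it an odd number of times.
            if (phi[0][i] + int(f_rec[mask] @ vs[0][mask])) % 2 == 1:
                index = 0
                for l in range(1, len(vs)):        # binary search for the position in the bucket
                    if (phi[l][i] + int(f_rec[mask] @ vs[l][mask])) % 2 == 1:
                        index |= 1 << (l - 1)
                if index < n:
                    w[index] = 1
            # Peeling: wrongly recovered indices reappear in later rounds and are flipped back.
        f_rec = (f_rec + w) % 2
    return f_rec

n, d = 200, 6
D = 1024                                           # a power of two with D >= 128 d
f = np.zeros(n, dtype=int)
f[rng.choice(n, size=d, replace=False)] = 1
print("recovered correctly:", np.array_equal(recover_frequency(f, D), f))
```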
3.2 Signal reduction

We now develop the main tool for estimating the frequencies of a sparse signal, namely the HASH2BINS primitive. If we hash the frequencies of a $k$-sparse signal into $O(k)$ buckets, we expect most buckets to contain at most one element of the support of our signal. The next definition shows how we compute the hashing of a signal in the time domain.

Definition 7. For every $n, b \in \mathbb{N}$, every $a \in \mathbb{F}_2^n$, every $\sigma \in \mathbb{F}_2^{n\times b}$, and every $x : \mathbb{F}_2^n \to \mathbb{R}$, we define the hashing of $\hat{x}$ as $u^a_\sigma : \mathbb{F}_2^b \to \mathbb{R}$, where $u^a_\sigma(t) = x_{\sigma t + a}$ for every $t \in \mathbb{F}_2^b$. We denote by $B = 2^b$ the number of buckets of the hash function.

In the next claim we show that the Fourier transform of $u^a_\sigma$ corresponds to hashing $\hat{x}$ into $B$ buckets.

Claim 8. For every $j \in \mathbb{F}_2^b$, $\widehat{u^a_\sigma}(j) = \sum_{f \in \mathbb{F}_2^n:\ \sigma^\top f = j} \hat{x}_f\, (-1)^{\langle a,f\rangle}$.

Let $h(f) := \sigma^\top f$. For every $j \in \mathbb{F}_2^b$, $\widehat{u^a_\sigma}(j)$ is the sum of $\hat{x}_f (-1)^{\langle a,f\rangle}$ over all frequencies $f \in \mathbb{F}_2^n$ such that $h(f) = j$; hence $h(f)$ can be thought of as the bucket that $f$ is hashed into. If the matrix $\sigma$ is chosen uniformly at random, then the hash function $h(\cdot)$ is pairwise independent.

Claim 9. For any $n, b \in \mathbb{N}$, if the hash function $h : \mathbb{F}_2^n \to \mathbb{F}_2^b$ is defined as $h(\cdot) = \sigma^\top(\cdot)$, where $\sigma \in \mathbb{F}_2^{n\times b}$ is a random matrix whose entries are distributed independently and uniformly at random on $\mathbb{F}_2$, then for any $f \ne f' \in \mathbb{F}_2^n$ it holds that $\Pr[h(f) = h(f')] = \frac{1}{B}$, where the probability is over the $nb$ random bits of $\sigma$.

Algorithm 2 HASH2BINS
input: signal $x \in \mathbb{R}^{2^n}$, signal $\hat{\chi} \in \mathbb{R}^{2^n}$, integer $b$, binary matrix $\sigma \in \mathbb{F}_2^{n\times b}$, shift vector $a \in \mathbb{F}_2^n$.
output: hashed signal $\widehat{u^a_\sigma}$.
1: Compute $\widehat{u^a_\sigma} = \mathrm{FHT}\big(x_{\sigma(\cdot)+a}\big)$.  (FHT is the fast Hadamard transform algorithm.)
2: $\widehat{u^a_\sigma}(j) \leftarrow \widehat{u^a_\sigma}(j) - \sum_{f \in \mathbb{F}_2^n:\ \sigma^\top f = j} \hat{\chi}_f\, (-1)^{\langle a,f\rangle}$ for every $j \in \mathbb{F}_2^b$.
3: return $\widehat{u^a_\sigma}$.

The HASH2BINS primitive computes the Fourier coefficients of the residual signal hashed into each of the buckets. We denote by $\hat{\chi}$ the current estimate of $\hat{x}$. As we will see in Section 3.3, the recovery algorithm is iterative in the sense that we iterate on $\widehat{x - \chi}$ (the residual), whose sparsity is guaranteed to decrease by a constant factor in each step.

Claim 10. For any signals $x, \chi : \mathbb{F}_2^n \to \mathbb{R}$, integer $b$, matrix $\sigma \in \mathbb{F}_2^{n\times b}$, and vector $a \in \mathbb{F}_2^n$, the procedure HASH2BINS$(x, \hat{\chi}, b, \sigma, a)$ given in Algorithm 2 computes, using $O(B)$ samples from $x$ and in time $O\big(Bn\log B + \|\hat{\chi}\|_0\, n\log B\big)$,
$$\widehat{u^a_\sigma}(j) = \sum_{f \in \mathbb{F}_2^n:\ \sigma^\top f = j} \big(\widehat{x - \chi}\big)_f\, (-1)^{\langle a,f\rangle}.$$
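The following self-contained Python sketch illustrates Definition 7, Claim 8, and the first line of Algorithm 2 under the normalization used above: it subsamples $x$ on the coset $\{\sigma t + a : t \in \mathbb{F}_2^b\}$, applies a naive quadratic-time Hadamard transform on $2^b$ points in place of the FHT, and verifies Claim 8 against the full spectrum. All names and the toy signal are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, b = 6, 3                                   # 2^n-point signal hashed into B = 2^b buckets

def wht(vals):
    """WHT with the 1/N normalization used in the text (naive O(N^2) version, not the FHT)."""
    N = len(vals)
    pts = list(itertools.product([0, 1], repeat=int(np.log2(N))))
    return np.array([sum(vals[i] * (-1) ** (np.dot(f, t) % 2) for i, t in enumerate(pts))
                     for f in pts]) / N

# A sparse spectrum and the corresponding time-domain signal x (inverse WHT has no prefactor).
pts_n = list(itertools.product([0, 1], repeat=n))
xhat = np.zeros(2 ** n); xhat[[3, 17, 40]] = [1.5, -2.0, 0.75]
x = np.array([sum(xhat[i] * (-1) ** (np.dot(f, t) % 2) for i, f in enumerate(pts_n))
              for t in pts_n])

sigma = rng.integers(2, size=(n, b))          # random hashing matrix
a = rng.integers(2, size=n)                   # random shift

# Definition 7 / Algorithm 2, line 1: hash in the time domain, then transform.
pts_b = list(itertools.product([0, 1], repeat=b))
u = np.array([x[pts_n.index(tuple((sigma @ t + a) % 2))] for t in pts_b])
u_hat = wht(u)

# Claim 8: bucket j collects xhat_f * (-1)^<a,f> over all f with sigma^T f = j.
for j_idx, j in enumerate(pts_b):
    expected = sum(xhat[i] * (-1) ** (np.dot(a, f) % 2)
                   for i, f in enumerate(pts_n) if tuple((sigma.T @ f) % 2) == j)
    assert np.isclose(u_hat[j_idx], expected)
```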
3.3 Exact Fourier recovery

In this section we present our algorithm for solving the exact low order Fourier sparse problem defined in (1) and prove Theorem 2. Let $S = \mathrm{supp}(\hat{x})$. Problem (1) assumes that $|S| \le k$ and also that $|f| \le d$ for every $f \in S$. The recovery algorithm hashes the frequencies into $B = 2^b$ buckets using Algorithm 2. Every frequency $f \in S$ is recoverable, with constant probability, if no other frequency from the support collides with it in the hashed signal. The collision event is formally defined below.

Definition 11 (Collision). For any frequency $f \in \mathbb{F}_2^n$ and every sparse signal $\hat{x}$ with support $S = \mathrm{supp}(\hat{x})$, the collision event $\mathcal{E}_{\mathrm{coll}}(f)$ corresponding to the hash function $h(f) = \sigma^\top f$ holds iff $h(f) \in h(S\setminus\{f\})$.

Claim 12 (Probability of collision). For every $f \in \mathbb{F}_2^n$, if the hash function $h : \mathbb{F}_2^n \to \mathbb{F}_2^b$ is defined as $h(\cdot) = \sigma^\top(\cdot)$, where $\sigma \in \mathbb{F}_2^{n\times b}$ is a random matrix whose entries are distributed independently and uniformly at random on $\mathbb{F}_2$, then $\Pr[\mathcal{E}_{\mathrm{coll}}(f)] \le \frac{k}{B}$ (see Definition 11). The probability is over the randomness of the matrix $\sigma$.

If the hash function $h(\cdot) = \sigma^\top(\cdot)$ is such that the collision event $\mathcal{E}_{\mathrm{coll}}(f)$ does not occur for a frequency $f$, then it follows from Claim 8 and Definition 11 that for every $a \in \mathbb{F}_2^n$, $\widehat{u^a_\sigma}(h(f)) = \hat{x}_f (-1)^{\langle a,f\rangle}$. Therefore, under this condition, the problem reduces to $d$-sparse recovery of the frequency $f$. If $a = \{0\}^n$ then $\widehat{u^a_\sigma}(h(f)) = \hat{x}_f$. Hence, for any $m \in \mathbb{F}_2^n$, one can learn the inner product $\langle m, f\rangle$ by comparing the sign of $\widehat{u^m_\sigma}(h(f)) = \hat{x}_f (-1)^{\langle m,f\rangle}$ with that of $\widehat{u^a_\sigma}(h(f))$: if the signs are the same then $(-1)^{\langle m,f\rangle} = 1$, meaning that $\langle m, f\rangle = 0$, and if the signs differ then $\langle m, f\rangle = 1$. In Section 3.1 we gave an algorithm for learning a low order frequency $|f| \le d$ from measurements of the form $\langle m, f\rangle$. Putting these together gives the inner subroutine of our sparse fast Hadamard transform, which performs one round of hashing and is presented in Algorithm 3.

Algorithm 3 SHTINNER
input: signal $x \in \mathbb{R}^{2^n}$, signal $\hat{\chi} \in \mathbb{R}^{2^n}$, failure probability $p$, integer $b$, integer $d$.
output: recovered signal $\hat{\chi}'$.
1: Let $\{v^l\}_{l=0}^{\lceil\log_2 n\rceil}$ be the binary search vectors on $n$ elements (Definition 4).
2: $D \leftarrow$ smallest power of two such that $D \ge 128d$, $R \leftarrow 2\lceil\log_2(1/p)\rceil$.
3: For every $r \in \{0,1,\dots,\log_4 D\}$ and every $s \in [R]$, let $h^s_r : [n] \to [D/2^r]$ be an independent copy of a pairwise independent hash function.
4: For every $r \in \{0,1,\dots,\log_4 D\}$, every $s \in [R]$, and every $j \in [D/2^r]$, let $w^j_{r,s} \in \mathbb{F}_2^n$ be the binary indicator vector of the set $(h^s_r)^{-1}(j)$.
5: For every $s \in [R]$, every $r \in \{0,1,\dots,\log_4 D\}$, every $l \in \{0,1,\dots,\lceil\log_2 n\rceil\}$, and every $j \in [D/2^r]$, add the coordinate-wise product $w^j_{r,s} \circ v^l$ to the set $A_s$.
6: Let $\sigma \in \mathbb{F}_2^{n\times b}$ be a random matrix, each entry independent and uniform on $\mathbb{F}_2$.
7: For every $a \in \bigcup_{s\in[R]} A_s$ compute $\widehat{u^a_\sigma} = \mathrm{HASH2BINS}(x, \hat{\chi}, b, \sigma, a)$.
8: for $j = 1$ to $B$ do
9:   Let $L$ be an empty multiset.
10:  for $s \in [R]$ do
11:    for every $r \in \{0,\dots,\log_4 D\}$, every $i \in [D/2^r]$, and every $l \in \{0,\dots,\lceil\log_2 n\rceil\}$ do
12:      if $\widehat{u^c_\sigma}(j) \ne 0$, where $c = \{0\}^n$, then
13:        if $\widehat{u^c_\sigma}(j)$ and $\widehat{u^{w^i_{r,s}\circ v^l}_\sigma}(j)$ have the same sign then $\phi^l_r(i) \leftarrow 0$, else $\phi^l_r(i) \leftarrow 1$.
14:    $f \leftarrow \mathrm{RECOVERFREQUENCY}\Big(D,\ \{h^s_r\}_{r=0}^{\log_4 D},\ \big\{\phi^l_r\big\}_{r=0,\dots,\log_4 D}^{\,l=0,\dots,\lceil\log_2 n\rceil}\Big)$
15:    Append $f$ to the multiset $L$.
16:  $f \leftarrow \mathrm{majority}(L)$
17:  $\hat{\chi}'_f \leftarrow \widehat{u^c_\sigma}(j)$, where $c = \{0\}^n$.
18: return $\hat{\chi}'$.

Lemma 13. For all integers $b$ and $d$, all signals $x, \chi \in \mathbb{R}^{2^n}$ such that $|\xi| \le d$ for every $\xi \in \mathrm{supp}(\widehat{x - \chi})$, and any parameter $p > 0$, Algorithm 3 outputs a signal $\hat{\chi}' \in \mathbb{R}^{2^n}$ such that $|\mathrm{supp}(\hat{\chi}')| \le |\mathrm{supp}(\widehat{x - \chi})|$, and for every frequency $f \in \mathrm{supp}(\widehat{x - \chi})$, if the collision event $\mathcal{E}_{\mathrm{coll}}(f)$ does not happen then
$$\Pr\big[\hat{\chi}'_f = (\widehat{x - \chi})_f\big] \ge 1 - p.$$
Moreover, the sample complexity of this procedure is $O(Bd\log n\log\frac{1}{p})$ and its time complexity is $O\big(B\log B\,(n + d\log n\log\frac{1}{p}) + nB\log n\log d\log\frac{1}{p} + \|\hat{\chi}\|_0\, n(\log B + \log n\log d\log\frac{1}{p})\big)$.

Lemma 14. For any parameter $p > 0$, all integers $k$, $d$, and $b \ge \log_2(k/p)$, and all signals $x, \chi \in \mathbb{R}^{2^n}$ such that $\|\widehat{x - \chi}\|_0 \le k$ and $|\xi| \le d$ for every $\xi \in \mathrm{supp}(\widehat{x - \chi})$, the output $\hat{\chi}'$ of SHTINNER$(x, \hat{\chi}, p, b, d)$ satisfies the following with probability at least $1 - 32p$:
$$\big\|\hat{x} - \hat{\chi} - \hat{\chi}'\big\|_0 \le k/8.$$

Our sparse Hadamard transform algorithm iteratively calls the primitive SHTINNER to reduce the sparsity of the residual signal by a constant factor in every iteration. Hence, it terminates after $O(\log k)$ iterations. See Algorithm 4.

Algorithm 4 EXACTSHT
input: signal $x \in \mathbb{R}^{2^n}$, failure probability $q$, sparsity $k$, integer $d$.
output: estimate $\hat{\chi} \in \mathbb{R}^{2^n}$.
1: $p^{(1)} \leftarrow q/32$, $b^{(1)} \leftarrow \lceil\log_2\frac{64k}{q}\rceil$, $w^{(0)} \leftarrow \{0\}^{2^n}$, $T \leftarrow \lceil\log_8 k\rceil$.
2: for $r = 1$ to $T$ do
3:   $\tilde{\chi} \leftarrow \mathrm{SHTINNER}(x, w^{(r-1)}, p^{(r)}, b^{(r)}, d)$
4:   $w^{(r)} \leftarrow w^{(r-1)} + \tilde{\chi}$
5:   $p^{(r+1)} \leftarrow p^{(r)}/2$, $b^{(r+1)} \leftarrow b^{(r)} - 2$
6: $\hat{\chi} \leftarrow w^{(T)}$
7: return $\hat{\chi}$.
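Before the proof, here is a small illustrative Python sketch of the sign-comparison step used in lines 12-13 of Algorithm 3. It assumes, as in the text, that the frequency $f$ is isolated in its bucket, and it simulates the bucket value $\hat{x}_f(-1)^{\langle a,f\rangle}$ through a hypothetical oracle rather than running HASH2BINS; the point is only that sign comparisons against the zero-shift bucket yield arbitrary parities $\langle m, f\rangle$, which is exactly what RECOVERFREQUENCY consumes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3

# With no collision, a bucket containing the single frequency f behaves like (Claim 8):
#   u_hat_a(h(f)) = xhat_f * (-1)^<a, f>   for any shift a.
f = np.zeros(n, dtype=int)
f[rng.choice(n, size=d, replace=False)] = 1
xhat_f = -0.7                                        # the (unknown) nonzero coefficient

def bucket_value(a):
    """Hypothetical oracle for the isolated bucket's content under shift a."""
    return xhat_f * (-1) ** int(np.dot(a, f) % 2)

reference = bucket_value(np.zeros(n, dtype=int))     # zero shift: the bucket holds xhat_f itself

def parity(m):
    """<m, f> read off by comparing signs, as in lines 12-13 of Algorithm 3."""
    return 0 if np.sign(bucket_value(m)) == np.sign(reference) else 1

# Any linear measurement of f is now available, e.g. the binary search parities of
# Definition 4 restricted to a bucket of a coordinate hash, which RECOVERFREQUENCY decodes.
for _ in range(5):
    m = rng.integers(2, size=n)
    assert parity(m) == int(np.dot(m, f) % 2)
print("coefficient estimate from the zero-shift bucket:", reference)
```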
Proof of Theorem 2: The proof is by induction. We denote by $\mathcal{E}_r$ the event that $\|\hat{x} - w^{(r)}\|_0 \le \frac{k}{8^r}$. The inductive hypothesis is $\Pr[\mathcal{E}_r \mid \mathcal{E}_{r-1}] \ge 1 - 32p^{(r)}$. Conditioned on $\mathcal{E}_{r-1}$ we have that $\|\hat{x} - w^{(r-1)}\|_0 \le \frac{k}{8^{r-1}}$. The number of buckets in iteration $r$ of the algorithm is $B^{(r)} = 2^{b^{(r)}} \ge \frac{64k}{4^{r-1}q}$. Hence it follows from Lemma 14 that, with probability at least $1 - 32p^{(r)}$, $\|\hat{x} - w^{(r)}\|_0 \le \frac{k}{8^r}$. This proves the inductive step.

Runtime and sample complexity: In iteration $r \in [\lceil\log_8 k\rceil]$, the number of buckets is $B^{(r)} = 2^{b^{(r)}} \approx \frac{64k}{q\,4^{r-1}}$ and the error probability is $p^{(r)} = \frac{q}{32\cdot 2^{r-1}}$. Moreover, at most $\sum_r B^{(r)}$ elements are added to $\hat{\chi}$, hence we can assume that $\|\hat{\chi}\|_0 \le \frac{128k}{q}$. From Lemma 13 it follows that the total runtime is $O(kn\log^2 k\log n\log d)$. The sample complexity of iteration $r$ is $O\big(\frac{kd}{2^r}\log n\log 2^r\big)$, hence the total sample complexity is dominated by that of the first iteration, which is $O(kd\log n)$.

4 Experiments

We test our EXACTSHT algorithm for graph sketching on a real world dataset. We use the autonomous systems dataset from the SNAP data collection (snap.stanford.edu/data/). In order to compare our method with [12] we reproduce their experimental setup. The dataset consists of 9 snapshots of an autonomous system in Oregon, recorded on 9 different dates. The goal is to detect which edges are added and removed when comparing the system on two different dates. As a pre-processing step, we find the common vertices that exist on all dates and look at the induced subgraphs on these vertices. We take the symmetric difference (over the edges) of dates 7 and 9; results for other date combinations can be found in the supplementary material. This results in a graph that is sparse in the number of edges. Recall that the running time of our algorithm is $O(kn\log^2 k\log n\log d)$, which reduces to $O(nk\log^2 k\log n)$ for the case of cut functions, where $d = 2$.

Figure 1: (a) Average running time vs. number of edges. (b) Average running time vs. number of vertices.

Table 1: Sampling and computational complexity (runtimes in seconds; asterisks mark experiments that took too long to be feasible to run).

| No. of vertices | CS method runtime | CS method samples | Our method runtime | Our method samples |
|---|---|---|---|---|
| 70 | 1.14 | 767 | 0.85 | 6428 |
| 90 | 1.88 | 812 | 0.92 | 6490 |
| 110 | 3.00 | 850 | 0.82 | 6491 |
| 130 | 4.31 | 880 | 1.01 | 7549 |
| 150 | 5.34 | 905 | 1.16 | 7942 |
| 170 | 6.13 | 927 | 1.22 | 7942 |
| 190 | 7.36 | 947 | 1.18 | 7271 |
| 210 | 8.24 | 965 | 1.28 | 7271 |
| 230 | * | * | 1.38 | 7942 |
| 250 | * | * | 1.38 | 7271 |
| 300 | * | * | 1.66 | 8051 |
| 400 | * | * | 2.06 | 8794 |
| 500 | * | * | 2.42 | 8794 |
| 600 | * | * | 3.10 | 9646 |
| 700 | * | * | 3.35 | 9646 |
| 800 | * | * | 3.60 | 9646 |

4.1 Sample and time complexities as the number of vertices varies

In the first experiment, depicted in Figure 1b, we order the vertices of the graph by their degree and look at the induced subgraph on the $n$ highest-degree vertices, for varying $n$. For each $n$ we pick $e = 50$ edges uniformly at random. The goal is to learn the underlying graph by observing the values of cuts. We choose the parameters of our algorithm such that the probability of success is at least 0.9; the parameters tuned to reach this error probability are the initial number of buckets the frequencies are hashed into and the rate at which this number is reduced in each iteration. We plot running times as $n$ varies and compare our algorithm with that of [12], which uses a CS approach; we fine-tune their algorithm by tuning its sampling complexity. Both algorithms are run so that each sample (each observation of a cut value) takes the same time. As one can see, our algorithm scales linearly with $n$ (up to log factors), whereas the CS approach scales quadratically. Our algorithm continues to run in a reasonable amount of time for vertex counts as large as 900, in under 2 seconds. Error bars depict standard deviations. In Table 1 we report both sampling complexities (number of observed cuts) and running times as $n$ varies. Our sampling complexity is $O(k\log n)$.
In practice, our sampling complexities are around a constant factor of 10 worse than those of the compressive sensing method, which is not provably optimal (see Section 1) but performs well in practice. In terms of computational cost, however, the CS approach quickly becomes intractable, taking very long on instances with around 200 vertices or more [12]. Asterisks in Table 1 refer to experiments that took too long to be feasible to run.

4.2 Time complexities as the number of edges varies

Here we fix the number of vertices to $n = 100$ and consider the induced subgraph on these vertices. We randomly pick $e$ edges to include in the graph and plot the resulting running times. As can be seen in Figure 1a, our running time scales linearly in the number of edges.

References

[1] Abhik Kumar Das and Sriram Vishwanath. On finite alphabet compressive sensing. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5890-5894. IEEE, 2013.
[2] Oded Goldreich and Leonid A. Levin. A hard-core predicate for all one-way functions. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 25-32. ACM, 1989.
[3] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 563-578. ACM, 2012.
[4] Ishay Haviv and Oded Regev. The restricted isometry property of subsampled Fourier matrices. In Geometric Aspects of Functional Analysis, pages 163-179. Springer, 2017.
[5] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: a spectral approach. In International Conference on Learning Representations, 2018.
[6] Piotr Indyk, Michael Kapralov, and Eric Price. (Nearly) sample-optimal sparse Fourier transform. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 480-499. SIAM, 2014.
[7] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331-1348, 1993.
[8] Yishay Mansour. Learning Boolean functions via the Fourier transform. In Theoretical Advances in Neural Computation and Learning, pages 391-424. Springer, 1994.
[9] Eric Price and David P. Woodruff. (1 + eps)-approximate sparse recovery. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pages 295-304. IEEE Computer Society, 2011.
[10] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025-1045, 2008.
[11] Robin Scheibler, Saeid Haghighatshoar, and Martin Vetterli. A fast Hadamard transform for signals with sublinear sparsity in the transform domain. IEEE Transactions on Information Theory, 61(4):2115-2132, 2015.
[12] Peter Stobbe and Andreas Krause. Learning Fourier sparse set functions. In Artificial Intelligence and Statistics, pages 1125-1133, 2012.
[13] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.