Sparse Polynomial Learning and Graph Sketching

Murat Kocaoglu¹, Karthikeyan Shanmugam¹, Alexandros G. Dimakis¹, Adam Klivans²
¹Department of Electrical and Computer Engineering, ²Department of Computer Science
The University of Texas at Austin, USA
mkocaoglu@utexas.edu, karthiksh@utexas.edu, dimakis@austin.utexas.edu, klivans@cs.utexas.edu

Abstract. Let $f : \{-1,1\}^n \to \mathbb{R}$ be a polynomial with at most $s$ non-zero real coefficients. We give an algorithm for exactly reconstructing $f$ given random examples from the uniform distribution on $\{-1,1\}^n$ that runs in time polynomial in $n$ and $2^s$ and succeeds if the function satisfies the unique sign property: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of $f$ is perturbed by a small random noise, and it holds with high probability when the $s$ parity functions are chosen at random or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in $n$ and $2^s$ is considered notoriously hard in the worst case. Our result shows that the problem is tractable for almost all sparse polynomials. We then apply this result to hypergraph sketching, the problem of learning a hypergraph that is sparse both in the number of hyperedges and in the size of each hyperedge from uniformly drawn random cuts. We also provide experimental results on a real-world dataset.

1 Introduction

Learning sparse polynomials over the Boolean domain is one of the fundamental problems of computational learning theory and has been studied extensively over the last twenty-five years [1-6]. In almost all cases, known algorithms for learning or interpolating sparse polynomials require query access to the unknown polynomial. An outstanding open problem is to find an algorithm for learning $s$-sparse polynomials with respect to the uniform distribution on $\{-1,1\}^n$ that runs in time polynomial in $n$ and $g(s)$ (where $g$ is any fixed function independent of $n$) and requires only randomly chosen examples to succeed. In particular, such an algorithm would imply a breakthrough result for the problem of learning $k$-juntas (functions that depend on only $k \ll n$ input variables; it is not known how to learn $\omega(1)$-juntas in polynomial time).

We present an algorithm and a set of natural conditions such that any sparse polynomial $f$ satisfying these conditions can be learned from random examples in time polynomial in $n$ and $2^s$. In particular, any $f$ whose coefficients have been subjected to a small perturbation (the smoothed analysis setting) satisfies these conditions; for example, if a Gaussian with arbitrarily small variance is added independently to each coefficient, $f$ satisfies these conditions with probability 1. We state our main result here:

Theorem 1. Let $f$ be an $s$-sparse function that satisfies at least one of the following properties: (a) (smoothed analysis setting) the coefficients $\{c_i\}_{i=1}^s$ are in general position or all of them are perturbed by a small random noise; (b) the $s$ parity functions are linearly independent; (c) all the coefficients are positive. Then $f$ can be learned with high probability in time $\mathrm{poly}(n, 2^s)$.

We note that smoothed analysis, pioneered in [7], has become a common alternative for problems that seem intractable in the worst case.
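To see concretely why a random perturbation of the coefficients forces the unique sign property, note that distinct sign patterns of the $s$ parities then yield distinct values of $f$ with probability 1, so the maximum value in particular identifies a single sign pattern. The short script below is an illustrative numerical check of this (our own sketch with made-up names, not the paper's code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, s = 10, 4

# Pick s random distinct parity sets (subsets of [n]) and Gaussian coefficients.
subsets = []
while len(subsets) < s:
    S = tuple(sorted(rng.choice(n, size=rng.integers(1, 4), replace=False)))
    if S not in subsets:
        subsets.append(S)
coeffs = rng.normal(size=s)  # "smoothed" coefficients: continuous, hence in general position a.s.

# Enumerate all 2^s sign patterns of the s parities and the value of f each one induces.
# (The patterns actually achievable by inputs x are a subset of these, which only helps.)
values = {}
for signs in itertools.product([-1, 1], repeat=s):
    values.setdefault(round(float(np.dot(coeffs, signs)), 12), []).append(signs)

max_val = max(values)
print("distinct values:", len(values), "out of", 2 ** s)
print("sign patterns achieving the maximum:", values[max_val])  # expected: exactly one
```

With continuous coefficients the printout reports $2^s$ distinct values and exactly one maximizing sign pattern; ties can occur only for degenerate (measure-zero) coefficient vectors.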
Our algorithm also succeeds in the presence of noise:

Theorem 2. Let $f = f_1 + f_2$ be a polynomial such that $f_1$ and $f_2$ depend on mutually disjoint sets of parity functions, $f_1$ is $s$-sparse, and the values of $f_1$ are well separated. Further, $\|f_2\|_1 \le \nu$ (i.e., $f$ is approximately sparse). If the observations are corrupted by additive noise bounded by $\epsilon$, then there exists an algorithm, taking $\epsilon + \nu$ as an input, that outputs a $g$ in time polynomial in $n$ and $2^s$ such that $\|f - g\|_2 \le O(\nu + \epsilon)$ with high probability.

The treatment of the noisy case, i.e., the formal statement of this theorem, the corresponding algorithm, and the related proofs, is relegated to the supplementary material. All these results are based on what we call the unique sign property: if there is one value of $f$ that uniquely specifies the signs of the parity functions involved, then the function is efficiently learnable. Note that our results cannot be used for learning juntas or other Boolean-valued sparse polynomials, since the unique sign property does not hold in these settings.

We show that this property holds for the complement of the cut function on a hypergraph (the number of hyperedges minus the cut value). This fact can be used to learn the cut-complement function and eventually infer the structure of a sparse hypergraph from random cuts. Sparsity here means that both the number of hyperedges and the size of each hyperedge are constant. Hypergraphs can be used to represent relations in many real-world data sets. For example, one can represent the relation between books and readers (users) in the Amazon dataset with a hypergraph: book titles and Amazon users are mapped to nodes and hyperedges, respectively [8]. A node then belongs to a hyperedge if the corresponding book is read by the user represented by that hyperedge. When such graphs evolve over time (and space), the difference graph filtered by time and space is often sparse. Locating and learning the few hyperedges of such difference graphs from random cuts constitutes hypergraph sketching. We test our algorithms on hypergraphs generated from a dataset containing time-stamped records of messages between Yahoo! Messenger users, marked with the user locations (zip codes).

1.1 Approach and Related Work

The problem of recovering the sparsest solution of an underdetermined set of linear equations has received significant recent attention in the context of compressed sensing [9-11]. In compressed sensing, one tries to recover an unknown sparse vector using few linear observations (measurements), possibly in the presence of noise. The recent papers [12, 13] are of particular relevance to us since they establish a connection between learning sparse polynomials and compressed sensing: the problem of learning a sparse polynomial is equivalent to recovering the unknown sparse coefficient vector from linear measurements. By applying techniques from compressed sensing theory, namely the Restricted Isometry Property (see [12]) and incoherence (see [13]), the authors independently established results for reconstructing sparse polynomials using convex optimization. These results have near-optimal sample complexity. However, the running time of these algorithms is exponential in the underlying dimension $n$, because the measurement matrix of the equivalent compressed sensing problem requires one column for every possible non-zero monomial. In this paper, we show how to solve this problem in time polynomial in $n$ and $2^s$ under the assumption that the sparse polynomial has the unique sign property.
Our key contribution is a novel identification procedure that reduces the list of potentially non-zero coefficients from the naive bound of $2^n$ to $2^s$ when the function has this property. On the theoretical side, there is interesting recent work [14] that approximately learns sparse polynomial functions when the underlying domain is Gaussian; their results do not seem to translate to the Boolean domain. We also note the work of [15], which gives an algorithm for learning sparse Boolean functions with respect to a randomly chosen product distribution on $\{-1,1\}^n$; their work does not apply to the uniform distribution on $\{-1,1\}^n$.

On the practical side, we give an application of the theory to the problem of hypergraph sketching. We generalize prior work [12] that applied the compressed sensing approach discussed above to graph sketching on evolving social network graphs. While our algorithm has a higher sample complexity, its time complexity is greatly reduced in comparison. We test our algorithms on a real dataset and show that the algorithm scales well on sparse hypergraphs created from the Yahoo! Messenger dataset by filtering through time and location stamps.

2 Definitions

Consider a real-valued function over the Boolean hypercube, $f : \{-1,1\}^n \to \mathbb{R}$. Given a sequence of labeled samples of the form $\langle f(x), x \rangle$, where $x$ is sampled from the uniform distribution $U$ over the hypercube $\{-1,1\}^n$, we are interested in an efficient algorithm that learns the function $f$ with high probability. Through the Fourier expansion, $f$ can be written as a linear combination of monomials:

$f(x) = \sum_{S \subseteq [n]} c_S \chi_S(x), \quad x \in \{-1,1\}^n$   (1)

where $[n]$ is the set of integers from 1 to $n$, $\chi_S(x) = \prod_{i \in S} x_i$, and $c_S \in \mathbb{R}$. Let $\mathbf{c}$ be the vector of coefficients $c_S$. A monomial $\chi_S(x)$ is also called a parity function. More background on Boolean functions and the Fourier expansion can be found in [16].

In this work, we restrict ourselves to polynomials $f$ with sparsity $s$ in the Fourier domain, i.e., $f$ is a linear combination of unknown parity functions $\chi_{S_1}(x), \chi_{S_2}(x), \ldots, \chi_{S_s}(x)$ with $s$ unknown real coefficients $\{c_{S_i}\}_{i=1}^s$ such that $c_{S_i} \ne 0$ for $1 \le i \le s$; all other coefficients are 0. Let the subsets corresponding to the $s$ parity functions form a family of sets $I = \{S_i\}_{i=1}^s$. Finding $I$ is equivalent to finding the $s$ parity functions.

Note: In certain places, where the context makes it clear, we slightly abuse notation so that the set $S_i$ identifying a specific parity function is replaced by just the index $i$. The coefficients may then be denoted simply by $c_i$ and the parity functions by $\chi_i(\cdot)$.

Let $\mathbb{F}_2$ denote the binary field. Every parity function $\chi_i(\cdot)$ can be represented by a vector $p_i \in \mathbb{F}_2^{n \times 1}$, where the $j$-th entry $p_i(j)$ is 1 if $j \in S_i$ and 0 otherwise.

Definition 1. A set of $s$ parity functions $\{\chi_i(\cdot)\}_{i=1}^s$ is said to be linearly independent if the corresponding set of vectors $\{p_i\}_{i=1}^s$ is linearly independent over $\mathbb{F}_2$. Similarly, they are said to have rank $r$ if the dimension of the subspace spanned by $\{p_i\}_{i=1}^s$ is $r$.

Definition 2. The coefficients $\{c_i\}_{i=1}^s$ are said to be in general position if, for every set of values $b_i \in \{0, 1, -1\}$, $1 \le i \le s$, with at least one nonzero $b_i$, we have $\sum_{i=1}^s c_i b_i \ne 0$.

Definition 3. The coefficients $\{c_i\}_{i=1}^s$ are said to be $\mu$-separated if, for every set of values $b_i \in \{0, 1, -1\}$, $1 \le i \le s$, with at least one nonzero $b_i$, we have $\left|\sum_{i=1}^s c_i b_i\right| \ge \mu$.

Definition 4. A sign pattern is a distinct vector of signs $a = [\chi_1(\cdot), \chi_2(\cdot), \ldots, \chi_s(\cdot)] \in \{-1,1\}^{1 \times s}$ assumed by the set of $s$ parity functions.
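A small sketch (illustrative, not from the paper) of the objects in Definitions 1 and 4: each parity $\chi_S$ is a product of coordinates, its indicator vector $p_S$ lives in $\mathbb{F}_2^n$, and linear independence of parities is a rank computation over $\mathbb{F}_2$.

```python
import numpy as np

def chi(S, x):
    """Parity function chi_S(x) = prod_{i in S} x_i for x in {-1,+1}^n."""
    return int(np.prod(x[list(S)]))

def parity_vector(S, n):
    """Indicator vector p_S in F_2^n (1 in the coordinates of S)."""
    p = np.zeros(n, dtype=np.uint8)
    p[list(S)] = 1
    return p

def rank_gf2(rows):
    """Rank of a set of 0/1 vectors over F_2 by Gaussian elimination."""
    rows = [r.copy() for r in rows]
    rank, n = 0, len(rows[0])
    for col in range(n):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] ^= rows[rank]
        rank += 1
    return rank

n = 6
parities = [(0, 1), (2, 3), (0, 1, 2, 3)]          # the third is the F_2 sum of the first two
vectors = [parity_vector(S, n) for S in parities]
print("rank over F_2:", rank_gf2(vectors))          # prints 2: these parities are linearly dependent

x = np.array([1, -1, -1, 1, 1, -1])                 # one point of the hypercube
print("sign pattern:", [chi(S, x) for S in parities])
```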
Since this work involves switching representations between the real field and the binary field, we define a function $q$ that performs the switch.

Definition 5. $q : \{-1,1\}^{a \times b} \to \mathbb{F}_2^{a \times b}$ is the function that converts a sign matrix $X$ into a matrix $Y$ over $\mathbb{F}_2$ such that $Y_{ij} = q(X_{ij}) = 1 \in \mathbb{F}_2$ if $X_{ij} = -1$, and $Y_{ij} = q(X_{ij}) = 0 \in \mathbb{F}_2$ if $X_{ij} = 1$. Clearly, $q$ has an inverse function $q^{-1}$ such that $q^{-1}(Y) = X$.

We also present some definitions to deal with the case when the polynomial $f$ is not exactly $s$-sparse and the observations are noisy. Let $2^{[n]}$ denote the power set of $[n]$.

Definition 6. A polynomial $f : \{-1,1\}^n \to \mathbb{R}$ is called approximately $(s,\nu)$-sparse if there exists $I \subseteq 2^{[n]}$ with $|I| = s$ such that $\sum_{S \in I^c} |c_S| < \nu$, where $\{c_S\}$ are the Fourier coefficients as in (1). In other words, the sum of the absolute values of all coefficients except those corresponding to $I$ is small.

3 Problem Setting

Suppose $m$ labeled samples $\{\langle f(x_i), x_i \rangle\}_{i=1}^m$ are drawn from the uniform distribution $U$ on the Boolean hypercube. For any $B \subseteq 2^{[n]}$, let $\mathbf{c}_B \in \mathbb{R}^{2^n \times 1}$ be the vector of real coefficients such that $\mathbf{c}_B(S) = c_S$ for $S \in B$ and $\mathbf{c}_B(S) = 0$ for $S \notin B$. Let $A \in \mathbb{R}^{m \times 2^n}$ be such that every row of $A$ corresponds to one random input sample $x \sim U$. Let $x$ also denote the row index and $S \subseteq [n]$ the column index of $A$, with $A(x, S) = \chi_S(x)$. Let $A_{\mathcal{S}}$ denote the submatrix formed by the columns corresponding to the subsets in $\mathcal{S}$. Let $I$ be the set consisting of the $s$ parity functions of interest in both the sparse and the approximately sparse cases. A sparse representation of an approximately $(s,\nu)$-sparse function $f$ is $f_I = A(x)\,\mathbf{c}_I$, where $\mathbf{c}_I$ is as defined above.

We review the compressed sensing framework used in [12] and [13]. Specifically, for the remainder of the paper, we rely on [13] as a point of reference; we review their framework and explain how we use it to obtain our results, particularly for the noisy case. Let $y \in \mathbb{R}^m$ and let $\beta_{\mathcal{S}} \in \mathbb{R}^{2^n}$ satisfy $\beta_{\mathcal{S}}(S) = 0$ for all $S \in \mathcal{S}^c$. Note that here $\mathcal{S}$ is a subset of the power set $2^{[n]}$. Now consider the following convex program for noisy compressed sensing in this setting:

$\min \|\beta_{\mathcal{S}}\|_1 \quad \text{subject to} \quad \frac{1}{\sqrt{m}} \|A\beta_{\mathcal{S}} - y\|_2 \le \epsilon.$   (2)

Let $\beta_{\mathcal{S}}^{opt}$ be an optimum of program (2). Note that only the columns of $A$ in $\mathcal{S}$ are used in the program, so the convex program runs in time $\mathrm{poly}(m, |\mathcal{S}|)$. The incoherence property of the matrix $A$ in [13] implies the following.

Theorem 3 ([13]). For any family of subsets $I \subseteq 2^{[n]}$ such that $|I| = s$, with $m = 4096\,ns^2$ and $c_1 = 4$, $c_2 = 8$, for any feasible point $\beta_{\mathcal{S}}$ of program (2) we have

$\|\beta_{\mathcal{S}} - \beta_{\mathcal{S}}^{opt}\|_2 \le c_1 \epsilon + c_2\, n^{1/4}\, \|\beta_{I^c \cap \mathcal{S}}\|_1$   (3)

with probability at least $1 - O(1/n)$.

When $\mathcal{S}$ is set to the power set $2^{[n]}$, $\epsilon = 0$, and $y$ is the vector of observed values of an $s$-sparse polynomial, the $s$-sparse vector $\mathbf{c}_I$ is a feasible point of program (2). By Theorem 3, the program recovers the sparse vector $\mathbf{c}_I$ and hence learns the function. The only caveat is that the complexity is exponential in $n$. The main idea behind our algorithms for noiseless and noisy sparse function learning is to capture the actual $s$-sparse set $I$ of interest in a small set $\mathcal{S}$ of coefficients, with $|\mathcal{S}| = O(2^s)$, by a separate algorithm that runs in time $\mathrm{poly}(n, 2^s)$. Using the restricted set of coefficients $\mathcal{S}$, we then search for the sparse solution in the noisy and noiseless cases using program (2).
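As an illustration of this two-stage plan, the sketch below sets up program (2) over a restricted candidate family $\mathcal{S}$, building one measurement column per candidate parity. It is our own minimal example; the choice of the cvxpy package for the $\ell_1$ minimization is ours and not prescribed by the paper.

```python
import numpy as np
import cvxpy as cp

def measurement_matrix(X, candidates):
    """A restricted to the candidate parities: A[i, j] = chi_{S_j}(x_i)."""
    return np.column_stack([np.prod(X[:, list(S)], axis=1) for S in candidates])

def recover_coefficients(X, y, candidates, eps):
    """Program (2) over the restricted support: min ||b||_1 s.t. ||A b - y||_2 / sqrt(m) <= eps."""
    A = measurement_matrix(X, candidates)
    m = X.shape[0]
    b = cp.Variable(len(candidates))
    constraints = [cp.norm(A @ b - y, 2) / np.sqrt(m) <= eps]
    cp.Problem(cp.Minimize(cp.norm1(b)), constraints).solve()
    return b.value

# Toy usage: a 2-sparse f on n = 8 variables; the candidate set contains the true parities.
rng = np.random.default_rng(1)
n, m = 8, 200
X = rng.choice([-1, 1], size=(m, n))
true = [((0, 3), 1.7), ((2, 4, 5), -0.6)]
y = sum(c * np.prod(X[:, list(S)], axis=1) for S, c in true)
candidates = [(0, 3), (2, 4, 5), (1,), (6, 7), (0, 1, 2)]
print(np.round(recover_coefficients(X, y, candidates, eps=1e-6), 3))  # ~[1.7, -0.6, 0, 0, 0]
```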
Lemma 1. Given an algorithm that runs in time $\mathrm{poly}(n, 2^s)$ and generates a set of parities $\mathcal{S}$ such that $|\mathcal{S}| = O(2^s)$ and $I \subseteq \mathcal{S}$ with $|I| = s$, program (2) with $\mathcal{S}$ and $m = 4096\,ns^2$ random samples as inputs runs in time $\mathrm{poly}(n, 2^s)$ and learns the correct function with probability $1 - O(1/n)$.

Unique Sign Pattern Property: The key property that lets us find a small $\mathcal{S}$ efficiently is the unique sign pattern property. Observe that an $s$-sparse function can take at most $2^s$ different real values. If the maximum value always corresponds to a unique pattern of signs of the parities, then by looking only at the random samples $x$ corresponding to the subsequent $O(n)$ occurrences of this maximum value, we show that all the parity functions needed to learn $f$ are captured in a small set of size $2^{s+1}$ (see Lemma 2 and its proof). The unique sign property again plays an important role, together with Theorem 3 and some additional technicalities, in the noisy case, which we treat in Section 2 of the supplementary material.

In the next section, we provide an algorithm that generates the bounded set $\mathcal{S}$ for a noiseless $s$-sparse function $f$ and formally state guarantees for it.

4 Algorithm and Guarantees: Noiseless Case

Let $I$ be the family of $s$ subsets $\{S_i\}_{i=1}^s$, each corresponding to one of the $s$ parity functions $\chi_{S_i}(\cdot)$ in an $s$-sparse function $f$. In this section, we provide an algorithm, named Learn Bool, that first finds a small subset $\mathcal{S}$ of the power set $2^{[n]}$ containing the elements of $I$ and then uses program (2) with $\mathcal{S}$. We show that the algorithm learns $f$ in time $\mathrm{poly}(n, 2^s)$ from uniformly drawn labeled samples on the Boolean hypercube with high probability under some natural conditions. Recall that if $f(x)$ attains its maximum value only when $[\chi_1(x), \chi_2(x), \ldots, \chi_s(x)] = a_{max} \in \{-1,1\}^s$ for some unique sign pattern $a_{max}$, then the function is said to possess the unique sign property. We now state the main technical lemma for the unique sign property.

Lemma 2. If an $s$-sparse function $f$ has the unique sign property then, in Algorithm 1, the set $\mathcal{S}$ satisfies $I \subseteq \mathcal{S}$ and $|\mathcal{S}| \le 2^{s+1}$ with probability $1 - O(1/n)$, and the algorithm runs in time $\mathrm{poly}(n, 2^s)$.

Proof. See the supplementary material.

The proof of the above lemma involves showing that the random matrix $Y_{max}$ (see Algorithm 1) has rank at least $n - s$, leading to at most $2^s$ solutions for each equation in (4). The feasible solutions can be obtained by Gaussian elimination over the binary field.

Theorem 4. Let $f$ be an $s$-sparse function that satisfies at least one of the following properties: (a) the coefficients $\{c_i\}_{i=1}^s$ are in general position; (b) the $s$ parity functions are linearly independent; (c) all the coefficients are positive. Given labeled samples, Algorithm 1 learns $f$ exactly (i.e., $v^{opt} = \mathbf{c}$) in time $\mathrm{poly}(n, 2^s)$ with probability $1 - O(1/n)$.

Proof. See the supplementary material.

Smoothed Analysis Setting: Perturbing the $c_i$'s with Gaussian random variables of standard deviation $\sigma > 0$, or with random variables drawn from any reasonable continuous distribution, ensures that the perturbed function satisfies property (a) with probability 1.

Random Parity Functions: When the $c_i$'s are arbitrary and the $s$ parity functions are drawn uniformly at random from $2^{[n]}$, property (b) holds with high probability if $s$ is a constant.
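Algorithm 1 (Learn Bool) is stated next. Its key step solves the two affine linear systems in (4) over $\mathbb{F}_2$ by Gaussian elimination and enumerates all feasible solutions, of which there are at most $2^s$ per system when $Y_{max}$ has rank at least $n - s$ (Lemma 2). Below is a minimal illustrative sketch of that enumeration step; it is our own code with made-up names, not the paper's implementation.

```python
import itertools
import numpy as np

def solve_affine_gf2(Y, t):
    """All p in F_2^n with Y p = t (mod 2); returns [] if infeasible.
    Reduced row echelon form over F_2, then enumeration of the null space."""
    A = np.concatenate([Y % 2, (t % 2).reshape(-1, 1)], axis=1).astype(np.uint8)
    m, n = Y.shape
    pivots, row = [], 0
    for col in range(n):
        piv = next((r for r in range(row, m) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]] = A[[piv, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]
        pivots.append(col)
        row += 1
    if any(A[r, n] for r in range(row, m)):          # a 0 = 1 row: no solution
        return []
    free = [c for c in range(n) if c not in pivots]
    base = np.zeros(n, dtype=np.uint8)               # particular solution (free variables = 0)
    for r, c in enumerate(pivots):
        base[c] = A[r, n]
    basis = []                                       # one null-space vector per free variable
    for fc in free:
        v = np.zeros(n, dtype=np.uint8)
        v[fc] = 1
        for r, c in enumerate(pivots):
            v[c] = A[r, fc]
        basis.append(v)
    sols = []
    for mask in itertools.product([0, 1], repeat=len(basis)):
        p = base.copy()
        for bit, v in zip(mask, basis):
            if bit:
                p ^= v
        sols.append(p)
    return sols

# Toy check: samples on n = 4 variables where the parities {0,1} and {2} both equal +1.
Y = np.array([[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 0, 1]], dtype=np.uint8)
ones, zeros = np.ones(4, dtype=np.uint8), np.zeros(4, dtype=np.uint8)
P = solve_affine_gf2(Y, zeros) + solve_affine_gf2(Y, ones)
print([sorted(map(int, np.flatnonzero(p))) for p in P])   # [[], [0, 1], [2], [0, 1, 2]]
```

The printed candidate family contains the two true parities (and their $\mathbb{F}_2$ span), consistent with the bound $|\mathcal{S}| \le 2^{s+1}$.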
Algorithm 1: Learn Bool
Input: Sparsity parameter $s$; $m_1 = 2n2^s$ random labeled samples $\{\langle f(x_i), x_i \rangle\}_{i=1}^{m_1}$.
1. Pick the samples $\{x_{i_j}\}_{j=1}^{n_{max}}$ corresponding to the maximum value of $f$ observed among all $m_1$ samples. Stack the $x_{i_j}$ row-wise into a matrix $X_{max}$ of dimensions $n_{max} \times n$.
2. Initialize $\mathcal{S} = \emptyset$. Let $Y_{max} = q(X_{max})$. Find all feasible solutions $p \in \mathbb{F}_2^{n \times 1}$ such that

$\mathbf{1}_{n_{max} \times 1} = Y_{max}\, p \quad \text{or} \quad \mathbf{0}_{n_{max} \times 1} = Y_{max}\, p.$   (4)

3. Collect all feasible solutions $p$ of either equation in the set $P \subseteq \mathbb{F}_2^{n \times 1}$, and set $\mathcal{S} = \{\{j \in [n] : p(j) = 1\} \mid p \in P\}$.
4. Using $m = 4096\,ns^2$ further samples (the rows of $A$ correspond to these new samples), solve

$\beta_{\mathcal{S}}^{opt} = \arg\min \|\beta_{\mathcal{S}}\|_1 \quad \text{such that} \quad A\beta_{\mathcal{S}} = y,$   (5)

where $y$ is the vector of the $m$ observed values. Set $v^{opt} = \beta_{\mathcal{S}}^{opt}$.
Output: $v^{opt}$.

5 A Sparse Polynomial Learning Application: Hypergraph Sketching

Hypergraphs can be used to model relations in real-world data sets (e.g., books read by users on Amazon). We show that cut functions on hypergraphs satisfy the unique sign property. Learning the cut function of a sparse hypergraph from random cuts is a special case of learning a sparse polynomial from samples drawn uniformly from the Boolean hypercube. To track the evolution of a large hypergraph over a small time interval, it is enough to learn the cut function of the difference graph, which is often sparse. This is the graph sketching problem. Previously, graph sketching was applied to social network evolution [12]. We generalize this to hypergraphs, show that they satisfy the unique sign property, which enables faster algorithms, and provide experimental results on real data sets.

5.1 Graph Sketching

A hypergraph $G = (V, E)$ is a set of vertices $V$ together with a set $E$ of subsets of $V$ called hyperedges. The size of a hyperedge is the number of vertices it connects. Let $d$ be the maximum hyperedge size of $G$, and let $|V| = n$ and $|E| = s$. A random cut $S \subseteq V$ is a set of vertices selected uniformly at random. The value of the cut $S$ is $c(S) = |\{e \in E : e \cap S \ne \emptyset,\ e \cap (V \setminus S) \ne \emptyset\}|$. Graph sketching is the problem of identifying the graph structure from random queries that evaluate the value of a random cut, where $s \ll n$ (the sparse setting).

Hypergraphs naturally specify relations among a set of objects through hyperedges. For example, Amazon users can form the set $E$ and Amazon books the set $V$. Each user reads a subset of books, which forms a hyperedge. Learning the hypergraph corresponds to identifying the set of books bought by each user. For more examples of hypergraphs in real data sets, we refer the reader to [8]. Such hypergraphs evolve over time. The difference graph between two consecutive time instants is expected to be sparse (the number of hyperedges $s$ and the maximum hyperedge size $d$ are small). We are interested in learning such hypergraphs from random cut queries.

For simplicity and convenience, we consider the cut-complement query, i.e., c-cut, which returns $s - c(S)$. One can easily represent the c-cut query with a sparse polynomial as follows. Let node $i$ correspond to the variable $x_i \in \{-1, +1\}$. A random cut assigns each $x_i$ uniformly at random from $\{-1, +1\}$; the variables assigned $+1$ belong to the random cut $S$. The value is given by the polynomial

$f_{c\text{-}cut}(x) = \sum_{e \in E} \frac{1}{2^{|e|-1}} \sum_{J \subseteq e,\ |J| \text{ even}} \prod_{i \in J} x_i.$   (6)

Hence the c-cut function is a sparse polynomial with sparsity at most $s\,2^{d-1}$. Only the variables corresponding to nodes that belong to some hyperedge appear in the polynomial; we call these the relevant variables and denote their number by $k$. Note that in our sparse setting $k \le sd$. For a hypergraph with no singleton hyperedges, given the c-cut function it is easy to recover the hyperedges from (6). Therefore, we focus on learning the c-cut function to sketch the hypergraph.
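As a sanity check of the expansion in (6), the following illustrative script (ours, with a made-up toy hypergraph) compares the c-cut value computed directly from the cut definition with the value of the polynomial whose coefficients are $2^{-(|e|-1)}$ for every even-size subset $J$ of every hyperedge $e$:

```python
import itertools
import numpy as np

edges = [{0, 1, 2}, {2, 3}]                      # a toy hypergraph on n = 4 nodes, s = 2, d = 3
n, s = 4, len(edges)

def c_cut_direct(x):
    """s - c(S): the number of hyperedges NOT cut by the assignment x in {-1,+1}^n."""
    return sum(len({x[i] for i in e}) == 1 for e in edges)

def c_cut_fourier(x):
    """Expansion (6): sum over hyperedges e of 2^{-(|e|-1)} * sum over even subsets J of prod_{i in J} x_i."""
    total = 0.0
    for e in edges:
        for r in range(0, len(e) + 1, 2):        # even-size subsets only
            for J in itertools.combinations(sorted(e), r):
                total += np.prod([x[i] for i in J]) / 2 ** (len(e) - 1)
    return total

for x in itertools.product([-1, 1], repeat=n):
    assert abs(c_cut_direct(x) - c_cut_fourier(x)) < 1e-9
print("expansion (6) matches the direct c-cut value on all", 2 ** n, "inputs")
```

All coefficients in the expansion are positive, which is exactly the hypothesis of Theorem 4(c) used in the next subsection.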
When $G$ is a graph with edges of cardinality 2, the compressed sensing approach (using program (2)) with the cut (or c-cut) values as measurements is shown in [12] to be very efficient in terms of sample complexity, i.e., the required number of queries. Its run time is also manageable because the total number of candidate parities is $O(n^2)$. However, for hypergraphs, i.e., when $d$ is a larger constant, the compressed sensing approach cannot scale computationally ($\mathrm{poly}(n^d)$ runtime). Here, based on the theory developed above, we give a faster algorithm based on the unique sign property with sample complexity $m_1 = O(2^k d \log n + 2^{2d+1} s^2 (\log n + k))$ and run time $O(2^k m_1 + n 2^d \log n)$.

We observe that the c-cut polynomial satisfies the unique sign property. From (6), it is evident that the polynomial has only positive coefficients; therefore, by Theorem 4, algorithm Learn Bool succeeds. The maximum value of the c-cut function is the number of hyperedges. Notice that the maximum value is certainly attained in two configurations of the relevant variables: when all relevant variables are $+1$ or when all are $-1$. Therefore, the maximum value is observed on average once in every $2^{k-1} \le 2^{sd}$ samples. Thus, a direct application of Learn Bool yields $\mathrm{poly}(n, 2^{k-1})$ time complexity, which improves on the $O(n^d)$ bound for small $s$ and $d$. Improving further, we provide a more efficient algorithm tailored to the hypergraph sketching problem, which makes use of the unique sign property and some other properties of the cut function. Algorithm Learn Graph (Algorithm 4) is provided in the supplementary material.

[Figure 1: Performance figures comparing Learn Graph and the compressed sensing approach. (a) Runtime (seconds) vs. number of variables $n$, for $d = 3$ and $s = 1$. (b) Probability of error vs. $\alpha$ (number of samples / $n$) for Settings 1-4.]

Theorem 5. Algorithm 4 exactly learns the c-cut function with probability $1 - O(1/n)$, with sample complexity $m_1 = O(2^k d \log n + 2^{2d+1} s^2(\log n + k))$ and time complexity $O(2^k m_1 + n 2^d \log n)$.

Proof. See the supplementary material.

5.2 Yahoo! Messenger User Communication Pattern Dataset

We performed simulations using MATLAB on an Intel(R) Xeon(R) quad-core 3.6 GHz machine with 16 GB RAM and 10 MB cache. We run our algorithm on the Yahoo! Messenger User Communication Pattern Dataset [17]. This dataset contains time-stamped user communication data, i.e., information about a large number of messages sent over Yahoo! Messenger, over a duration of 28 days.

Dataset: Each row represents a message. The first two columns give the day and the time (time stamp) of the message. The third and fifth columns give the IDs of the transmitting and receiving users, respectively. The fourth column gives the zipcode (spatial stamp) from which the message was transmitted. The sixth column indicates whether the transmitter was in the contact list of the receiving user (y) or not (n). If a transmitter sends more than one message to the same receiver from the same zipcode, only the first message appears in the dataset. In total, there are 100,000 unique users and 5,649 unique zipcodes.
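The next paragraph describes how we turn this log into a hypergraph; the sketch below shows one way such an extraction could look in code. The column order follows the dataset description above, but the file name, the whitespace delimiter, and the assumption that the time stamp is given in seconds are ours, not part of the dataset release.

```python
from collections import defaultdict

def build_hypergraph(path, day, start_sec, end_sec, zipcodes):
    """One hyperedge per transmitting user: the set of receivers it messaged
    within the chosen day/time window and zipcode set."""
    edges = defaultdict(set)
    with open(path) as fh:
        for line in fh:
            d, t, sender, zipcode, receiver, in_contacts = line.split()[:6]
            if int(d) == day and start_sec <= int(t) < end_sec and zipcode in zipcodes:
                edges[sender].add(receiver)
    # Keep only hyperedges with more than one node (singletons are masked in the c-cut function).
    return {u: nodes for u, nodes in edges.items() if len(nodes) > 1}

# Hypothetical usage: a 5-minute window on day 3, restricted to two zip codes.
# hyperedges = build_hypergraph("yahoo_messages.txt", day=3,
#                               start_sec=36000, end_sec=36300,
#                               zipcodes={"78701", "78705"})
# n = len({v for nodes in hyperedges.values() for v in nodes})   # receiving users = hypergraph nodes
```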
We form a hypergraph from the dataset as follows. The transmitting users form the hyperedges and the receiving users form the nodes of the hypergraph. A hyperedge connects a set $T$ of users if there is a transmitting user that sends a message to every user in $T$. In any given short time interval $\delta t$ and small set of locations $\delta x$ (specified by a number of zip codes), only a few users transmit ($s$), and they transmit to very few users ($d$). The complete set of nodes in the hypergraph ($n$) is taken to be the receiving users who are active during $m$ consecutive intervals of length $\delta t$ and in a set of $\delta x$ zipcodes. This gives rise to a sparse graph. We identify the active set of transmitting users (hyperedges) and their corresponding receivers (nodes of these hyperedges) during a short time interval $\delta t$ and a randomly selected space interval ($\delta x$, i.e., zip codes), out of a large pool of receivers (nodes) observed during $m$ intervals of length $\delta t$. The values of $\delta t$, $m$, and $\delta x$ chosen for the experiments are given in Table 1. We note that $n$ is usually on the order of 1000.

Remark: Our task is to learn the c-cut function from random queries, i.e., random $\pm 1$ assignments of the variables and the corresponding c-cut values. The generated sparse graph contains only hyperedges with more than one node; hyperedges (transmitting users) with just one node in the sparse hypergraph are not taken into account, because a singleton hyperedge is always counted in the c-cut function and so its presence is effectively masked. First, we identify the relevant variables that participate in the sparse graph. After identifying this set of candidates, correlating the corresponding candidate parities with the function output yields the Fourier coefficient of each parity (see Algorithm 4; a short sketch of this correlation step appears at the end of Section 5.2.1).

Table 1: Runtime (seconds) for different graphs and simulation parameters. LG: Learn Graph, CS: compressed sensing based algorithm.

(a) Runtime for a graph with d = 4 and s = 1:

  n     88      159    288    556    1221
  LG    1.96    2.13   2.23   2.79   4.94
  CS    265.63  -      -      -      -

(b) Runtime for a graph with d = 4 and s = 3:

  n     52      104       246    412    1399
  LG    1.91    2.08      2.08   2.30   4.98
  CS    39.89   > 10823   -      -      -

(c) Simulation parameters for Fig. 1b:

  Setting     Interval   # of Int.   n      max(d)   max(s)   Zip. Set Size
  Setting 1   5 min.     20          6822   10       19       20
  Setting 2   20 sec.    200         5730   22       4        200
  Setting 3   10 min.    10          6822   11       13       2
  Setting 4   2 min.     50          6822   30       21       50

5.2.1 Performance Comparison with the Compressed Sensing Approach

First, we compare the runtime of our implementation of Learn Graph with the compressed sensing based algorithm from [12]. Both algorithms correctly identify the relevant variables over the entire range of parameters considered. The last step of finding the corresponding Fourier coefficients is omitted; it can be implemented easily (Algorithm 4) without significantly affecting the running time. As can be seen in Tables 1a and 1b and Fig. 1a, Learn Graph scales well to graphs on thousands of nodes. In contrast, the compressed sensing approach must handle a measurement matrix of size $O(n^d)$, which becomes prohibitively large on graphs with more than a few hundred nodes.
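As noted above, the omitted last step of Learn Graph recovers each Fourier coefficient by correlating the corresponding candidate parity with the observed c-cut values, since $\mathbb{E}[f(x)\chi_S(x)] = c_S$ under the uniform distribution (and is 0 for parities outside the support). A minimal sketch with illustrative names:

```python
import numpy as np

def estimate_coefficient(X, y, S):
    """Empirical Fourier coefficient: (1/m) * sum_i y_i * chi_S(x_i),
    where X holds the +/-1 query vectors row-wise and y the observed c-cut values."""
    return float(np.mean(y * np.prod(X[:, list(S)], axis=1)))

# Toy usage on the c-cut polynomial of a single hyperedge {0, 1}: f(x) = 1/2 + (1/2) x_0 x_1.
rng = np.random.default_rng(2)
X = rng.choice([-1, 1], size=(5000, 6))
y = 0.5 + 0.5 * X[:, 0] * X[:, 1]
print(round(estimate_coefficient(X, y, (0, 1)), 2))   # close to 0.5
print(round(estimate_coefficient(X, y, (2, 3)), 2))   # close to 0.0
```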
5.2.2 Error Performance of Learn Graph

The error probability (the probability that the correct c-cut function is not recovered) versus the number of samples used is plotted for four different experimental settings of $\delta t$, $\delta x$, and $m$ in Fig. 1b. For each time interval, the error probability is estimated by averaging the number of errors over 100 different trials. For each value of $\alpha$ (number of samples divided by $n$), the error probability is then averaged over the time intervals. We keep only the intervals for which the graph filtered by the considered zipcodes contains at least one user with more than one neighbor. We find that for the first three settings, the error probability decreases with more samples. In the fourth setting, $d$ and $s$ are very large, so a large number of samples is required; for that reason the error probability does not improve significantly. The probability of error can be reduced by repeating the experiment multiple times and taking a majority vote, at the cost of significantly more samples; our plot shows only the probability of error without such majority amplification.

6 Conclusions

We presented a novel algorithm for learning sparse polynomials from random samples on the Boolean hypercube. While the general problem of learning all sparse polynomials is notoriously hard, we show that almost all sparse polynomials can be learned efficiently by our algorithm, because the unique sign property holds for randomly perturbed coefficients, in addition to several other natural settings. As an application, we show that graph and hypergraph sketching lead to sparse polynomial learning problems that always satisfy the unique sign property. This allows us to obtain efficient reconstruction algorithms that outperform the previous state of the art for these problems. An important open problem is to achieve the sample complexity of [12] while keeping the computational complexity polynomial in $n$.

Acknowledgments

M.K., K.S., and A.D. acknowledge the support of NSF via CCF 1422549, 1344364, 1344179, DARPA STTR, and an ARO YIP award.

References

[1] E. Kushilevitz and Y. Mansour, Learning decision trees using the Fourier spectrum, SIAM J. Comput., vol. 22, no. 6, pp. 1331-1348, 1993.
[2] Y. Mansour, Randomized interpolation and approximation of sparse polynomials, SIAM J. Comput., vol. 24, no. 2, pp. 357-368, 1995.
[3] R. Schapire and R. Sellie, Learning sparse multivariate polynomials over a field with queries and counterexamples, Journal of Computer and System Sciences, vol. 52, 1996.
[4] A. C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss, Near-optimal sparse Fourier representations via sampling, in Proceedings of STOC, 2002, pp. 152-161.
[5] P. Gopalan, A. Kalai, and A. Klivans, Agnostically learning decision trees, in Proceedings of STOC, 2008, pp. 527-536.
[6] A. Akavia, Deterministic sparse Fourier approximation via fooling arithmetic progressions, in Proceedings of COLT, 2010, pp. 381-393.
[7] D. Spielman and S. Teng, Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time, Journal of the ACM, vol. 51, 2004.
[8] P. Li, Relational learning with hypergraphs, Ph.D. dissertation, École Polytechnique Fédérale de Lausanne, 2013.
[9] E. J. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489-509, 2006.
[10] E. J. Candès and T. Tao, Decoding by linear programming, IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203-4215, 2005.
[11] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289-1306, 2006.
[12] P. Stobbe and A. Krause, Learning Fourier sparse set functions, in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012, pp. 1125-1133.
[13] S. Negahban and D. Shah, Learning sparse Boolean polynomials, in Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2012, pp. 2032-2036.
[14] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang, Learning sparse polynomial functions, in Proceedings of SODA, 2014.
[15] A. T. Kalai, A. Samorodnitsky, and S.-H. Teng, Learning and smoothed analysis, in Proceedings of FOCS. IEEE Computer Society, 2009, pp. 395-404.
[16] R. O'Donnell, Analysis of Boolean Functions. Cambridge University Press, 2014.
[17] Yahoo! Webscope dataset ydata-ymessenger-user-communication-pattern-v1_0, http://research.yahoo.com/Academic_Relations.