# Exact Inference in High-order Structured Prediction

Chuyang Ke, Jean Honorio. Department of Computer Science, Purdue University, West Lafayette, IN, USA. Correspondence to: Chuyang Ke.

In this paper, we study the problem of inference in high-order structured prediction tasks. In the context of Markov random fields, the goal of a high-order inference task is to maximize a score function on the space of labels, and the score function can be decomposed into a sum of unary and high-order potentials. We apply a generative model approach to study the problem of high-order inference, and provide a two-stage convex optimization algorithm for exact label recovery. We also provide a new class of hypergraph structural properties related to hyperedge expansion that drives the success in general high-order inference problems. Finally, we connect the performance of our algorithm and the hyperedge expansion property using a novel hypergraph Cheeger-type inequality.

1. Introduction

Structured prediction has been widely used in various machine learning fields in the past 20 years, including applications like social network analysis, computer vision, molecular biology, natural language processing (NLP), among others. A common objective in these tasks is assigning or recovering labels: given some possibly noisy observation, the goal is to output a group label for each entity in the task. In social network analysis, this could be detecting communities based on user profiles and preferences (Ke & Honorio, 2022b; Kelley et al., 2012). In computer vision, researchers want the AI to decide whether a pixel is in the foreground or background (Nowozin et al., 2011). In biology, it is sometimes desirable to cluster molecules by structural similarity (Nugent & Meila, 2010). In NLP, part-of-speech tagging is probably one of the most well-known structured prediction tasks (Weiss & Taskar, 2010).

From a methodological point of view, a standard approach in the structured prediction tasks above is to recover the global structure by exploiting many local structures. Take social networks as an example. A widely used assumption in social network analysis is homophily: users with similar profiles and preferences are more likely to become friends (Ke & Honorio, 2018; 2019). Intuitively, a structured prediction algorithm tends to assign two users the same label if they have a higher affinity score.

Similarly, the same idea can be motivated in the context of Markov random fields (MRFs). Assume all entities form an undirected graph $G = (V, E)$; structured prediction can then be viewed as the task of solving the following inference problem (Bello & Honorio, 2019):

$$\max_{y \in L^{|V|}} \ \sum_{v \in V,\, l \in L} c_v(l)\,\mathbf{1}[y_v = l] \;+\; \sum_{(v_1, v_2) \in E}\ \sum_{l_1, l_2 \in L} c_{v_1, v_2}(l_1, l_2)\,\mathbf{1}[y_{v_1} = l_1,\, y_{v_2} = l_2], \tag{1}$$

where $L$ is the space of labels, $c_v(l)$ is the score of assigning label $l$ to node $v$, and $c_{v_1,v_2}(l_1,l_2)$ is the score of assigning labels $l_1$ and $l_2$ to neighboring nodes $v_1$ and $v_2$. In the MRF and inference literature, the two terms in (1) are often referred to as unary and pairwise potentials, respectively.
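To make the objective in (1) concrete, the following is a minimal brute-force sketch that evaluates and maximizes it on a toy instance. It is illustrative Python only; the data structures (dictionaries of unary and pairwise scores) and function names are our own assumptions, not part of the paper.

```python
import itertools

def score(y, unary, pairwise):
    # y: dict node -> label; unary: dict node -> {label: c_v(l)}
    # pairwise: dict (v1, v2) -> {(l1, l2): c_{v1,v2}(l1, l2)}
    total = sum(c[y[v]] for v, c in unary.items())
    total += sum(c[(y[v1], y[v2])] for (v1, v2), c in pairwise.items())
    return total

def brute_force_map(nodes, labels, unary, pairwise):
    # Enumerate all |L|^|V| labelings and keep the best; only viable for toy instances.
    best = max(itertools.product(labels, repeat=len(nodes)),
               key=lambda ys: score(dict(zip(nodes, ys)), unary, pairwise))
    return dict(zip(nodes, best))
```

Exact maximization by enumeration is exponential in $|V|$, which is precisely why the structural conditions and convex relaxations developed later in the paper matter.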
The inference formulation above allows one to recover the global structure by finding a configuration that maximizes the sum of unary and pairwise local scores. However, entities in many real-world problems interact beyond the pairwise fashion. Take the social network example again, but this time let us focus on an academic co-authorship network: many published papers are written by more than two authors (Liu et al., 2005). Such high-order interactions cannot be captured by pairwise structures. Geometrically, the co-authorship network can no longer be represented by a graph. As a result, the introduction of hypergraphs is necessary to model high-order structured prediction problems.

In this paper, we study the problem of high-order structured prediction, in which high-order potentials are considered instead of pairwise potentials. Using the MRF formulation, we are interested in inference problems of the following form:

$$\max_{y\in L^{|V|}}\ \sum_{v\in V,\, l\in L} c_v(l)\,\mathbf{1}[y_v = l] \;+\; \sum_{e\in E,\ e=(v_1,\dots,v_m)}\ \sum_{l_1,\dots,l_m\in L} c_e(l_1,\dots,l_m)\,\mathbf{1}[y_{v_1}=l_1,\dots,y_{v_m}=l_m], \tag{2}$$

where $m$ is the order of the inference problem as well as of the hypergraph (each hyperedge connects $m$ vertices), and $c_e(l_1,\dots,l_m)$ is the score of assigning labels $l_1$ through $l_m$ to neighboring nodes $v_1$ through $v_m$ connected by hyperedge $e\in E$.

1.1. Inference as a Recovery Task

Structured prediction and inference problems with unary and pairwise potentials in the form of (1) have been studied in prior literature. Globerson et al. (2015) introduced the problem of label recovery in the case of two-dimensional grid lattices, and analyzed the conditions for approximate inference. Along the same line of work, Foster et al. (2018) generalized the model by allowing tree decompositions. Another flavor is the problem of exact inference, for which Bello & Honorio (2019) proposed a convex semidefinite programming (SDP) approach. In these works, the problem of label recovery is motivated by a generative model, which assumes the existence of a ground truth label vector $y^*$ and generates (possibly noisy) unary and pairwise observations based on label interactions.

Unfortunately, little is known for structured prediction with high-order potentials and hypergraphs. In recent years, there have been various attempts to generalize graph properties (including the hypergraph Laplacian, Rayleigh quotient, hyperedge expansion, and Cheeger constant, among others) to hypergraphs (Li & Milenkovic, 2018; Mulas, 2021; Yoshida, 2019; Chan et al., 2018; Chen et al., 2017; Chang et al., 2020). However, due to the nature of structured prediction tasks, the hypergraph definitions must fulfill certain properties, and none of the definitions in the aforementioned works do. This makes it challenging to design hypergraph-based label recovery algorithms. Furthermore, none of the aforementioned works provide guarantees of either approximate or exact inference.

In this work, we apply a generative model approach to study the problem of high-order inference. We analyze the task of label recovery, and answer the following question:

Problem 1 (Label Recovery). Is there any algorithm that takes noisy unary and high-order local observations as the input, and correctly recovers the underlying true labels?

1.2. Inference and Structural Properties

In the MRF inference literature, a central and long-standing discussion focuses on inference solvability versus certain structural properties of the problem. To see this, we first revisit various classes of structural properties in pairwise inference problems (i.e., in the form of (1)). Chandrasekaran et al.
(2012) studied treewidth in graphs as a structural property, and showed that graphs with low treewidth are solvable. Schraudolph & Kamenetsky (2008) demonstrated that planar graphs can be solved. Boykov & Veksler (2006) analyzed graphs with binary labels and submodular pairwise potentials. Bello & Honorio (2019) showed that inference can be achieved for graphs that are good expanders, or bad expanders plus an Erdős-Rényi random graph. It is worth highlighting that the structural properties above are not directly comparable or reducible to one another. Instead, they characterize the difficulty of an inference problem from different angles, or in other words, for different classes of graphs.

Similar discussions about inference versus structural properties exist in the high-order MRF inference literature. For example, Komodakis & Paragios (2009) investigated hypergraphs fulfilling the property of one sub-hypergraph per clique, and proved that high-order MRF inference can be achieved through solving linear programs (LPs). Fix et al. (2014) studied inference in hypergraphs with the property of local completeness. Gallagher et al. (2011) analyzed the performance of high-order MRF inference through order reduction versus the number of non-submodular edges. However, these works do not provide theoretical guarantees of either approximate or exact inference.

In this paper, we provide a new class of hypergraph structural properties by analyzing hyperedge expansion. To get some intuition, consider a social network with two disconnected sub-networks. With an ideal algorithm, we may recover the user communities in subnet 1 and those in subnet 2, but since there is no interaction between the two subnets at all, we will not be able to infer the global community structure (e.g., the relationship between the recovered communities in subnet 1 and subnet 2). A less extreme case is a network with bottlenecks, i.e., a network in which removing the bottleneck edges disconnects the network. For similar reasons, one can imagine that inference in networks with bottlenecks can be hard if noise is present. See Figure 1 for an illustration. In pairwise graphs (2-graphs), such connectivity / bottleneck properties can be characterized by the edge expansion (i.e., the Cheeger constant) of the graph. Characterizing similar expansion properties in high-order hypergraphs poses a challenge, especially if one wants to relate such topological properties to the conditions of exact inference.

Figure 1. Graph expansion examples: (a) disconnected graph (zero edge expansion); (b) graph with a bottleneck (small edge expansion); (c) graph without bottlenecks (large edge expansion). Figure 1a shows a disconnected graph. With an ideal algorithm we may be able to recover the user communities in the top subgraph and in the bottom subgraph, but since there is no observed interaction between the two, we will not be able to infer the global structure. Figure 1b connects the two components in Figure 1a using a single orange edge. In this case the orange edge is the bottleneck: removing the orange edge disconnects the graph. Figure 1c adds two more edges to Figure 1b. In this case every component is connected with no weak bottleneck. Edge expansion is a structural property which characterizes how connected the components in a graph are.

Problem 2 (Structural Property). Under what topological conditions will our label recovery algorithm work correctly with high probability?

Summary of our contribution.
Our work is highly theoretical. We provide a series of novel definitions and results in this paper:

- We provide a new class of hypergraph structural properties for high-order inference problems. We derive a novel Cheeger-type inequality, which relates the tensor spectral gap of a hypergraph Laplacian to a Cheeger-type hypergraph expansion property. These hypergraph results are not limited to the scope of the model in this paper, and can be helpful to researchers working on high-order inference problems.
- We propose a two-stage approach to solve the problem of high-order structured prediction. We formulate the label recovery problem as a high-order combinatorial optimization problem, and further relax it to a novel convex conic form optimization problem.
- We carefully analyze the Karush-Kuhn-Tucker (KKT) conditions of the conic form optimization problem, and derive sufficient statistical and topological conditions for exact inference. Our KKT analysis guarantees the solution to be optimal with high probability, as long as the conditions are fulfilled.

2. Preliminaries

In this section, we formally define the high-order exact inference problem and introduce the notation that will be used throughout the paper.

We use lowercase font (e.g., $a, b, u, v$) to denote scalars and vectors, and uppercase font (e.g., $A, B, C$) to denote tensors. We denote the set of real numbers by $\mathbb{R}$. For any natural number $n$, we use $[n]$ to denote the set $\{1, \dots, n\}$. We use $\mathbf{1}$ to denote the all-ones vector. For clarity we use superscripts $(i)$ to denote the $i$-th object in a sequence of objects, and subscripts $j$ to denote the $j$-th entry. We use $\circ$ to denote the Hadamard product, and $\otimes$ to denote the outer product. Let $v^{(1)}, \dots, v^{(m)} \in \mathbb{R}^n$ be a sequence of $m$ vectors of dimension $n$. Then $v^{(1)} \otimes \cdots \otimes v^{(m)}$ is a tensor of order $m$ and dimension $n$ (or equivalently, of shape $n^{\times m}$), such that $(v^{(1)} \otimes \cdots \otimes v^{(m)})_{i_1,\dots,i_m} = v^{(1)}_{i_1} \cdots v^{(m)}_{i_m}$.

2.1. Tensor Definitions

Let $A \in \mathbb{R}^{n^{\times m}}$ be an $m$-th order, $n$-dimensional real tensor. Throughout the paper, we limit our discussion to $m = 2, 6, 10, 14, \dots$ for clarity of exposition. While other even orders ($m = 4, 8, 12, \dots$) are possible and a similar analysis would follow, the hypergraph definitions would involve many more terms and the paper would be less readable. See Remark 3.18 for discussion.

A symmetric tensor is invariant under any permutation of the indices. In other words, $A$ is symmetric if for any permutation $\sigma : [m] \to [m]$, we have $A_{\sigma(i_1,\dots,i_m)} = A_{i_1,\dots,i_m}$. We define the inner product of two tensors $A, B$ of the same shape as $\langle A, B\rangle := \sum_{i_1,\dots,i_m=1}^{n} A_{i_1,\dots,i_m} B_{i_1,\dots,i_m}$. We define the tensor Frobenius norm as $\|A\|_F := \sqrt{\langle A, A\rangle}$.

A symmetric tensor $A$ is positive semidefinite (PSD) if for all $v \in \mathbb{R}^n$, we have $\langle A, v^{\otimes m}\rangle \ge 0$. We use $\mathcal{S}^{n,m}_{+}$ to denote the convex cone of all $m$-th order, $n$-dimensional PSD tensors. The dual cone of $\mathcal{S}^{n,m}_{+}$ is the Carathéodory tensor cone $\mathcal{S}^{*,n,m}_{+}$, which is defined as
$$\mathcal{S}^{*,n,m}_{+} := \Big\{ \sum_{i=1}^{\binom{m+n-1}{m}} v^{(i)\otimes m} \ \Big|\ v^{(i)} \in \mathbb{R}^n \Big\}.$$
In other words, every tensor in $\mathcal{S}^{*,n,m}_{+}$ is the summation of at most $\binom{m+n-1}{m}$ rank-one tensors. $\mathcal{S}^{n,m}_{+}$ and $\mathcal{S}^{*,n,m}_{+}$ are dual to each other (Ke & Honorio, 2022a).

For any tensor $A \in \mathbb{R}^{n^{\times m}}$, we define its minimum tensor eigenvalue $\lambda_{\min}(A)$ (or equivalently $\lambda_1(A)$) using a variational characterization, such that $\lambda_{\min}(A) := \min_{v\in\mathbb{R}^n,\ \|v\|=1} \langle A, v^{\otimes m}\rangle$. Similarly, we define its second minimum tensor eigenvalue $\lambda_2(A)$ as $\lambda_2(A) := \min_{v\in\mathbb{R}^n,\ \|v\|=1,\ v\perp v^{(1)}} \langle A, v^{\otimes m}\rangle$, where $v^{(1)}$ is the eigenvector corresponding to $\lambda_{\min}(A)$.
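The following is a small numerical sketch of the tensor objects above: the rank-one power $v^{\otimes m}$, the inner product $\langle A, B\rangle$, and a crude Monte Carlo approximation of the variational characterization of $\lambda_{\min}$. It is illustrative Python (NumPy) under our own naming; the paper does not prescribe an implementation, and random search is not a reliable eigenvalue solver for large $n$ or $m$.

```python
import numpy as np

def outer_power(v, m):
    # v^{outer m}: rank-one tensor with entries v_{i1} * ... * v_{im}
    t = v
    for _ in range(m - 1):
        t = np.multiply.outer(t, v)
    return t

def inner(a, b):
    # <A, B>: sum of entrywise products
    return float(np.sum(a * b))

def approx_lambda_min(a, m, n, trials=5000, seed=0):
    # Monte Carlo approximation of lambda_min(A) = min over unit v of <A, v^{outer m}>
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(trials):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        best = min(best, inner(a, outer_power(v, m)))
    return best
```

For $m = 2$ this reduces to the usual Rayleigh-quotient characterization of the smallest eigenvalue of a symmetric matrix, which is a convenient sanity check.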
We denote by $\sigma^{n,m}_2$ the index set of $m$-tuples of the shape $\sigma(i_1, i_1, i_2, i_2, \dots, i_{m/2}, i_{m/2})$, for any permutation $\sigma : [m]\to[m]$ and $i_j \in [n]$. Intuitively, in every tuple of $\sigma^{n,m}_2$, every index repeats an even number of times. We use $\bar\sigma^{n,m}_2$ to denote the set $\{(i_1,\dots,i_m) \mid (i_1,\dots,i_m) \notin \sigma^{n,m}_2,\ \text{and at least one index repeats twice}\}$. In other words, $\bar\sigma^{n,m}_2$ is the complement of $\sigma^{n,m}_2$, excluding the tuples with all-distinct indices.

2.2. High-order Inference Task

We consider the task of predicting a set of $n$ vertex labels $y^* = (y^*_1, \dots, y^*_n)$, where $y^*_i \in \{+1, -1\}$, from a set of observations $X$ and $z$. $X$ and $z$ are noisy observations generated from some underlying $m$-uniform hypergraph $G = (V, E)$. In particular, $V$ is the set of vertices (nodes) with $|V| = n$, and $E$ is the set of hyperedges. For every possible $m$-vertex tuple $e = (i_1, \dots, i_m)$, if $e \in E$, the hyperedge observation $X_{i_1,\dots,i_m}$ (and all corresponding symmetric entries $X_{\sigma(i_1,\dots,i_m)}$) is sampled to be $y^*_{i_1}\cdots y^*_{i_m}$ with probability $1 - p$, and $-y^*_{i_1}\cdots y^*_{i_m}$ with probability $p$, independently. If $e \notin E$, $X_{i_1,\dots,i_m}$ is set to 0. For every node $v_i \in V$, the node observation $z_i$ is sampled to be $y^*_i$ with probability $1 - q$, and $-y^*_i$ with probability $q$, independently. We now summarize the generative model.

Definition 2.1 (High-order Structured Prediction with Partial Observation).
Unknown: true node labeling vector $y^* = (y^*_1, \dots, y^*_n)$.
Observation: partial and noisy hyperedge observation tensor $X \in \{-1, 0, +1\}^{n^{\times m}}$, and noisy node label observation vector $z \in \{-1, +1\}^n$.
Task: infer and recover the correct node labeling vector $y^*$ from the observations $X$ and $z$.

3. Hypergraph Structural Properties and Cheeger-type Inequality

In this section, we introduce a series of novel Cheeger-type analyses for hypergraphs. This allows us to characterize the spectral gap of a hypergraph Laplacian via the topological hyperedge expansion of the graph itself. The hypergraph theorems in this section are general, and are not limited to the specific model covered in our inference task. To the best of our knowledge, the following high-order definitions and results are novel. All missing proofs of the lemmas and theorems can be found in Appendix A.

3.1. Hypergraph Topology

We first introduce the necessary hypergraph topological definitions.

Definition 3.1 (Induced Hypervertices). Given an $m$-uniform hypergraph $G = (V, E)$, we use $H := \{\{i_1, \dots, i_{m/2}\} \mid i_1, \dots, i_{m/2} \in [n]\}$ to denote the set of induced hypervertices. We denote its cardinality by $N := |H| = \binom{n}{m/2}$.

Definition 3.2 (Boundary of a Hypervertex Set). For any hypervertex set $S \subseteq H$, we denote its boundary set as
$$\partial S := \{h_1 \cup h_2 \mid h_1 \in S,\ h_2 \notin S,\ h_1 \cap h_2 = \emptyset,\ h_1 \cup h_2 \in E\}.$$
Note that $\partial S$ is a set of $m$-th order hyperedges.

Definition 3.3 (Hyperedge Expansion of a Hypervertex Set). For any hypervertex set $S \subseteq H$, we denote the hyperedge expansion of the set $S$ as $\phi_S := \frac{|\partial S|}{|S|}$.

Definition 3.4 (Hyperedge Expansion of a Hypergraph). Given an $m$-uniform hypergraph $G = (V, E)$ with induced hypervertices $H$, we denote the hyperedge expansion of the hypergraph $G$ as
$$\phi_G := \min_{S \subseteq H,\ 0 < |S| \le N/2} \phi_S = \min_{S \subseteq H,\ 0 < |S| \le N/2} \frac{|\partial S|}{|S|}.$$
We also call $\phi_G$ the Cheeger constant of the hypergraph.
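As a concrete illustration of Definitions 3.1 through 3.4, here is a brute-force Python sketch that enumerates the induced hypervertices and computes $\phi_G$ for a tiny hypergraph. The representation of hyperedges as frozensets and the function name are our own choices, and the enumeration over all subsets $S$ is exponential, so this is meant only for very small examples.

```python
from itertools import combinations

def hyperedge_expansion(n, m, edges):
    # edges: set of frozensets of size m (the hyperedges of G)
    H = [frozenset(c) for c in combinations(range(n), m // 2)]  # induced hypervertices
    N = len(H)
    best = float("inf")
    # enumerate all hypervertex sets S with 0 < |S| <= N/2
    for size in range(1, N // 2 + 1):
        for S_tuple in combinations(H, size):
            S = set(S_tuple)
            boundary = {h1 | h2
                        for h1 in S for h2 in H
                        if h2 not in S and not (h1 & h2) and (h1 | h2) in edges}
            best = min(best, len(boundary) / len(S))
    return best
```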
3.2. Hypergraph Laplacian

In this section, we introduce our hypergraph Laplacian related definitions.

Definition 3.5 ($\zeta$-function). $\zeta : \mathbb{R}^m \to \mathbb{R}$ is the function defined as
$$\zeta(v_1, \dots, v_m) := \sum_{I \subseteq [m],\ |I| = m/2} \Big(\sum_{i \in I} v_i - \sum_{i \notin I} v_i\Big)^m.$$

Definition 3.6 (Hypergraph Laplacian). Given an $m$-uniform hypergraph $G = (V, E)$, we use $L \in \mathbb{R}^{n^{\times m}}$ to denote its Laplacian tensor, which fulfills
$$\langle L, v^{\otimes m}\rangle = \frac{1}{m!\binom{m}{m/2}} \sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1}, \dots, v_{i_m}).$$

Remark 3.7. Definitions 3.5 and 3.6 are generalizations of graph Laplacians to the hypergraph case. To see this, note that for any pairwise graph ($m = 2$), its graph Laplacian fulfills the quadratic form property $v^\top L v = \frac{1}{2}\sum_{(i,j)\in E}(v_i - v_j)^2$. This is exactly equivalent to setting $m = 2$ in the definitions above.

Definition 3.8 (Rayleigh Quotient). For any hypergraph Laplacian $L$ and non-zero vector $v \in \mathbb{R}^n$, the Rayleigh quotient $R_L(v)$ is defined as
$$R_L(v) := \frac{\langle L, v^{\otimes m}\rangle}{\|v\|^m}.$$

Definition 3.9 (Signed Laplacian Tensor). Given an $m$-uniform hypergraph $G = (V, E)$ with a sign vector $y \in \{-1, +1\}^n$, we use $L_y \in \mathbb{R}^{n^{\times m}}$ to denote its signed Laplacian tensor, which fulfills
$$\langle L_y, v^{\otimes m}\rangle = \frac{1}{m!\binom{m}{m/2}} \sum_{(i_1,\dots,i_m)\in E} \zeta(y_{i_1}v_{i_1}, \dots, y_{i_m}v_{i_m}).$$

Remark 3.10. The hypergraph Laplacian tensor $L$ can be viewed as a signed Laplacian tensor $L_y$ with $y = \mathbf{1}$.

Here we provide some important properties of hypergraph Laplacians and Rayleigh quotients.

Lemma 3.11 (Laplacian Eigenpair). For any hypergraph Laplacian $L$, $\mathbf{1}$ is an eigenvector of $L$ with a minimum eigenvalue of 0. Similarly, for any signed hypergraph Laplacian $L_y$, $y$ is an eigenvector of $L_y$ with an eigenvalue of 0.

Proof of Lemma 3.11. This follows directly from the definition of the $\zeta$-function and Definitions 3.6 and 3.9.

Lemma 3.12 (Invariance under Scaling). For any non-zero $\alpha \in \mathbb{R}$, we have $R_L(v) = R_L(\alpha v)$.

Proof of Lemma 3.12.
$$R_L(\alpha v) = \frac{\frac{1}{m!\binom{m}{m/2}}\sum_{(i_1,\dots,i_m)\in E} \zeta(\alpha v_{i_1}, \dots, \alpha v_{i_m})}{\|\alpha v\|^m} = \frac{\frac{1}{m!\binom{m}{m/2}}\,\alpha^m\sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1}, \dots, v_{i_m})}{|\alpha|^m\,\|v\|^m} = \frac{\frac{1}{m!\binom{m}{m/2}}\sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1}, \dots, v_{i_m})}{\|v\|^m} = R_L(v),$$
since $m$ is even.

Lemma 3.13 (Shifting Inequality). For any $\delta \in \mathbb{R}$ and $v \in \mathbb{R}^n$ with $v \perp \mathbf{1}$, we have $R_L(v) \ge R_L(v + \delta\mathbf{1})$.

Proof of Lemma 3.13. Note that
$$R_L(v + \delta\mathbf{1}) = \frac{\frac{1}{m!\binom{m}{m/2}}\sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1}+\delta, \dots, v_{i_m}+\delta)}{\|v+\delta\mathbf{1}\|^m} = \frac{\frac{1}{m!\binom{m}{m/2}}\sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1}, \dots, v_{i_m})}{\|v+\delta\mathbf{1}\|^m} \le \frac{\frac{1}{m!\binom{m}{m/2}}\sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1}, \dots, v_{i_m})}{\|v\|^m} = R_L(v),$$
where the first equality holds because the $\delta$'s cancel inside $\zeta$ (each term has $m/2$ plus signs and $m/2$ minus signs), and the inequality follows from the fact that
$$\Big(\sum_i (v_i + \delta)^2\Big)^{m/2} = \Big(\sum_i v_i^2 + n\delta^2 + 2\delta\sum_i v_i\Big)^{m/2} = \Big(\sum_i v_i^2 + n\delta^2\Big)^{m/2} \ge \Big(\sum_i v_i^2\Big)^{m/2},$$
using $v \perp \mathbf{1}$.

Lemma 3.14 (Signed Rayleigh Quotient Lower Bound). For any hypergraph with Laplacian $L$ and signed Laplacian $L_y$, and for any $\delta \in \mathbb{R}$ and $v \in \mathbb{R}^n$ with $v \perp y$, we have $R_{L_y}(v) \ge R_L(v \circ y + \delta\mathbf{1})$.

Lemma 3.15 (Existence of Degree Tensor). For any hypergraph Laplacian $L$, let $A$ denote the corresponding symmetric adjacency tensor, such that $A_{i_1,\dots,i_m} = 1$ if and only if $(i_1, \dots, i_m) \in E$. Then there exists a high-order degree tensor $D$ fulfilling: 1) $D_{i_1,\dots,i_m} = 0$ if $i_1, \dots, i_m$ are all different, and 2) $\langle D - A, v^{\otimes m}\rangle = \langle L, v^{\otimes m}\rangle$ for every vector $v \in \mathbb{R}^n$.

3.3. High-order Cheeger-type Inequality

Here, we connect the previous two sections and provide our novel high-order Cheeger-type inequality.

Theorem 3.16 (High-order Cheeger-type Inequality). For any $m$-uniform hypergraph $G = (V, E)$ with Laplacian $L$, we have $\lambda_2(L) \ge \phi_G^m$.

Remark 3.17. Here we would like to discuss our choice of hypergraph Laplacian as well as the related high-order definitions. Although multiple works have tried to generalize graph properties (including the hypergraph Laplacian, Rayleigh quotient, hyperedge expansion, and Cheeger constant, among others) to hypergraphs (Li & Milenkovic, 2018; Mulas, 2021; Yoshida, 2019; Chan et al., 2018; Chen et al., 2017; Chang et al., 2020), we want to highlight that none of the above fulfill the requirements of our high-order inference task. To obtain a set of hypergraph definitions that are useful for our inference task, we need to fulfill all the lemmas in this section.
In particular, here are the necessary conditions:

- To fulfill Lemma 3.12, each term in the parenthesis in the $\zeta$-function (Definition 3.5) must have a coefficient of $1$ or $-1$.
- To fulfill Lemma 3.13, the $\zeta$-function (Definition 3.5) and the norm in the denominator of the Rayleigh quotient (Definition 3.8) must have the same power. Additionally, the summation of the coefficients of the terms in the parenthesis in the $\zeta$-function (Definition 3.5) must be 0. Combined with the previous condition, this requires the numbers of plus and minus terms to be equal.
- To fulfill Lemma 3.15, the number of minus terms in the parenthesis in the $\zeta$-function (Definition 3.5) must be odd, so that $\langle A, v^{\otimes m}\rangle$ can be canceled by $\langle L, v^{\otimes m}\rangle$ in those entries without repeating indices.
- To fulfill Lemma 3.11, the Rayleigh quotient must achieve its minimum at $v = \mathbf{1}$.

At this point it should be clear that we cannot simply use some arbitrary definition from the prior literature. Our hypergraph definitions are carefully constructed and chosen to fulfill all the necessary conditions above.

Remark 3.18. Regarding the choice of order $m$, note that if $m$ is odd, the task of label recovery is not feasible in the current formulation, since the Rayleigh quotient would be unbounded below (instead of having a minimum of 0). From Remark 3.17 it can be noticed that if we keep the current definition of the $\zeta$-function, the only possible $m$'s are $2, 6, 10, 14, \dots$ The definition of the $\zeta$-function can be generalized to other even orders, such that for every $m$-tuple $I$, we take the average of many $m$-power terms by iterating through the permutations of all possible signs. For clarity of exposition we keep the current simpler definition, and focus on the case of $m = 2, 6, 10, 14, \dots$

4. Exact Recovery of True Labels

In this section, we present an inference algorithm which recovers the underlying true labels in our model. To do so, we take a two-stage approach. In the first stage we only utilize the hyperedge information observed from $X$, and this allows us to narrow down our solution space to two possible solutions. In the second stage, with the help of the node information observed from $z$, we are able to infer the correct labeling of the nodes.

4.1. Stage One: Inference up to Permutation

We start by considering the following combinatorial problem:
$$\max_{y} \ \langle X, y^{\otimes m}\rangle \quad \text{subject to} \quad y \in \{-1, +1\}^n. \tag{3}$$
The issue with this formulation is that the problem is not convex, which makes the analysis hard and intractable. Instead, we consider the following relaxed version:
$$\begin{aligned}
\max_{Y}\ & \langle X, Y\rangle\\
\text{subject to}\ & Y \in \mathcal{S}^{*,n,m}_{+},\\
& -1 \le Y_{\text{odd}} \le 1, \quad \forall\, \text{odd} \in \bar\sigma^{n,m}_2,\\
& Y_{\text{even}} = 1, \quad \forall\, \text{even} \in \sigma^{n,m}_2.
\end{aligned} \tag{4}$$
Alternatively, we represent the last two constraints as $-1 \le Y_{\bar\sigma^{n,m}_2} \le 1$ and $Y_{\sigma^{n,m}_2} = 1$. Recall that $\mathcal{S}^{*,n,m}_{+}$ is the convex cone generated by the symmetric rank-one tensors. The motivation is that we use a tensor variable $Y$ in place of the outer product $y^{\otimes m}$, so that the problem becomes convex in both the objective function and the constraints. We have $-1 \le Y_{\bar\sigma^{n,m}_2} \le 1$ because the product of the $y$'s is either $-1$ or $+1$, and we have $Y_{\sigma^{n,m}_2} = 1$ because if every $y_i$ repeats an even number of times, we know the product must be $+1$.
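To make the constraint structure of (4) concrete, the sketch below classifies every index tuple into the equality set $\sigma^{n,m}_2$ (entries fixed to 1), the box set $\bar\sigma^{n,m}_2$ (entries in $[-1,1]$), and the remaining all-distinct entries, which carry the hyperedge observations. This is illustrative Python with our own helper names and 0-based indices; the paper does not specify such a routine, and the enumeration is only feasible for tiny $n$ and $m$.

```python
from collections import Counter
from itertools import product

def in_sigma(idx):
    # idx is in sigma^{n,m}_2 iff every index value occurs an even number of times
    return all(c % 2 == 0 for c in Counter(idx).values())

def in_sigma_bar(idx):
    # some index repeats, but the multiplicities are not all even
    counts = Counter(idx).values()
    return any(c >= 2 for c in counts) and not all(c % 2 == 0 for c in counts)

def constraint_sets(n, m):
    # Split all n^m index tuples of Y into equality entries (Y = 1),
    # box entries (-1 <= Y <= 1), and free entries (all indices distinct).
    eq, box, free = [], [], []
    for idx in product(range(n), repeat=m):
        if in_sigma(idx):
            eq.append(idx)
        elif in_sigma_bar(idx):
            box.append(idx)
        else:
            free.append(idx)
    return eq, box, free
```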
It remains to prove the correctness of program (4). We are interested in identifying the regime in which (4) returns the exact rank-one tensor solution $Y^* := y^{*\otimes m} = (-y^*)^{\otimes m}$. We now present our main theorem on inference. The proof can be found in Appendix B.

Theorem 4.1 (Inference from Hyperedge Observation). For an $m$-order structured prediction model with underlying hypergraph $G = (V, E)$ and hyperedge observation $X$, the rank-one tensor solution $Y^* := y^{*\otimes m} = (-y^*)^{\otimes m}$ can be recovered from the convex optimization program (4) with probability at least $1 - \epsilon_1(\phi_G, n, p)$, where
$$\epsilon_1(\phi_G, n, p) = 2n^m \exp\left(-\frac{(1-2p)^2\,\phi_G^{2m}}{8\, n^m \max(|E|,\, n^{m-1})}\right) + \frac{16\, p(1-p)\,|E|}{(1-2p)^2\,\phi_G^{2m}}. \tag{5}$$

Remark 4.2. A natural question to ask is under what topological and statistical conditions one can obtain a high probability guarantee from Theorem 4.1. An observation is that if the Cheeger constant of the underlying hypergraph is large, or the noise level is small, recovery is more likely to succeed. In Section 5, we analyze some interesting classes of hypergraphs with good expansion properties (large Cheeger constants), so that high probability recovery can be guaranteed.

4.2. Stage Two: Exact Inference

In the previous section, we established the high probability inference guarantee for the rank-one tensor solution $Y^* := y^{*\otimes m} = (-y^*)^{\otimes m}$. By taking a factorization step, we obtain a vector $\hat y$ such that either $\hat y$ or $-\hat y$ is the correct label vector. In this section, our goal is to decide the correct labeling using the node observation $z$.

Theorem 4.3. Let $\hat y \in \{y^*, -y^*\}$ be the vector obtained from the factorization step. The correct label vector $y^*$ can be recovered from the program
$$z^\top y^* = \max_{y \in \{\hat y, -\hat y\}} z^\top y \tag{6}$$
with probability at least $1 - \epsilon_2(n, q)$, where
$$\epsilon_2(n, q) = e^{-(1-2q)^2 n/2}. \tag{7}$$

Proof of Theorem 4.3. Applying Hoeffding's inequality, we obtain that
$$\mathbb{P}\{z^\top y^* \le z^\top(-y^*)\} = \mathbb{P}\{z^\top y^* \le 0\} \le e^{-(1-2q)^2 n/2}.$$

Corollary 4.4. Combining the results in Theorems 4.1 and 4.3, we obtain that exact inference of the correct label vector $y = y^*$ can be achieved with probability at least
$$1 - \epsilon_1(\phi_G, n, p) - \epsilon_2(n, q). \tag{8}$$

Proof of Corollary 4.4. Apply a union bound to Theorems 4.1 and 4.3.

Remark 4.5. Given $p, q < 0.5$, we observe that as $n \to \infty$, we have $\epsilon_2(n, q) \to 0$ (an exponential decay). As a result, whether one can obtain a high probability guarantee in the shape of $1 - O(n^{-1})$ depends only on the order of $\phi_G$, that is, on the topological structure of the underlying hypergraph. In Section 5, we investigate hypergraphs with good expansion properties, which allow us to achieve a high probability guarantee of exact inference.

5. Examples of Hypergraphs with Good Expansion Property

In this section, we consider some example classes of hypergraphs with good expansion properties, and demonstrate that they lead to high probability guarantees in our exact inference algorithm. We first analyze complete hypergraphs.

Definition 5.1 (Complete Hypergraphs). An $m$-uniform hypergraph $G = (V, E)$ is complete if for every $m$-vertex tuple $e = (i_1, \dots, i_m)$, we have $e \in E$.

Proposition 5.2 (Expansion Property of Complete Hypergraphs). For any complete hypergraph $G = (V, E)$, we have $\phi_G = \lceil N/2 \rceil$.

Proof. Recall that $H$ is the set of induced hypervertices. By Definition 3.3, for any $S \subseteq H$ with $0 < |S| \le N/2$, we have
$$\phi_S = \frac{|S|\,|H\setminus S|}{|S|} = |H\setminus S|.$$
Taking the minimum as in Definition 3.4, we obtain $\phi_G = \lceil N/2 \rceil$.

Corollary 5.3 (Exact Inference in Complete Hypergraphs). Let $G = (V, E)$ be a complete hypergraph. Assume $p, q < 0.5$. Then exact inference can be achieved using the proposed two-stage approach with probability tending to 1 for a sufficiently large $n$.

Proof. Note that for complete hypergraphs, we have
$$\phi_G = \lceil N/2 \rceil \ge \frac{2^{m/2-1}}{m^{m/2}}\, n^{m/2}.$$
Substituting $\phi_G$ into (8), as long as $n$ is greater than some constant $c_0$, we have $1 - \epsilon_1(\phi_G, n, p) - \epsilon_2(n, q) \ge 1 - O(n^{-1})$.
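As a quick numerical sanity check of the complete-hypergraph case on a toy instance, one can reuse the brute-force `hyperedge_expansion` sketch given after Definition 3.4 (our own illustrative helper, not part of the paper) with the smallest admissible order $m = 2$:

```python
from itertools import combinations

# complete 2-uniform graph on n = 6 nodes; hypervertices are the 6 singletons, so N = 6
edges = {frozenset(e) for e in combinations(range(6), 2)}
print(hyperedge_expansion(6, 2, edges))  # prints 3.0, matching ceil(N/2)
```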
Next, we focus on the case of regular expanders.

Definition 5.4 (d-regular Expander). An $m$-uniform hypergraph $G = (V, E)$ is a $d$-regular expander hypergraph with constant $c > 0$ if, for any induced hypervertex set $S \subseteq H$ with $0 < |S| \le N/2$, we have $|\partial S| \ge c\, d\, |S|$.

Proposition 5.5 (Expander Hypergraphs). For any $d$-regular expander hypergraph $G = (V, E)$ with constant $c > 0$, we have $\phi_G \ge c\, d$.

Proof. By Definition 3.3, for any $S \subseteq H$ with $0 < |S| \le N/2$, we have $\phi_S = \frac{|\partial S|}{|S|} \ge c\, d$. Taking the minimum as in Definition 3.4, we obtain $\phi_G \ge c\, d$.

Corollary 5.6 (Exact Inference in Expander Hypergraphs). Let $G = (V, E)$ be a $d$-regular expander hypergraph with constant $c > 0$. Assume $p, q < 0.5$. Then exact inference can be achieved using the proposed two-stage approach with probability tending to 1 for a sufficiently large $n$, if $d \in \Omega\big(n (\log n)^{1/(2m)}\big)$.

Proof. First note that the exponential term in (5) is the dominating factor. Substituting $\phi_G \ge c\,d$ into the first term of (5), we want to ensure
$$2n^m \exp\left(-\frac{(1-2p)^2 (c\,d)^{2m}}{8\, n^m \max(|E|,\, n^{m-1})}\right) \le c_0\, n^{-1}$$
for some constant $c_0 > 0$. Note that $|E| \in O(n^m)$. As a result, a sufficient condition for $d$ is $d \in \Omega\big(n (\log n)^{1/(2m)}\big)$. Substituting $d$ into the other term also fulfills the $O(n^{-1})$ bound.

5.1. Simulation Results

We test the proposed method on a hypergraph of order $m = 6$; see Figure 2. We fix the number of nodes $n = 10$. We focus on Stage One, and check how many labels can be recovered up to permutation of the signs. We implement a tensor projected gradient descent solver motivated by Ke & Honorio (2022a) and Han (2013). For each setting we run 20 trials. Our results suggest that if the noise level $p$ is small and the hypergraph Cheeger constant $\phi_G$ is large, the proposed algorithm performs well and recovers the underlying group structure. This matches our theoretical findings in Theorem 4.1.

Figure 2. Simulations with different noise levels $p$ (top) and Cheeger constants $\phi_G$ (bottom). The results suggest that if the noise level $p$ is small and the hypergraph Cheeger constant $\phi_G$ is large, the proposed algorithm performs well and recovers the underlying group structure.

Additionally, we implement a local-search based hypergraph label propagation algorithm as a relevant baseline approximation method (Raghavan et al., 2007). We use the same setting with $n = 10$ and $m = 6$. To initialize the propagation, we make the labels of five nodes visible to the algorithm. The algorithm proceeds as follows: for every unvisited node $i$, the algorithm checks its interaction with all known nodes repetitively to obtain enough samples (we set the number of repetitions to 1001 in our script). For example, suppose $i_1$ through $i_k$ are visited nodes; then the algorithm repetitively checks $X(i, i_1, i_2, i_3, i_4, i_5), X(i, i_1, i_2, i_3, i_4, i_6), \dots$ in a stochastic fashion to collect evidence. After that, if the sum of these samples is positive, the label of $i$ is assigned to be $+1$, and otherwise $-1$. The algorithm then marks the node as visited and continues until all nodes are visited. We ran the experiment with $p = 0.1$ for 20 trials. The average accuracy of the label propagation algorithm above is 0.75 in total, and 0.5 on the initially unknown parts. In particular, perfect recovery is achieved in only 1 out of 20 trials. This is much worse than our results (perfect recovery in 80% of trials).
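The following is a minimal Python sketch of our reading of this label-propagation baseline. The exact sampling scheme, the tie-breaking, and the use of the already-assigned context signs when voting are our assumptions for illustration; this is not the authors' script.

```python
import numpy as np

def label_propagation_baseline(X, known, m, reps=1001, seed=0):
    # X: n^m observation tensor; known: dict node -> label in {-1, +1}
    # assumes at least m - 1 seed labels are available
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = dict(known)
    for i in range(n):
        if i in labels:
            continue
        votes = 0.0
        for _ in range(reps):
            # pick m-1 already-labeled nodes and read the corresponding hyperedge entry
            ctx = rng.choice(list(labels), size=m - 1, replace=False)
            sign_ctx = np.prod([labels[int(j)] for j in ctx])
            votes += X[tuple([i, *ctx])] * sign_ctx
        labels[i] = 1 if votes > 0 else -1  # mark node i as visited and continue
    return labels
```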
5.2. Simulation on Real-world Datasets

We run the first stage of the proposed algorithm on a fourth-order hypergraph generated from the real-world dataset email-Eu-core (Yin et al., 2017; Leskovec et al., 2007). We pick the largest two communities in the network. The generated hypergraph has 200 nodes in total (91 and 109 in each group). We run our algorithm with a tensor projected gradient descent solver on the hypergraph, and exact inference of the 200 labels was achieved perfectly, up to permutation of the two groups.

6. Concluding Remarks

We analyzed the problem of high-order label recovery in this paper. To achieve this, we provided a novel class of hypergraph structural properties, and derived a high-order Cheeger-type inequality. We also proposed a two-stage algorithm to solve the high-order structured prediction problem, and provided a KKT analysis for its optimality. We gave the first results with theoretical guarantees of inference in hypergraphs.

For some context, (1) is the structured prediction inference problem in the graph case. This inference setting has been studied by Globerson et al. (2015), Foster et al. (2018), and Bello & Honorio (2019), among others. These works focused on binary labels and studied various classes of graphs, including grid lattices, graphs allowing tree decompositions, and expander graphs. (2) is the structured prediction inference problem generalized to the hypergraph case; if we set $m = 2$, (2) reduces to (1). We focused on hypergraphs with binary labels, and studied a class of hypergraphs with a good expansion property, or hyper-expanders.

A limitation in our setting is that the observation of each hyperedge is binary: either $+1$ or $-1$. As a future direction, it would be interesting to consider more general cases in which the observations are weighted (that is, real values instead of $\pm 1$'s).

References

Bello, K. and Honorio, J. Exact inference in structured prediction. Advances in Neural Information Processing Systems, 32, 2019.

Boykov, Y. and Veksler, O. Graph cuts in vision and graphics: Theories and applications. In Handbook of Mathematical Models in Computer Vision, pp. 79-96. Springer, 2006.

Chan, T.-H. H., Louis, A., Tang, Z. G., and Zhang, C. Spectral properties of hypergraph Laplacian and approximation algorithms. Journal of the ACM (JACM), 65(3):1-48, 2018.

Chandrasekaran, V., Srebro, N., and Harsha, P. Complexity of inference in graphical models. arXiv preprint arXiv:1206.3240, 2012.

Chang, J., Chen, Y., Qi, L., and Yan, H. Hypergraph clustering using a new Laplacian tensor with applications in image processing. SIAM Journal on Imaging Sciences, 13(3):1157-1178, 2020.

Chen, Y., Qi, L., and Zhang, X. The Fiedler vector of a Laplacian tensor for hypergraph partitioning. SIAM Journal on Scientific Computing, 39(6):A2508-A2537, 2017.

Fix, A., Gruber, A., Boros, E., and Zabih, R. A hypergraph-based reduction for higher-order binary Markov random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1387-1395, 2014.

Foster, D., Sridharan, K., and Reichman, D. Inference in sparse graphs with pairwise measurements and side information. In International Conference on Artificial Intelligence and Statistics, pp. 1810-1818. PMLR, 2018.

Gallagher, A. C., Batra, D., and Parikh, D. Inference for order reduction in Markov random fields. In CVPR 2011, pp. 1857-1864. IEEE, 2011.

Globerson, A., Roughgarden, T., Sontag, D., and Yildirim, C. How hard is inference for structured prediction? In International Conference on Machine Learning, pp. 2181-2190. PMLR, 2015.

Han, L. An unconstrained optimization approach for finding real eigenvalues of even order symmetric tensors. Numerical Algebra, Control and Optimization, 3(3):583-599, 2013.

Ke, C. and Honorio, J.
Information-theoretic limits for community detection in network models. Advances in Neural Information Processing Systems, 31, 2018.

Ke, C. and Honorio, J. Exact inference with latent variables in an arbitrary domain. arXiv preprint arXiv:1902.03099, 2019.

Ke, C. and Honorio, J. Exact partitioning of high-order models with a novel convex tensor cone relaxation. Journal of Machine Learning Research, 23(284):1-28, 2022a.

Ke, C. and Honorio, J. Federated myopic community detection with one-shot communication. In International Conference on Artificial Intelligence and Statistics, pp. 4937-4954. PMLR, 2022b.

Kelley, S., Goldberg, M., Magdon-Ismail, M., Mertsalov, K., and Wallace, A. Defining and discovering communities in social networks. In Handbook of Optimization in Complex Networks, pp. 139-168. Springer, 2012.

Komodakis, N. and Paragios, N. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2985-2992. IEEE, 2009.

Leskovec, J., Kleinberg, J., and Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2-es, 2007.

Li, P. and Milenkovic, O. Submodular hypergraphs: p-Laplacians, Cheeger inequalities and spectral clustering. In International Conference on Machine Learning, pp. 3014-3023. PMLR, 2018.

Liu, X., Bollen, J., Nelson, M. L., and Van de Sompel, H. Co-authorship networks in the digital library research community. Information Processing & Management, 41(6):1462-1480, 2005.

Mulas, R. A Cheeger cut for uniform hypergraphs. Graphs and Combinatorics, 37(6):2265-2286, 2021.

Nowozin, S., Lampert, C. H., et al. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4):185-365, 2011.

Nugent, R. and Meila, M. An overview of clustering applied to molecular biology. Statistical Methods in Molecular Biology, pp. 369-404, 2010.

Raghavan, U. N., Albert, R., and Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

Schraudolph, N. and Kamenetsky, D. Efficient exact inference in planar Ising models. Advances in Neural Information Processing Systems, 21, 2008.

Weiss, D. and Taskar, B. Structured prediction cascades. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 916-923. JMLR Workshop and Conference Proceedings, 2010.

Yin, H., Benson, A. R., Leskovec, J., and Gleich, D. F. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555-564, 2017.

Yoshida, Y. Cheeger inequalities for submodular transformations. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2582-2601. SIAM, 2019.

A. Proofs of Hypergraph Structural Properties and Cheeger-type Inequality

For clarity of presentation, in the following proofs we use
$$RN_L(v) := \sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1},\dots,v_{i_m})$$
to denote the numerator part of the Rayleigh quotient of the Laplacian $L$ (up to the constant normalization factor), and $RD_L(v) := \|v\|^m$ to denote the denominator part.

Proof of Lemma 3.14. Regarding the numerator, we have
$$\begin{aligned}
RN_L(v\circ y + \delta\mathbf{1}) &= \sum_{(i_1,\dots,i_m)\in E}\Big[\big(v_{i_1}y_{i_1}+\delta+\cdots+v_{i_{m/2}}y_{i_{m/2}}+\delta - v_{i_{m/2+1}}y_{i_{m/2+1}}-\delta-\cdots-v_{i_m}y_{i_m}-\delta\big)^m + \cdots\Big]\\
&= \sum_{(i_1,\dots,i_m)\in E}\Big[\big(v_{i_1}y_{i_1}+\cdots+v_{i_{m/2}}y_{i_{m/2}} - v_{i_{m/2+1}}y_{i_{m/2+1}}-\cdots-v_{i_m}y_{i_m}\big)^m + \cdots\Big] = RN_{L_y}(v).
\end{aligned}$$
Regarding the denominator, we have
$$RD_L(v\circ y + \delta\mathbf{1}) = \|v\circ y + \delta\mathbf{1}\|^m = \Big(\sum_i (v_i y_i + \delta)^2\Big)^{m/2} = \Big(\sum_i v_i^2 y_i^2 + n\delta^2 + 2\delta\sum_i v_i y_i\Big)^{m/2} = \Big(\sum_i v_i^2 + n\delta^2\Big)^{m/2} \ge \Big(\sum_i v_i^2\Big)^{m/2} = RD_{L_y}(v),$$
using $y_i^2 = 1$ and $v \perp y$. Combining the two parts proves the claim.

Proof of Lemma 3.15. Our goal is to construct a tensor $D$ fulfilling $\langle D - A, v^{\otimes m}\rangle = \langle L, v^{\otimes m}\rangle$ for any vector $v \in \mathbb{R}^n$. In other words, we require
$$\sum_{i_1,\dots,i_m} D_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m} - \sum_{i_1,\dots,i_m} A_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m} = \sum_{i_1,\dots,i_m} L_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m}.$$
A key observation is that on the left-hand side, $D$ and $A$ control different entries: $D_{i_1,\dots,i_m}$ is equal to 0 on entries without repeating indices, while $A_{i_1,\dots,i_m}$ is equal to 0 on entries with repeating indices. Recall that the union $\sigma^{n,m}_2 \cup \bar\sigma^{n,m}_2$ contains exactly the entry indices with repeating indices. We can rewrite the equation above as
$$\sum_{(i_1,\dots,i_m)\in \sigma^{n,m}_2 \cup \bar\sigma^{n,m}_2} D_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m} \;-\; \sum_{(i_1,\dots,i_m)\notin \sigma^{n,m}_2 \cup \bar\sigma^{n,m}_2} A_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m} \;=\; \sum_{i_1,\dots,i_m} L_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m}.$$
Recall that
$$\langle L, v^{\otimes m}\rangle = \frac{1}{m!\binom{m}{m/2}} \sum_{(i_1,\dots,i_m)\in E} \zeta(v_{i_1},\dots,v_{i_m}) = \frac{1}{m!\binom{m}{m/2}} \sum_{(i_1,\dots,i_m)\in E} \Big[\big(v_{i_1}+v_{i_2}+\cdots+v_{i_{m/2}} - v_{i_{m/2+1}}-\cdots-v_{i_m}\big)^m + \cdots\Big]. \tag{9}$$
The next observation is that, by expanding the $m$-power terms in the bracket, each summand $(i_1,\dots,i_m)$ yields a collection of monomials in $v_{i_1}$ through $v_{i_m}$. For example, this includes $C_0\, v_{i_1}v_{i_2}\cdots v_{i_{m-1}}v_{i_m}$, $C_1\, v_{i_1}^2 v_{i_2}\cdots v_{i_{m-1}}$, etc., where the $C$'s are coefficients. Note that all these monomials have order $m$.

We now analyze the monomial terms above by grouping the monomials with the same power pattern. Formally, we group all terms of the shape $C\, v_{i_1}^{d_1} v_{i_2}^{d_2}\cdots v_{i_q}^{d_q}$ together, where $d_1 \ge d_2 \ge \cdots \ge d_q \ge 1$ and $d_1 + \cdots + d_q = m$. All permutations of subscripts are enumerated within a group, and we use $Q$ to denote the number of terms inside the group. Here are some examples of monomial groups:

- $C\,(v_{i_1}^m + v_{i_2}^m + \cdots + v_{i_m}^m)$, with pattern $d_1 = m$. In this case $Q = m$.
- $m = 6$: $C\,(v_{i_1}^5 v_{i_2}^1 + v_{i_1}^5 v_{i_3}^1 + \cdots + v_{i_6}^5 v_{i_5}^1)$, with pattern $d_1 = 5$, $d_2 = 1$. In this case $Q = 6\cdot 5 = 30$.
- $m = 6$: $C\,(v_{i_1}^3 v_{i_2}^2 v_{i_3}^1 + v_{i_1}^3 v_{i_2}^2 v_{i_4}^1 + \cdots + v_{i_6}^3 v_{i_5}^2 v_{i_4}^1)$, with pattern $d_1 = 3$, $d_2 = 2$, $d_3 = 1$. In this case $Q = 6\cdot 5\cdot 4 = 120$.

Next we discuss the power pattern in each group. If the power pattern is $d_1 = \cdots = d_m = 1$ (i.e., all 1-power components), counting shows that, thanks to the normalization $\frac{1}{m!\binom{m}{m/2}}$, the coefficient $C$ has magnitude 1. Thus, these terms with all 1-power components are already balanced by $\sum_{(i_1,\dots,i_m)\notin \sigma^{n,m}_2\cup\bar\sigma^{n,m}_2} A_{i_1,\dots,i_m} v_{i_1}\cdots v_{i_m}$ on the left-hand side. This also shows the necessity of introducing the factor $\frac{1}{m!\binom{m}{m/2}}$ in the definition of the hypergraph Laplacian: it ensures that the term $v_{i_1}v_{i_2}\cdots v_{i_m}$ in the expanded form of $\langle L, v^{\otimes m}\rangle$ has a unit-magnitude coefficient.

For groups with other power patterns ($d_1 \ge \cdots \ge d_q \ge 1$, $q < m$), we balance them by setting the $D$ entries with indices in the permutations of $(i_1,\dots,i_1,\, i_2,\dots,i_2,\, \dots,\, i_q,\dots,i_q)$, in which $i_1$ repeats $d_1$ times, $i_2$ repeats $d_2$ times, and so on. We set the value of these $D$ entries to
$$CQ \sum_{i_{q+1},\dots,i_m} A_{i_1,\dots,i_m} = CQ \sum_{i_{q+1},\dots,i_m} \mathbf{1}[(i_1,\dots,i_m)\in E].$$
In other words, for $D$ entries containing indices $i_1$ through $i_q$, we set them according to the $A$ entries containing the indices $\{i_1,\dots,i_m\}\setminus\{i_1,\dots,i_q\}$. This can be illustrated using the same examples as above:

- Set $D_{i_1,\dots,i_1} = CQ\sum_{i_2,\dots,i_m} A_{i_1,\dots,i_m}$.
- Set $D_{i_1,i_1,i_1,i_1,i_1,i_2} = CQ\sum_{i_3,i_4,i_5,i_6} A_{i_1,\dots,i_m}$, and the same for all symmetric entries in $\sigma(i_1,i_1,i_1,i_1,i_1,i_2)$.
- Set $D_{i_1,i_1,i_1,i_2,i_2,i_3} = CQ\sum_{i_4,i_5,i_6} A_{i_1,\dots,i_m}$, and the same for all symmetric entries in $\sigma(i_1,i_1,i_1,i_2,i_2,i_3)$.
After the procedure is done, we have $\langle L, v^{\otimes m}\rangle = \langle D - A, v^{\otimes m}\rangle$, and $D_{i_1,i_2,\dots,i_m} = 0$ if $i_1, i_2, \dots, i_m$ are all different.

Proof of Theorem 3.16. Suppose $v$ is the eigenvector associated with the second smallest eigenvalue $\lambda_2(L)$. By Lemma 3.12, we assume $\|v\| = 1$ without loss of generality. We also sort the entries of $v$, i.e., $v_1 \le \cdots \le v_n$.

Our first step is to set up a helper vector $u$ based on the second smallest eigenvector $v$. Recall that $H$ is the set of induced hypervertices (see Definition 3.1). For any hypervertex $h_j = \{i_1,\dots,i_{m/2}\} \in H$, based on the second eigenvector $v$, we define the $v$-value of $h_j$ by
$$v(h_j) := \frac{1}{m/2}\big(v_{i_1} + \cdots + v_{i_{m/2}}\big).$$
This allows us to sort all the hypervertices by their $v$-values, such that the hypervertex indices fulfill $v(h_1) \le v(h_2) \le \cdots \le v(h_N)$, and recall that $N := |H|$. We use $M$ to denote the smallest integer fulfilling $M \ge \frac{1}{2}|H|$, and we introduce a shifting operation by defining $u := v - v(h_M)\mathbf{1}$. We further find a constant $c_u > 0$ and scale $u$ by multiplying $u$ with $c_u$, so that it fulfills $(u_1 + \cdots + u_{m/2})^2 + (u_{n-m/2+1} + \cdots + u_n)^2 = 1$. Note that $u$ fulfills the following properties:

- $(u_1 + \cdots + u_{m/2})^2 + (u_{n-m/2+1} + \cdots + u_n)^2 = 1$;
- $R_L(v) \ge R_L(u)$, by Lemmas 3.12 and 3.13.

The second step of our proof is to construct a random set of hypervertices $S_t$. To do so, we define $t$ to be a random variable on the support $[u_1 + \cdots + u_{m/2},\ u_{n-m/2+1} + \cdots + u_n]$, with probability density function $f(t) = 2|t|$. It can be verified that $t$ is a valid random variable, because
$$\int_{u_1+\cdots+u_{m/2}}^{u_{n-m/2+1}+\cdots+u_n} 2|t|\, dt = (u_1 + \cdots + u_{m/2})^2 + (u_{n-m/2+1} + \cdots + u_n)^2 = 1.$$
Based on $t$, we construct the random set of hypervertices $S_t$ as follows:
$$S_t := \{h_j \mid h_j = \{i_1,\dots,i_{m/2}\},\ u_{i_1} + \cdots + u_{i_{m/2}} \le t\}.$$
Here we consider the size of $S_t$ in the average case. Define the shorthand notation $u_h := \sum_{i\in h} u_i$. We have
$$\mathbb{E}_t[|S_t|] = \sum_j \mathbb{P}_t\{u_{h_j} \le t\},\qquad \mathbb{E}_t[|H\setminus S_t|] = \sum_j \mathbb{P}_t\{u_{h_j} > t\}.$$
Combining the two leads to
$$\mathbb{E}_t[\min(|S_t|, |H\setminus S_t|)] = \sum_j \mathbb{P}_t\{u_{h_j} \le t < 0\} + \sum_j \mathbb{P}_t\{u_{h_j} > t \ge 0\}.$$
We also consider the boundary set $\partial S_t$. By Definition 3.2, a hyperedge $e = h_1\cup h_2$ belongs to $\partial S_t$ if $h_1\in S_t$, $h_2\notin S_t$, and $h_1\cap h_2 = \emptyset$. Assume $u_{h_1} \le u_{h_2}$. Then
$$\mathbb{P}\{e \in \partial S_t\} = \mathbb{P}\{u_{h_1} \le t < u_{h_2}\} \le |u_{h_1} - u_{h_2}|\,(|u_{h_1}| + |u_{h_2}|).$$
Finally, we analyze the expectation of $|\partial S_t|$. It follows that
$$\begin{aligned}
\mathbb{E}_t[|\partial S_t|] &= \sum_{e\in E,\, e=h_1\cup h_2} \mathbb{E}_t[\mathbf{1}[e\in\partial S_t]] = \sum_{e\in E,\, e=h_1\cup h_2} \mathbb{P}_t\{e\in\partial S_t\}\\
&\le \sum_{e\in E,\, e=h_1\cup h_2} |u_{h_1} - u_{h_2}|\,(|u_{h_1}| + |u_{h_2}|)\\
&\overset{(a)}{\le} \Big(\sum_{e\in E,\, e=h_1\cup h_2} (u_{h_1} - u_{h_2})^m\Big)^{\frac{1}{m}} \Big(\sum_{e\in E,\, e=h_1\cup h_2} (|u_{h_1}| + |u_{h_2}|)^{\frac{m}{m-1}}\Big)^{\frac{m-1}{m}}\\
&\le R_L(u)^{\frac{1}{m}}\, \|u\|\, \Big(\sum_{e\in E,\, e=h_1\cup h_2} (|u_{h_1}| + |u_{h_2}|)^{\frac{m}{m-1}}\Big)^{\frac{m-1}{m}}\\
&\le R_L(u)^{\frac{1}{m}}\, \|u\|\, \Big(\sum_{e\in E,\, e=h_1\cup h_2} (|u_{h_1}| + |u_{h_2}|)^{2}\Big)^{\frac{1}{2}}\\
&\le R_L(u)^{\frac{1}{m}}\, \mathbb{E}_t[\min(|S_t|, |H\setminus S_t|)],
\end{aligned}$$
where (a) follows from Hölder's inequality. Rearranging the terms above leads to
$$\mathbb{E}_t\Big[R_L(u)^{\frac{1}{m}}\min(|S_t|, |H\setminus S_t|) - |\partial S_t|\Big] \ge 0.$$
Thus there exists some $t$ fulfilling
$$R_L(u)^{\frac{1}{m}}\min(|S_t|, |H\setminus S_t|) - |\partial S_t| \ge 0,$$
or equivalently,
$$R_L(u) \ge \Big(\frac{|\partial S_t|}{\min(|S_t|, |H\setminus S_t|)}\Big)^m.$$
Plugging in $R_L(v) \ge R_L(u)$ on the left-hand side, and Definition 3.4 on the right-hand side, we obtain
$$\lambda_2(L) = R_L(v) \ge R_L(u) \ge \Big(\frac{|\partial S_t|}{\min(|S_t|, |H\setminus S_t|)}\Big)^m \ge \phi_G^m.$$

B. Proof of Exact Recovery of True Labels

Proof of Theorem 4.1. Our goal is to recover the true labels $Y^* := (\pm y^*)^{\otimes m}$ (up to flipping signs) using (4). For the readers' convenience, here we restate the formulation:
$$\begin{aligned}
\max_{Y}\ & \langle X, Y\rangle\\
\text{subject to}\ & Y \in \mathcal{S}^{*,n,m}_{+},\quad -1 \le Y_{\bar\sigma^{n,m}_2} \le 1,\quad Y_{\sigma^{n,m}_2} = 1.
\end{aligned}$$
The Lagrangian dual problem of (4) is
$$\begin{aligned}
\min_{V,\, V^+,\, V^-,\, A}\ & \sum_{(i_1,\dots,i_m)\in\sigma^{n,m}_2} V_{i_1,\dots,i_m} + \sum_{(i_1,\dots,i_m)\in\bar\sigma^{n,m}_2} \big(V^+_{i_1,\dots,i_m} + V^-_{i_1,\dots,i_m}\big)\\
\text{subject to}\ & V_{i_1,\dots,i_m} = 0, \quad \text{if } i_1,\dots,i_m \text{ are all different},\\
& V_{\bar\sigma^{n,m}_2} = V^+_{\bar\sigma^{n,m}_2} - V^-_{\bar\sigma^{n,m}_2},\qquad V^+_{\bar\sigma^{n,m}_2} \ge 0,\quad V^-_{\bar\sigma^{n,m}_2} \ge 0,\\
& \sum_{(i_1,\dots,i_m)\in\sigma^{n,m}_2} V_{i_1,\dots,i_m}\, E_{i_1,\dots,i_m} + \sum_{(i_1,\dots,i_m)\in\bar\sigma^{n,m}_2} \big(V^+_{i_1,\dots,i_m} - V^-_{i_1,\dots,i_m}\big)\, E_{i_1,\dots,i_m} - X - A = 0,\qquad A \in \mathcal{S}^{n,m}_{+},
\end{aligned}$$
where $E_{i_1,\dots,i_m}$ is a tensor with 1 in entry $(i_1,\dots,i_m)$ and 0 everywhere else. From the primal and dual problems, we obtain the following KKT conditions:
$$\begin{aligned}
& V - X - A = 0 && \text{(Stationarity)}\\
& Y_{\sigma^{n,m}_2} = 1,\quad -1 \le Y_{\bar\sigma^{n,m}_2} \le 1,\quad Y \in \mathcal{S}^{*,n,m}_{+} && \text{(Primal Feasibility)}\\
& V_{i_1,\dots,i_m} = 0 \text{ if } i_1,\dots,i_m \text{ all different},\quad V_{\bar\sigma^{n,m}_2} = V^+_{\bar\sigma^{n,m}_2} - V^-_{\bar\sigma^{n,m}_2},\quad V^+_{\bar\sigma^{n,m}_2} \ge 0,\ V^-_{\bar\sigma^{n,m}_2} \ge 0,\quad A \in \mathcal{S}^{n,m}_{+} && \text{(Dual Feasibility)}\\
& V^+_{\bar\sigma^{n,m}_2} \circ (Y_{\bar\sigma^{n,m}_2} - 1) = 0,\quad V^-_{\bar\sigma^{n,m}_2} \circ (Y_{\bar\sigma^{n,m}_2} + 1) = 0,\quad \langle A, Y\rangle = 0. && \text{(Complementary Slackness)}
\end{aligned}$$

Here we construct primal and dual variables to fulfill all the KKT conditions above. For the primal variable we set $Y = Y^*$. For the dual variable we define $V = D_{y^*}$, where $D_{y^*}$ is the high-order degree tensor constructed from the signed Laplacian $L_{y^*}$ with the procedure defined in Lemma 3.15. We then define $V_{\sigma^{n,m}_2} = (D_{y^*})_{\sigma^{n,m}_2}$, $V^+_{\bar\sigma^{n,m}_2} = \max\big((D_{y^*})_{\bar\sigma^{n,m}_2}, 0\big)$, and $V^-_{\bar\sigma^{n,m}_2} = -\min\big((D_{y^*})_{\bar\sigma^{n,m}_2}, 0\big)$. From the stationarity condition, we set $A = L_{y^*} = D_{y^*} - X$.

At this point, our construction of primal and dual variables fulfills every KKT condition above except the positive semidefinite condition $A = L_{y^*} \in \mathcal{S}^{n,m}_{+}$. From Lemma 3.11, we know $y^*$ is an eigenvector of $L_{y^*}$ with an eigenvalue of 0, or equivalently, $\langle L_{y^*}, Y^*\rangle = 0$. It remains to ensure that for all orthogonal vectors we have $\min_{u\perp y^*} R_{L_{y^*}}(u) \ge 0$. On top of that, we want to ensure the solution $Y = Y^*$ is unique. This further requires that
$$\min_{u\perp y^*} R_{L_{y^*}}(u) = \min_{u\perp y^*} \frac{\langle D_{y^*} - X,\, u^{\otimes m}\rangle}{\|u\|^m} > 0.$$
Without loss of generality, we fix $\|u\| = 1$ in the following discussion. We split the term above into
$$\min_{u\perp y^*,\, \|u\|=1} \langle D_{y^*} - X,\, u^{\otimes m}\rangle \ \ge\ \min_{u\perp y^*,\, \|u\|=1} \langle D_{y^*} - \mathbb{E}[D_{y^*}],\, u^{\otimes m}\rangle \tag{10}$$
$$+\ \min_{u\perp y^*,\, \|u\|=1} \langle \mathbb{E}[X] - X,\, u^{\otimes m}\rangle \tag{11}$$
$$+\ \min_{u\perp y^*,\, \|u\|=1} \langle \mathbb{E}[D_{y^*} - X],\, u^{\otimes m}\rangle. \tag{12}$$

First we bound the expectation term (12). Note that
$$\begin{aligned}
\min_{u\perp y^*,\, \|u\|=1} \langle \mathbb{E}[D_{y^*} - X],\, u^{\otimes m}\rangle &= \min_{u\perp y^*,\, \|u\|=1} \langle \mathbb{E}[L_{y^*}],\, u^{\otimes m}\rangle\\
&= (1-2p) \min_{u\perp y^*,\, \|u\|=1} \frac{1}{m!\binom{m}{m/2}}\sum_{(i_1,\dots,i_m)\in E} \zeta(u_{i_1}y^*_{i_1},\dots,u_{i_m}y^*_{i_m})\\
&= (1-2p)\, R_{L_{y^*}}(u) \ \ge\ (1-2p)\, R_L(u\circ y^* + \delta\mathbf{1}) \ \ge\ (1-2p)\, \phi_G^m,
\end{aligned}$$
where the first inequality follows from Lemma 3.14, and the second inequality follows from Theorem 3.16.

Next we bound (11) using concentration inequalities. We have
$$\mathbb{P}\{\lambda_{\max}(\mathbb{E}[X] - X) \ge t\} \le \mathbb{P}\{\|\mathbb{E}[X] - X\|_F^2 \ge t^2\} \le \frac{\mathbb{E}\big[\|\mathbb{E}[X] - X\|_F^2\big]}{t^2} \le \frac{4p(1-p)\,|E|}{t^2}.$$
Setting $t = \frac{1-2p}{2}\phi_G^m$ leads to
$$\mathbb{P}\Big\{\lambda_{\max}(\mathbb{E}[X] - X) \ge \frac{1-2p}{2}\phi_G^m\Big\} \le \frac{16\,p(1-p)\,|E|}{(1-2p)^2\,\phi_G^{2m}}. \tag{13}$$

Finally we bound (10). Using the Cauchy-Schwarz inequality, we obtain
$$\big|\langle \mathbb{E}[D_{y^*}] - D_{y^*},\, u^{\otimes m}\rangle\big| \le \big\|\mathrm{vec}(\mathbb{E}[D_{y^*}] - D_{y^*})\big\|\,\big\|\mathrm{vec}(u^{\otimes m})\big\| = \big\|\mathrm{vec}(\mathbb{E}[D_{y^*}] - D_{y^*})\big\|\,\|u\|^m = \big\|\mathrm{vec}(\mathbb{E}[D_{y^*}] - D_{y^*})\big\| \le n^{m/2}\,\big\|\mathrm{vec}(\mathbb{E}[D_{y^*}] - D_{y^*})\big\|_\infty,$$
where $\mathrm{vec}(\cdot)$ is the vectorization operator and we used $\|u\| = 1$. We then use Hoeffding's inequality for every entry. By the construction procedure defined in the proof of Lemma 3.15, every entry of $D_{y^*}$ is the summation of at most $\max(|E|,\, n^{m-1})$ Rademacher random variables. We obtain
$$\mathbb{P}\big\{\big|\mathbb{E}[D_{y^*, i_1,\dots,i_m}] - D_{y^*, i_1,\dots,i_m}\big| \ge t\big\} \le 2\exp\left(-\frac{2t^2}{2^2\max(|E|,\, n^{m-1})}\right) = 2\exp\left(-\frac{t^2}{2\max(|E|,\, n^{m-1})}\right).$$
By a union bound over all $n^m$ entries, we obtain
$$\mathbb{P}\big\{\big\|\mathrm{vec}(\mathbb{E}[D_{y^*}] - D_{y^*})\big\|_\infty \ge t\big\} \le 2n^m\exp\left(-\frac{t^2}{2\max(|E|,\, n^{m-1})}\right).$$
Setting $t = \frac{(1-2p)\,\phi_G^m}{2n^{m/2}}$ leads to
$$\mathbb{P}\Big\{\big\langle \mathbb{E}[D_{y^*}] - D_{y^*},\, u^{\otimes m}\big\rangle \ge \frac{1-2p}{2}\phi_G^m\Big\} \le 2n^m\exp\left(-\frac{(1-2p)^2\,\phi_G^{2m}}{8n^m\max(|E|,\, n^{m-1})}\right). \tag{14}$$
Combining the results of (13) and (14) with a union bound completes the proof.