# Exact Inference in Structured Prediction

Kevin Bello, Department of Computer Science, Purdue University, West Lafayette, IN 47906, USA. kbellome@purdue.edu
Jean Honorio, Department of Computer Science, Purdue University, West Lafayette, IN 47906, USA. jhonorio@purdue.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Structured prediction can be thought of as a simultaneous prediction of multiple labels. This is often done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise and unary potentials. The above is naturally modeled with a graph, where edges and vertices are related to pairwise and unary potentials, respectively. We consider the generative process proposed by Globerson et al. (2015) and apply it to general connected graphs. We analyze the structural conditions of the graph that allow for the exact recovery of the labels. Our results show that exact recovery is possible and achievable in polynomial time for a large class of graphs. In particular, we show that graphs that are bad expanders can be exactly recovered by adding small edge perturbations coming from the Erdős–Rényi model. Finally, as a byproduct of our analysis, we provide an extension of Cheeger's inequality.

## 1 Introduction

Throughout the years, structured prediction has been continuously used in multiple domains such as computer vision, natural language processing, and computational biology. Examples of structured prediction problems include dependency parsing, image segmentation, part-of-speech tagging, named entity recognition, and protein folding. In this setting, the input X is some observation, e.g., a social network, an image, a sentence. The output is a labeling y, e.g., an assignment of each individual of a social network to a cluster, an assignment of each pixel in the image to foreground or background, or the parse tree for the sentence.

A common approach to structured prediction is to exploit local features to infer the global structure. For instance, one could include a feature that encourages two individuals of a social network to be assigned to different clusters whenever there is a strong disagreement in opinions about a particular subject. Then, one can define a posterior distribution over the set of possible labelings conditioned on the input. Some classical methods for learning the parameters of the model are conditional random fields (Lafferty et al. 2001) and structured support vector machines (Taskar et al. 2003, Tsochantaridis et al. 2005, Altun & Hofmann 2003). In this work we focus on the inference problem and assume that the model parameters have already been learned.

In the context of Markov random fields (MRFs), for an undirected graph G = (V, E), one is interested in finding a solution to the following inference problem:

$$\max_{y \in \mathcal{M}^{|V|}} \;\; \sum_{v \in V,\, m \in \mathcal{M}} c_v(m)\, \mathbf{1}[y_v = m] \;+\; \sum_{(u,v) \in E,\, m,n \in \mathcal{M}} c_{u,v}(m, n)\, \mathbf{1}[y_u = m,\, y_v = n], \qquad (1)$$

where $\mathcal{M}$ is the set of possible labels, $c_v(m)$ is the cost of assigning label $m$ to node $v$, and $c_{u,v}(m, n)$ is the cost of assigning labels $m$ and $n$ to the neighboring nodes $u$ and $v$, respectively. (In the literature, the cost functions $c_v$ and $c_{u,v}$ are also known as unary and pairwise potentials, respectively.) Similar inference problems arise in the context of statistical physics, sociology, community detection, average-case analysis, and graph partitioning.
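To make objective (1) concrete, the following minimal sketch evaluates it by exhaustive search on a toy graph. The graph, the specific unary/pairwise potentials, and the use of networkx are our own illustrative choices, not part of the paper; the point is only to show what the maximization in (1) computes.

```python
import itertools
import networkx as nx

# Toy instance of problem (1): binary labels on a small cycle, with
# hypothetical unary and pairwise potentials chosen only for illustration.
G = nx.cycle_graph(4)
labels = [-1, +1]

def unary(v, m):
    # c_v(m): reward label +1 on even-indexed nodes and -1 on odd-indexed nodes.
    return 1.0 if (m == +1) == (v % 2 == 0) else 0.0

def pairwise(m, n):
    # c_{u,v}(m, n): reward neighboring nodes that agree.
    return 0.5 if m == n else 0.0

def score(y):
    # Objective (1): unary terms over nodes plus pairwise terms over edges.
    return (sum(unary(v, y[v]) for v in G.nodes)
            + sum(pairwise(y[u], y[v]) for u, v in G.edges))

# Exhaustive maximization over all |M|^{|V|} labelings; this is exponential
# in general, which is why the tractable special cases below matter.
y_best = max(itertools.product(labels, repeat=G.number_of_nodes()), key=score)
print(y_best, score(y_best))
```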
Very few cases of the general MRF inference problem are known to be exactly solvable in polynomial time. For example, Chandrasekaran et al. (2008) showed that (1) can be solved exactly in polynomial time for a graph G with low treewidth via the junction tree algorithm. In the case of Ising models, Schraudolph & Kamenetsky (2009) showed that the inference problem can also be solved exactly in polynomial time for planar graphs via perfect matchings. Finally, polynomial-time solvability can also stem from properties of the pairwise potential; under this view, the inference problem can be solved exactly in polynomial time via graph cuts for binary labels and submodular pairwise potentials (Boykov & Veksler 2006).

Despite the intractability of maximum likelihood estimation, maximum a posteriori estimation, and marginal inference for most models in the worst case, the inference task seems to be easier in practice than the theoretical worst case. Approximate inference algorithms can be extremely effective, often obtaining state-of-the-art results for these structured prediction tasks. Some important theoretical and empirical work on approximate inference includes (Foster et al. 2018, Globerson et al. 2015, Kulesza & Pereira 2007, Sontag et al. 2012, Koo et al. 2010, Daumé et al. 2009). In particular, Globerson et al. (2015) analyze the hardness of approximate inference in the case where performance is measured through the Hamming error, and provide conditions for the minimum achievable Hamming error by studying a generative model. Similar to the objective (1), the authors in (Globerson et al. 2015) consider unary and pairwise noisy observations.

As a concrete example (Foster et al. 2018), consider the problem of trying to recover opinions of individuals in social networks. Suppose that every individual in a social network can hold one of two opinions, labeled −1 or +1. One observes a measurement of whether neighbors in the network agree in opinion, but the value of each measurement is flipped with probability p (pairwise observations). Additionally, one receives estimates of the opinion of each individual, perhaps using a classification model on their profile, but these estimates are corrupted with probability q (unary observations). Foster et al. (2018) generalize the work of Globerson et al. (2015), who provide results for grid lattices, by providing results for trees and general graphs that allow tree decompositions (e.g., hypergrids and ring lattices).

Note that the above problem is challenging since there is a statistical and computational trade-off, as in several machine learning problems. The statistical part focuses on giving highly accurate labels while ignoring computational constraints. In practice this is unrealistic: one cannot afford to wait long times for each prediction, which has motivated several studies on this trade-off (e.g., Chandrasekaran & Jordan (2013), Bello & Honorio (2018)). However, while the statistical and computational trade-off appears in general, an interesting question is whether there are conditions under which recovery of the true labels is achievable in polynomial time. That is, conditions under which the Hamming error of the prediction is zero and the prediction can be obtained efficiently. The present work addresses this question. In contrast to (Globerson et al. 2015, Foster et al. 2018), we study sufficient conditions for exact recovery in polynomial time, and provide high-probability results for general families of undirected connected graphs, which we consider to be a novel result to the best of our knowledge.
In particular, we show that weak-expander graphs (e.g., grids) can be exactly recovered by adding small perturbations (edges coming from the Erdős–Rényi model with small probability). Also, as a byproduct of our analysis, we provide an extension of Cheeger's inequality (Cheeger 1969). Finally, another work in this line was done by Chen et al. (2016), where the authors consider exact recovery for edges on sparse graphs such as grids and rings. However, (Chen et al. 2016) consider the case where one has multiple i.i.d. observations of edge labels. In contrast, we focus on the case where there is a single (noisy) observation of each edge and node in the graph.

## 2 Notation and Problem Formulation

This section introduces the notation used throughout the paper and formally defines the problem under analysis. Vectors and matrices are denoted by lowercase and uppercase bold-faced letters, respectively (e.g., a, A), while scalars are in normal font weight (e.g., a). Moreover, random variables are written in upright shape (e.g., $\mathrm{a}$, $\mathrm{A}$). For a random vector $\mathrm{a}$ and a random matrix $\mathrm{A}$, their entries are denoted by $\mathrm{a}_i$ and $\mathrm{A}_{i,j}$, respectively. Indexing starts at 1, with $\mathrm{A}_{i,:}$ and $\mathrm{A}_{:,i}$ indicating the $i$-th row and $i$-th column of $\mathrm{A}$, respectively. Finally, sets and tuples are both expressed in uppercase calligraphic fonts and shall be distinguished by the context. For example, $\mathbb{R}$ will denote the set of real numbers.

We now present the inference task. We consider a problem setting similar to the one in (Globerson et al. 2015), with the only difference that we consider general undirected graphs. That is, the goal is to predict a vector of $n$ node labels $y = (y_1, \ldots, y_n)$, where $y_i \in \{+1, -1\}$, from a set of observations $\mathrm{X}$ and $\mathrm{c}$, where $\mathrm{X}$ and $\mathrm{c}$ correspond to corrupted measurements of edges and nodes, respectively. These observations are assumed to be generated from a ground-truth labeling $y^*$ by a generative process defined via an undirected connected graph $G = (V, E)$, an edge noise $p \in (0, 0.5)$, and a node noise $q \in (0, 0.5)$. For each edge $(u, v) \in E$, the edge observation $\mathrm{X}_{u,v}$ is independently sampled to be $y^*_u y^*_v$ (good edge) with probability $1 - p$, and $-y^*_u y^*_v$ (bad edge) with probability $p$; for each pair $(u, v) \notin E$, the observation $\mathrm{X}_{u,v}$ is always 0. Similarly, for each node $u \in V$, the node observation $\mathrm{c}_u$ is independently sampled to be $y^*_u$ (good node) with probability $1 - q$, and $-y^*_u$ (bad node) with probability $q$. Thus, we have a known undirected connected graph $G$, an unknown ground-truth label vector $y^* \in \{+1, -1\}^n$, and noisy observations $\mathrm{X} \in \{-1, 0, +1\}^{n \times n}$ and $\mathrm{c} \in \{-1, +1\}^n$, and our goal is to predict a label vector $y \in \{-1, +1\}^n$.

Definition 1 (Biased Rademacher variable). Let $\mathrm{z}_p \in \{+1, -1\}$ be such that $P(\mathrm{z}_p = +1) = 1 - p$ and $P(\mathrm{z}_p = -1) = p$. We call $\mathrm{z}_p$ a biased Rademacher random variable with parameter $p$; its expected value is $1 - 2p$.

From the definition above, we can write the edge observations as $\mathrm{X}_{u,v} = y^*_u y^*_v\, \mathrm{z}^{(u,v)}_p\, \mathbf{1}[(u, v) \in E]$, where $\mathrm{z}^{(u,v)}_p$ is a biased Rademacher variable with parameter $p$, while the node observation is $\mathrm{c}_u = y^*_u\, \mathrm{z}^{(u)}_q$, where $\mathrm{z}^{(u)}_q$ is a biased Rademacher variable with parameter $q$.

Given the generative process, we aim to solve the following optimization problem, which is based on the maximum likelihood estimator that returns the label $\arg\max_y P(\mathrm{X}, \mathrm{c} \mid y)$ (see Globerson et al. (2015)):

$$\max_{y} \;\; \tfrac{1}{2}\, y^\top \mathrm{X} y + \alpha\, \mathrm{c}^\top y \quad \text{subject to} \quad y_i^2 = 1, \qquad (2)$$

where $\alpha = \log\tfrac{1-q}{q} \big/ \log\tfrac{1-p}{p}$. In general, the above combinatorial problem is NP-hard to compute (e.g., see (Barahona 1982) for results on grids).
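The generative process and objective (2) translate directly into code. The sketch below samples the noisy observations X and c from a ground-truth labeling and evaluates the score of a candidate labeling; it assumes numpy/networkx and is only an illustration of the model described above, not code from the paper.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def sample_observations(G, y_true, p, q):
    """Sample the noisy edge matrix X and node vector c of Section 2.

    Each edge observation is y*_u y*_v flipped with probability p (and 0 for
    non-edges); each node observation is y*_u flipped with probability q.
    """
    n = len(y_true)
    X = np.zeros((n, n))
    for u, v in G.edges:
        z_p = 1 if rng.random() > p else -1        # biased Rademacher, parameter p
        X[u, v] = X[v, u] = y_true[u] * y_true[v] * z_p
    z_q = np.where(rng.random(n) > q, 1, -1)       # biased Rademacher, parameter q
    c = y_true * z_q
    return X, c

def objective(y, X, c, p, q):
    # Objective (2); alpha weighs the node observations against the edge observations.
    alpha = np.log((1 - q) / q) / np.log((1 - p) / p)
    return 0.5 * y @ X @ y + alpha * c @ y

# Small illustration (the model assumes G is connected; this ER draw almost surely is).
n = 20
G = nx.erdos_renyi_graph(n, 0.3, seed=1)
y_true = rng.choice([-1, 1], size=n)
X, c = sample_observations(G, y_true, p=0.1, q=0.2)
print(objective(y_true, X, c, p=0.1, q=0.2))
```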
Our goal is to find what structural properties of the graph G suffice to achieve, with high probability, exact recovery in polynomial time.

## 3 On Exact Recovery of Labels

Our approach consists of two stages, similar in spirit to (Globerson et al. 2015). We first use only the quadratic term from (2), which gives us two possible solutions, and then, as a second stage, the linear term is used to decide which of these two solutions is best.

### 3.1 First Stage

We analyze a semidefinite programming (SDP) relaxation of the following combinatorial problem, motivated by the techniques in (Abbe et al. 2016):

$$\max_{y} \;\; \tfrac{1}{2}\, y^\top \mathrm{X} y \quad \text{subject to} \quad y_i^2 = 1. \qquad (3)$$

We denote the degree of node $i$ by $\Delta_i$, and the maximum node degree by $\Delta_{\max} = \max_{i \in V} \Delta_i$. For any subset $S \subseteq V$, we denote its complement by $S^C$, so that $S \cup S^C = V$ and $S \cap S^C = \emptyset$. Furthermore, let $E(S, S^C) = \{(i, j) \in E \mid i \in S, j \in S^C \text{ or } j \in S, i \in S^C\}$, i.e., $|E(S, S^C)|$ denotes the number of edges between $S$ and $S^C$.

Definition 2 (Edge expansion). For a set $S \subset V$ with $|S| \le n/2$, its edge expansion $\phi_S$ is defined as $\phi_S = |E(S, S^C)| / |S|$. The edge expansion of a graph $G = (V, E)$ is then defined as $\phi_G = \min_{S \subset V,\, |S| \le n/2} \phi_S$.

In the literature, $\phi_G$ is also known as the Cheeger constant, due to the geometric analogue defined by Cheeger in (Cheeger 1969). Next, we define the Laplacian matrix of a graph and the Rayleigh quotient, which are also used throughout this section.

Definition 3 (Laplacian matrix). For a graph $G = (V, E)$ of $n$ nodes, the Laplacian matrix $L$ is defined as $L = D - A$, where $D$ is the degree matrix and $A$ is the adjacency matrix.

Definition 4 (Rayleigh quotient). For a given symmetric matrix $M \in \mathbb{R}^{n \times n}$ and non-zero vector $a \in \mathbb{R}^n$, the Rayleigh quotient $R_M(a)$ is defined as $R_M(a) = \frac{a^\top M a}{a^\top a}$.

We now define a signed Laplacian matrix.

Definition 5 (Signed Laplacian matrix). For a graph $G = (V, E)$ of $n$ nodes, a signed Laplacian matrix $M$ is a symmetric matrix that satisfies $x^\top M x = \sum_{(i,j) \in E} (y_i x_i - y_j x_j)^2$ for all $x$, where $y$ is an eigenvector of $M$ with eigenvalue 0 and $y_i \in \{+1, -1\}$.

Note that the typical Laplacian matrix, as in Definition 3, fulfills the conditions of Definition 5 with $y_i = +1$ for all $i$. Next, we present an intermediate result for later use.

Lemma 1. Let $G = (V, E)$ be an undirected graph of $n$ nodes with Laplacian $L$. Let $M \in \mathbb{R}^{n \times n}$ be a signed Laplacian with eigenvector $y$ as in Definition 5, and let $a \in \mathbb{R}^n$ be a vector such that $\langle y, a \rangle = 0$. Finally, let $\mathbf{1} \in \mathbb{R}^n$ be the vector of all ones. Then, for any given $\delta \in \mathbb{R}$,
$$R_L(a \circ y + \delta \mathbf{1}) \le R_M(a),$$
where the operator $\circ$ denotes the Hadamard product.

Proof. First, note that $L$ has a 0 eigenvalue with corresponding eigenvector $\mathbf{1}$. Also, we have that $x^\top L x = \sum_{(i,j) \in E} (x_i - x_j)^2$ for any vector $x$. Then,
$$(a \circ y + \delta \mathbf{1})^\top L\, (a \circ y + \delta \mathbf{1}) = \sum_{(i,j) \in E} \big((y_i a_i + \delta) - (y_j a_j + \delta)\big)^2 = \sum_{(i,j) \in E} (y_i a_i - y_j a_j)^2 = a^\top M a.$$
Therefore, the numerators of $R_L(a \circ y + \delta \mathbf{1})$ and $R_M(a)$ are equal. For the denominators, one can observe that
$$(a \circ y + \delta \mathbf{1})^\top (a \circ y + \delta \mathbf{1}) = (a \circ y)^\top (a \circ y) + 2\delta \langle \mathbf{1}, a \circ y \rangle + \delta^2\, \mathbf{1}^\top \mathbf{1} = \sum_i a_i^2 y_i^2 + 2\delta \langle a, y \rangle + \delta^2 n = a^\top a + \delta^2 n \ge a^\top a,$$
which implies that $R_L(a \circ y + \delta \mathbf{1}) \le R_M(a)$.

In what follows, we present our first result, which has a connection to Cheeger's inequality (Cheeger 1969).

Theorem 1. Let $G, M, L, y$ be defined as in Lemma 1, and let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $M$. Then, we have that
$$\frac{\phi_G^2}{4 \Delta_{\max}} \le \lambda_2.$$

Proof. Since $y$ is an eigenvector of $M$ with eigenvalue 0, and $M$ is a symmetric matrix, we can express $\lambda_2$ using the variational characterization of eigenvalues as follows:
$$\lambda_2 = \min_{a \in \mathbb{R}^n,\; a^\top y = 0} R_M(a), \qquad (4)$$
where we used the fact that $y$ is orthogonal to all the other eigenvectors, by the spectral theorem. Let $a$ be the eigenvector associated with $\lambda_2$, i.e., $M a = \lambda_2 a$ and $a^\top y = 0$. Then, by Lemma 1, we have that
$$R_L(a \circ y + \delta \mathbf{1}) \le R_M(a) = \lambda_2. \qquad (5)$$
Next, we choose $\delta \in \mathbb{R}$ such that $\{a_1 y_1 + \delta,\, a_2 y_2 + \delta,\, \ldots,\, a_n y_n + \delta\}$ has median 0. The reason for the zero median is to later ensure that the subset of vertices $S$ has at most $n/2$ vertices. Let $w = a \circ y + \delta \mathbf{1}$. From equation (5), we have that $R_L(w) \le \lambda_2$. Let $w^+ = (w^+_i)$ be such that $w^+_i = w_i$ if $w_i \ge 0$ and $w^+_i = 0$ otherwise; let $w^- = (w^-_i)$ be such that $w^-_i = w_i$ if $w_i \le 0$ and $w^-_i = 0$ otherwise. Then, we have that either $R_L(w^+) \le 2 R_L(w)$ or $R_L(w^-) \le 2 R_L(w)$. Suppose, without loss of generality, that $R_L(w^+) \le 2 R_L(w)$; it then follows that $R_L(w^+) \le 2 \lambda_2$. Let us scale $w^+$ by some constant $\beta \in \mathbb{R}$ so that $\{\beta w^+_1, \beta w^+_2, \ldots, \beta w^+_n\} \subseteq [0, 1]$. It is clear that $R_L(w^+) = R_L(\beta w^+)$; therefore, we will still use $w^+$ to denote the rescaled vector. That is, the entries of the vector $w^+$ are now between 0 and 1.

Next, we show that there exists a set $S \subset V$ with $|S| \le n/2$ such that $|E(S, S^C)|/|S| \le 2\sqrt{\lambda_2 \Delta_{\max}}$. We construct the set $S$ as follows: we choose $t \in [0, 1]$ uniformly at random and let $S = \{i \mid (w^+_i)^2 \ge t\}$. Let $\mathrm{B}_{i,j} = 1$ if $i \in S$ and $j \in S^C$, or if $j \in S$ and $i \in S^C$, and $\mathrm{B}_{i,j} = 0$ otherwise. Then,
$$\mathbb{E}[|E(S, S^C)|] = \mathbb{E}\Big[\sum_{(i,j) \in E} \mathrm{B}_{i,j}\Big] = \sum_{(i,j) \in E} \mathbb{E}[\mathrm{B}_{i,j}] = \sum_{(i,j) \in E} P\big(t \text{ lies between } (w^+_i)^2 \text{ and } (w^+_j)^2\big).$$
Recall that $(w^+_i)^2 \in [0, 1]$; therefore, the probability above equals $|(w^+_i)^2 - (w^+_j)^2|$. Thus,
$$\mathbb{E}[|E(S, S^C)|] = \sum_{(i,j) \in E} |w^+_i - w^+_j|\,|w^+_i + w^+_j| \;\le\; \sqrt{\sum_{(i,j) \in E} (w^+_i - w^+_j)^2}\; \sqrt{\sum_{(i,j) \in E} (w^+_i + w^+_j)^2} \qquad (6)$$
$$\le\; \sqrt{\sum_{(i,j) \in E} (w^+_i - w^+_j)^2}\; \sqrt{2 \sum_{(i,j) \in E} \big((w^+_i)^2 + (w^+_j)^2\big)} \;\le\; \sqrt{\sum_{(i,j) \in E} (w^+_i - w^+_j)^2}\; \sqrt{2\, \Delta_{\max} \sum_i (w^+_i)^2}, \qquad (7)$$
where eq. (6) is due to the Cauchy–Schwarz inequality and eq. (7) uses the maximum degree of a node as an upper bound. Now consider another random variable $\mathrm{b}_i$ such that $\mathrm{b}_i = 1$ if $i \in S$ and $\mathrm{b}_i = 0$ otherwise. Then we have that $\mathbb{E}[|S|] = \sum_i \mathbb{E}[\mathrm{b}_i] = \sum_i P\big(t \le (w^+_i)^2\big) = \sum_i (w^+_i)^2$. Thus,
$$\frac{\mathbb{E}[|E(S, S^C)|]}{\mathbb{E}[|S|]} \;\le\; \frac{\sqrt{\sum_{(i,j) \in E} (w^+_i - w^+_j)^2 \cdot 2\, \Delta_{\max} \sum_i (w^+_i)^2}}{\sum_i (w^+_i)^2} \;=\; \sqrt{2\, R_L(w^+)\, \Delta_{\max}} \;\le\; 2 \sqrt{\lambda_2 \Delta_{\max}}.$$
Since $\mathbb{E}\big[|E(S, S^C)| - 2\sqrt{\lambda_2 \Delta_{\max}}\, |S|\big] \le 0$, there exists a realization of $S$ (with $|S| \le n/2$, by the zero-median choice of $\delta$) such that $\phi_S \le 2\sqrt{\lambda_2 \Delta_{\max}}$. Therefore, $\phi_G \le 2\sqrt{\lambda_2 \Delta_{\max}}$, or equivalently $\frac{\phi_G^2}{4 \Delta_{\max}} \le \lambda_2$.

Remark 1. For a given undirected graph $G$, its Laplacian matrix $L$ fulfills the conditions of Lemma 1 and Theorem 1. That is, if $M = L$ in Theorem 1, then the theorem becomes the well-known Cheeger inequality. Therefore, our result in Theorem 1 applies to more general matrices and is of use for our next result.
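As a quick numerical sanity check of Theorem 1, the sketch below builds a signed Laplacian for a small path graph (a poor expander), computes its second smallest eigenvalue, and compares it with a brute-force Cheeger constant. The graph, the sign pattern y, and the use of numpy/networkx are illustrative assumptions of ours, not material from the paper.

```python
import itertools
import numpy as np
import networkx as nx

G = nx.path_graph(6)                       # a deliberately poor expander
n = G.number_of_nodes()
y = np.array([1, -1, 1, 1, -1, 1])         # arbitrary sign pattern in {+1, -1}^n

# Signed Laplacian (Definition 5): x^T M x = sum over edges of (y_i x_i - y_j x_j)^2,
# and M y = 0 by construction.
M = np.zeros((n, n))
for i, j in G.edges:
    M[i, i] += 1
    M[j, j] += 1
    M[i, j] -= y[i] * y[j]
    M[j, i] -= y[i] * y[j]

lam2 = np.sort(np.linalg.eigvalsh(M))[1]   # second smallest eigenvalue of M

# Cheeger constant phi_G by brute force (only feasible for tiny graphs).
def edge_expansion(S):
    boundary = sum(1 for u, v in G.edges if (u in S) != (v in S))
    return boundary / len(S)

phi = min(edge_expansion(set(S))
          for k in range(1, n // 2 + 1)
          for S in itertools.combinations(G.nodes, k))

d_max = max(d for _, d in G.degree)
assert phi**2 / (4 * d_max) <= lam2 + 1e-9   # the inequality of Theorem 1
print(phi, d_max, lam2)
```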
We now provide the SDP relaxation of problem (3). Let $Y = y y^\top$; we have that $y^\top \mathrm{X} y = \mathrm{Tr}(\mathrm{X} Y) = \langle \mathrm{X}, Y \rangle$. Since our prediction is a column vector $y$, the matrix $y y^\top$ is rank-one and symmetric, which implies that $Y$ is a positive semidefinite matrix. Therefore, our relaxation of the combinatorial problem (3) results in the following primal formulation (here we dropped the constant 1/2 since it does not change the decision problem):

$$\max_{Y} \;\; \langle \mathrm{X}, Y \rangle \quad \text{subject to} \quad Y_{ii} = 1, \;\; Y \succeq 0. \qquad (8)$$
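A minimal sketch of the first stage via (8) is given below, using cvxpy as an off-the-shelf SDP solver; the solver choice and the rounding step are our own illustrative assumptions, since the paper does not prescribe an implementation. The rounding assumes the optimum is (close to) rank one, which Theorem 2 below guarantees with high probability under its conditions.

```python
import numpy as np
import cvxpy as cp

def sdp_first_stage(X):
    """Solve relaxation (8) and round the result to a labeling in {-1, +1}^n."""
    n = X.shape[0]
    Y = cp.Variable((n, n), symmetric=True)
    constraints = [Y >> 0, cp.diag(Y) == 1]          # Y positive semidefinite, unit diagonal
    cp.Problem(cp.Maximize(cp.trace(X @ Y)), constraints).solve()
    # When the optimum is exactly y* y*^T, the leading eigenvector recovers y*
    # up to a global sign; otherwise this is a heuristic rounding.
    _, V = np.linalg.eigh(Y.value)
    y_hat = np.sign(V[:, -1])
    y_hat[y_hat == 0] = 1.0
    return y_hat
```

For the observations X sampled in the Section 2 sketch, `sdp_first_stage(X)` returns either y* or −y* whenever the relaxation is tight; the analysis that follows characterizes when that happens.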
We will make use of the following matrix concentration inequality for our main proof.

Lemma 2 (Matrix Bernstein inequality, Theorem 1.4 in (Tropp 2012)). Consider a finite sequence $\{\mathrm{N}_k\}$ of independent, random, self-adjoint matrices of dimension $n$. Assume that each random matrix satisfies $\mathbb{E}[\mathrm{N}_k] = 0$ and $\lambda_{\max}(\mathrm{N}_k) \le R$ almost surely. Then, for all $t \ge 0$,
$$P\Big(\lambda_{\max}\Big(\sum_k \mathrm{N}_k\Big) \ge t\Big) \;\le\; n \exp\Big(\frac{-t^2/2}{\sigma^2 + R t/3}\Big), \quad \text{where } \sigma^2 = \Big\|\sum_k \mathbb{E}[\mathrm{N}_k^2]\Big\|.$$

The next theorem includes our main result and provides the conditions for exact recovery of labels with high probability.

Theorem 2. Let $G = (V, E)$ be an undirected connected graph with $n$ nodes, Cheeger constant $\phi_G$, and maximum node degree $\Delta_{\max}$. Then, for the combinatorial problem (3), a solution $y \in \{y^*, -y^*\}$ is achievable in polynomial time by solving the SDP relaxation (8), with probability at least $1 - \epsilon_1(\phi_G, \Delta_{\max}, p)$, where $p$ is the edge noise from our model and
$$\epsilon_1(\phi_G, \Delta_{\max}, p) = 2n \exp\left(\frac{-3(1-2p)^2 \phi_G^4}{1536\, \Delta_{\max}^3\, p(1-p) + 32(1-2p)(1-p)\phi_G^2\, \Delta_{\max}}\right).$$

Proof. Without loss of generality, assume that $y = y^*$. The first step of our proof is to find sufficient conditions under which $Y = y y^\top$ is the unique optimal solution to the SDP (8), for which we make use of the Karush–Kuhn–Tucker (KKT) optimality conditions (Boyd & Vandenberghe 2004). The dual formulation of the SDP (8) is:
$$\min_{V} \;\; \mathrm{Tr}(V) \quad \text{subject to} \quad V \succeq \mathrm{X}, \;\; V \text{ is diagonal}. \qquad (9)$$
Thus, $Y = y y^\top$ is guaranteed to be an optimal solution under the following conditions:

1. $y y^\top$ is a feasible solution to the primal problem (8).
2. There exists a matrix $V$ feasible for the dual formulation (9) such that $\mathrm{Tr}(\mathrm{X} y y^\top) = \mathrm{Tr}(V)$.

The first point is trivially verified. For the second point, we assume strong duality in order to find a dual certificate. To achieve that, we set $\mathrm{V}_{i,i} = (\mathrm{X} Y)_{i,i}$ (note that we now write $\mathrm{V}$ in upright shape since it contains randomness from $\mathrm{X}$). If $\mathrm{V} - \mathrm{X} \succeq 0$, then the matrix $\mathrm{V}$ is a feasible solution to the dual formulation. Thus, our first condition is to have $\mathrm{V} - \mathrm{X} \succeq 0$, and we conclude that $y y^\top$ is an optimal solution to the SDP (8).

To show that $y y^\top$ is the unique optimal solution, it suffices to have $\lambda_2(\mathrm{V} - \mathrm{X}) > 0$. Suppose that $\widehat{Y}$ is another optimal solution to the SDP (8). Then, from complementary slackness we have that $\langle \mathrm{V} - \mathrm{X}, \widehat{Y} \rangle = 0$, and from primal feasibility $\widehat{Y} \succeq 0$. Moreover, notice that $(\mathrm{V} - \mathrm{X}) y = 0$, i.e., $y$ is an eigenvector of $\mathrm{V} - \mathrm{X}$ with eigenvalue 0. By assumption, the second smallest eigenvalue of $\mathrm{V} - \mathrm{X}$ is greater than 0; therefore, $y$ spans all of its null space. This fact, combined with complementary slackness and primal and dual feasibility, entails that $\widehat{Y}$ is a multiple of $y y^\top$. Thus, we must have $\widehat{Y} = y y^\top$ because $\widehat{Y}_{i,i} = 1$. From the points above we arrive at the following two sufficient conditions:
$$\mathrm{V} - \mathrm{X} \succeq 0 \quad \text{and} \quad \lambda_2(\mathrm{V} - \mathrm{X}) > 0. \qquad (10)$$

Our next step is to show when condition (10) is fulfilled with high probability. Since $y$ is an eigenvector of $\mathrm{V} - \mathrm{X}$ with eigenvalue zero, showing that $\lambda_2(\mathrm{V} - \mathrm{X}) > 0$ will imply that $\mathrm{V} - \mathrm{X}$ is positive semidefinite. Therefore, we focus on controlling its second smallest eigenvalue. Next, by Weyl's inequality we have that
$$\lambda_2(\mathrm{V} - \mathrm{X}) \;=\; \lambda_2\big((\mathrm{V} - \mathrm{X}) - \mathbb{E}[\mathrm{V} - \mathrm{X}] + \mathbb{E}[\mathrm{V} - \mathrm{X}]\big) \;\ge\; \lambda_1(\mathrm{V} - \mathbb{E}[\mathrm{V}]) + \lambda_1(\mathbb{E}[\mathrm{X}] - \mathrm{X}) + \lambda_2(\mathbb{E}[\mathrm{V} - \mathrm{X}]),$$
so it suffices to show that
$$\lambda_1(\mathrm{V} - \mathbb{E}[\mathrm{V}]) + \lambda_1(\mathbb{E}[\mathrm{X}] - \mathrm{X}) + \lambda_2(\mathbb{E}[\mathrm{V} - \mathrm{X}]) > 0. \qquad (11)$$
We now focus on condition (11), since it implies $\lambda_2(\mathrm{V} - \mathrm{X}) > 0$. For the first two summands of condition (11) we make use of Lemma 2, while for the third summand we make use of Theorem 1. From $\mathrm{V}_{i,i} = (\mathrm{X} Y)_{i,i}$, we have that $\mathrm{V}_{i,i} = y_i \mathrm{X}_{i,:}\, y$, thus $\mathrm{V}_{i,i} = \sum_{j=1}^n y_i y_j \mathrm{X}_{i,j} = \sum_{j=1}^n \mathrm{z}^{(i,j)}_p \mathbf{1}[(i,j) \in E]$. Then, its expected value is $\mathbb{E}[\mathrm{V}_{i,i}] = \Delta_i (1 - 2p)$.

Bounding the third summand of condition (11). Our goal is to find a non-zero lower bound for the second smallest eigenvalue of $\mathbb{E}[\mathrm{V} - \mathrm{X}]$. Notice that $\mathbb{E}[\mathrm{V} - \mathrm{X}] \succeq 0$ since it is a diagonally dominant matrix, and $y$ is its first eigenvector with eigenvalue 0, i.e., $\lambda_1(\mathbb{E}[\mathrm{V} - \mathrm{X}]) = 0$. Write $M = \mathbb{E}[\mathrm{V} - \mathrm{X}]$; we now focus on finding a lower bound for $\lambda_2(M)$. We use the fact that, for any vector $a \in \mathbb{R}^n$, $a^\top M a = (1-2p) \sum_{(i,j) \in E} (y_i a_i - y_j a_j)^2$. We also note that $M$ has a 0 eigenvalue with eigenvector $y$. Thus, the matrix $M/(1-2p)$ satisfies the conditions of Theorem 1, and we have that $\lambda_2(M/(1-2p)) \ge \frac{\phi_G^2}{4 \Delta_{\max}}$. We conclude that
$$\lambda_2(\mathbb{E}[\mathrm{V} - \mathrm{X}]) \ge (1-2p)\, \frac{\phi_G^2}{4 \Delta_{\max}}. \qquad (12)$$

Bounding the first summand of condition (11).
Let $\mathrm{N}^{(i,j)}_p = \mathrm{z}^{(i,j)}_p (e_i e_i^\top + e_j e_j^\top)$, where $e_i$ is the $i$-th standard basis vector, i.e., the vector of all zeros except the $i$-th entry, which is 1. We can now write $\mathrm{V} = \sum_{(i,j) \in E} \mathrm{N}^{(i,j)}_p$. Then we have a sequence of independent random matrices $\{\mathbb{E}[\mathrm{N}^{(i,j)}_p] - \mathrm{N}^{(i,j)}_p\}$, for which $\lambda_{\max}\big(\mathbb{E}[\mathrm{N}^{(i,j)}_p] - \mathrm{N}^{(i,j)}_p\big) \le 2(1-p)$ and $\big\|\sum_{(i,j) \in E} \mathbb{E}\big[(\mathbb{E}[\mathrm{N}^{(i,j)}_p] - \mathrm{N}^{(i,j)}_p)^2\big]\big\| \le 4 \Delta_{\max}\, p(1-p)$. Next, we use the fact that $\lambda_{\max}(A) = -\lambda_1(-A)$ for any matrix $A$. Then, by applying Lemma 2, we obtain:
$$P\left(\lambda_1\big(\mathrm{V} - \mathbb{E}[\mathrm{V}]\big) \le -\frac{(1-2p)\phi_G^2}{8 \Delta_{\max}}\right) \le n \exp\left(\frac{-3(1-2p)^2 \phi_G^4}{1536\, \Delta_{\max}^3\, p(1-p) + 32(1-2p)(1-p)\phi_G^2\, \Delta_{\max}}\right). \qquad (13)$$

Bounding the second summand of condition (11). Using arguments similar to the concentration above, we now analyze $\lambda_1(\mathbb{E}[\mathrm{X}] - \mathrm{X})$. Let $\mathrm{H}^{(i,j)} = \mathrm{X}_{i,j}(e_i e_j^\top + e_j e_i^\top)$. Then we have a sequence of independent random matrices $\{\mathrm{H}^{(i,j)} - \mathbb{E}[\mathrm{H}^{(i,j)}]\}$ and we can write $\mathrm{X} = \sum_{(i,j) \in E} \mathrm{H}^{(i,j)}$. Finally, we have that $\lambda_{\max}\big(\mathrm{H}^{(i,j)} - \mathbb{E}[\mathrm{H}^{(i,j)}]\big) \le 2(1-p)$ and $\mathbb{E}\big[(\mathrm{H}^{(i,j)} - \mathbb{E}[\mathrm{H}^{(i,j)}])^2\big] = 4p(1-p)(e_i e_i^\top + e_j e_j^\top)$. Thus, $\big\|\sum_{(i,j) \in E} \mathbb{E}\big[(\mathrm{H}^{(i,j)} - \mathbb{E}[\mathrm{H}^{(i,j)}])^2\big]\big\| \le 4 \Delta_{\max}\, p(1-p)$, and by applying Lemma 2 we obtain:
$$P\left(\lambda_1\big(\mathbb{E}[\mathrm{X}] - \mathrm{X}\big) \le -\frac{(1-2p)\phi_G^2}{8 \Delta_{\max}}\right) \le n \exp\left(\frac{-3(1-2p)^2 \phi_G^4}{1536\, \Delta_{\max}^3\, p(1-p) + 32(1-2p)(1-p)\phi_G^2\, \Delta_{\max}}\right). \qquad (14)$$

Note that the thresholds in the concentrations above are motivated by equation (12). Finally, combining equations (12), (13), and (14) via a union bound, we have that
$$P\big(\lambda_2(\mathrm{V} - \mathrm{X}) > 0\big) \ge 1 - 2n \exp\left(\frac{-3(1-2p)^2 \phi_G^4}{1536\, \Delta_{\max}^3\, p(1-p) + 32(1-2p)(1-p)\phi_G^2\, \Delta_{\max}}\right),$$
which concludes our proof.

Regarding the statistical part of Theorem 2, it is natural to ask under what conditions we obtain a high-probability statement. For example, one can observe that if $\phi_G^2/\Delta_{\max} \in \Omega(n)$ then there is an exponential decay in the probability of error. As another example, if $\Delta_{\max} \in O(\sqrt{n})$ and $\phi_G^2/\Delta_{\max} \in \Omega(\sqrt{n})$ then we also obtain a high-probability statement. Thus, we are interested in finding which classes of graphs fulfill these or other structural properties so that we obtain a high-probability bound in Theorem 2. Regarding the computational complexity of exact recovery, from Theorem 2 we are solving an SDP, and any SDP can be solved in polynomial time using methods such as interior point methods.

### 3.2 Second Stage

After the first stage, we obtain two feasible solutions for problem (3), that is, $y \in \{y^*, -y^*\}$. To decide which solution is correct we use the node observations $\mathrm{c}$. Specifically, we output the vector $y$ that maximizes the score $\mathrm{c}^\top y$. The next theorem formally states that, with high probability, $y = y^*$ maximizes the score $\mathrm{c}^\top y$ for a sufficiently large $n$.

Theorem 3. Let $y \in \{y^*, -y^*\}$. Then, with probability at least $1 - \epsilon_2(n, q)$, we have that $\mathrm{c}^\top y^* = \max_{y \in \{y^*, -y^*\}} \mathrm{c}^\top y$, where $\epsilon_2(n, q) = e^{-\frac{n}{2}(1-2q)^2}$ and $q$ is the node noise.

The remaining proofs of our manuscript can be found in Appendix A.

Remark 2. From Theorems 2 and 3, we obtain that exact recovery (i.e., $y = y^*$) is achievable with probability at least $1 - \epsilon_1(\phi_G, \Delta_{\max}, p) - \epsilon_2(n, q)$.
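The second stage of Theorem 3 reduces to a single inner product. Combined with the hypothetical `sdp_first_stage` from the earlier sketch, the full two-stage procedure could look as follows (again only an illustration, not the paper's code):

```python
import numpy as np

def second_stage(y_hat, c):
    # Among {y_hat, -y_hat}, return the one with the larger score c^T y (Theorem 3).
    return y_hat if c @ y_hat >= 0 else -y_hat

# End-to-end, using X, c, y_true from the Section 2 sketch:
# y_pred = second_stage(sdp_first_stage(X), c)
# exact_recovery = np.array_equal(y_pred, y_true)
```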
Finally, from Theorem 3 it is clear that, since the parameter $q \in (0, 0.5)$, for a sufficiently large $n$ we have an exponential decay of the probability of error $\epsilon_2$. Thus, we focus on the conditions of the first stage and provide examples in the next section.

## 4 Examples of Graphs for Exact Recovery

In this section, we provide examples of classes of graphs that yield a high-probability guarantee in Theorem 2. Perhaps the most important example we provide in this section is related to smoothed analysis on connected graphs (Krivelevich et al. 2015). Consider any fixed graph $G = (V, E)$ and let $\widetilde{E}$ be a random set of edges over the same set of vertices $V$, where each edge $e \in \widetilde{E}$ is independently drawn according to the Erdős–Rényi model with probability $\varepsilon/n$, where $\varepsilon$ is a small (fixed) positive constant. We denote this as $\widetilde{E} \sim \mathrm{ER}(n, \varepsilon/n)$, and let $\widetilde{G} = (V, E \cup \widetilde{E})$ denote the random graph with the edge set $\widetilde{E}$ added.

The model above can be considered a generalization of the classical Erdős–Rényi random graph, where one starts from an empty graph (i.e., $G = (V, \emptyset)$) and adds edges between all possible pairs of vertices independently with a given probability. The focus on small $\varepsilon$ means that we are interested in the effect of a rather gentle random perturbation. In particular, it is known that graphs with bad expansion are not suitable for exact inference (see, for instance, (Abbe et al. 2014)), but certain classes such as grids or planar graphs can yield good approximation under some regimes despite being bad expanders, as shown by Globerson et al. (2015). Here we consider the graph $G$ to be a bad expander and show that, with a small perturbation, exact inference is achievable. The following result was presented by (Krivelevich et al. 2015) in an equivalent fashion. (Specifically, we set $\alpha = 1/2$, $\delta = \varepsilon/256$, $K = 128/\varepsilon$, $C = 1$, and $s = K \log n$, which results in all the conditions being fulfilled in the proof of Theorem 2 in (Krivelevich et al. 2015).)

Lemma 3 (Theorem 2 in (Krivelevich et al. 2015)). Let $G = (V, E)$ be a connected graph, choose $\widetilde{E} \sim \mathrm{ER}(n, \varepsilon/n)$, and let $\widetilde{G} = (V, E \cup \widetilde{E})$. Then, for every $\varepsilon \in [1, n]$, we have that $\phi_{\widetilde{G}} \ge \frac{\varepsilon}{256 + 256 \log n}$, with probability at least $1 - n^{-2.2 \log \varepsilon}$.

The above lemma allows us to lower bound the Cheeger constant of the random graph $\widetilde{G}$ with high probability, and is of use for our first example.

Corollary 1. Let $G = (V, E)$ be any connected graph, choose $\widetilde{E} \sim \mathrm{ER}(n, \log^8 n / n)$, let $\widetilde{G} = (V, E \cup \widetilde{E})$, and let $\Delta_{\max}^{\widetilde{G}}$ be the maximum node degree of $\widetilde{G}$. Then, we have that $\phi_{\widetilde{G}}^2 / \Delta_{\max}^{\widetilde{G}} \in \Omega(\log^5 n)$ and $\Delta_{\max}^{\widetilde{G}} \in O(\log^9 n)$ with high probability. Therefore, exact recovery in polynomial time is achievable with high probability.

We emphasize the nice property of the random graphs $\widetilde{G}$ shown in Corollary 1: by adding a small perturbation (edges from the Erdős–Rényi model with small probability), we are able to obtain exact inference despite $G$ having bad properties such as being a bad expander.

Our next two examples involve complete graphs and d-regular expanders. The following corollary shows that, with high probability, exact recovery of labels for complete graphs is possible in polynomial time.

Corollary 2 (Complete graphs). Let $G = K_n$, where $K_n$ denotes the complete graph on $n$ nodes. Then, we have that $\phi_G^2 / \Delta_{\max} \in \Omega(n)$. Therefore, exact recovery in polynomial time is achievable with high probability.

Another important class of graphs that admits exact recovery is the family of d-regular expanders (Hoory et al. 2006), defined below.

Definition 6 (d-regular expander). A d-regular graph with $n$ nodes is an expander with constant $c > 0$ if, for every set $S \subset V$ with $|S| \le n/2$, $|E(S, S^C)| \ge c\, d\, |S|$.

Corollary 3 (Expander graphs). Let $G$ be a d-regular expander with constant $c$. Then, we have that $\phi_G^2 / \Delta_{\max} \in \Omega(d)$. If $d \in \Omega(\log n)$, then exact recovery in polynomial time is achievable with high probability.
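The following sketch illustrates the perturbation behind Lemma 3 and Corollary 1: starting from a 2D grid (a weak expander), add Erdős–Rényi edges with probability ε/n and observe the spectral gap, a standard spectral proxy for expansion, increase. The grid size and the value of ε are our own illustrative choices (Lemma 3 allows any ε ∈ [1, n], while Corollary 1 takes ε = log⁸ n); the code assumes numpy/networkx and is not from the paper.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def perturb(G, eps):
    """Add each absent edge independently with probability eps/n, i.e., an ER(n, eps/n) perturbation."""
    Gp = G.copy()
    nodes = list(Gp.nodes)
    n = len(nodes)
    for i in range(n):
        for j in range(i + 1, n):
            if not Gp.has_edge(nodes[i], nodes[j]) and rng.random() < eps / n:
                Gp.add_edge(nodes[i], nodes[j])
    return Gp

def lambda2(H):
    # Second smallest Laplacian eigenvalue, used here only as a spectral proxy for expansion.
    L = nx.laplacian_matrix(H).toarray()
    return np.sort(np.linalg.eigvalsh(L))[1]

G = nx.convert_node_labels_to_integers(nx.grid_2d_graph(10, 10))   # weak expander
G_tilde = perturb(G, eps=4.0)   # a gentle, constant-eps perturbation for illustration
print(G.number_of_edges(), lambda2(G))
print(G_tilde.number_of_edges(), lambda2(G_tilde), max(d for _, d in G_tilde.degree))
```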
## 5 Concluding Remarks

We considered a model in which we receive a single noisy observation for each edge and each node of a graph. Our approach consists of two stages, similar in spirit to (Globerson et al. 2015). The first stage solves only the quadratic term of the optimization problem and is based on an SDP relaxation, from which we derive the structural properties of a graph that guarantee exact recovery with high probability. Given the two solutions from the first stage, the second stage uses only the node observations and simply outputs the vector with the higher score. We showed that, for any graph $G$, the quantity $\phi_G^2 / \Delta_{\max}$ governs whether exact recovery is achievable in polynomial time. Examples include complete graphs and d-regular expanders, which are guaranteed to recover the correct labeling with high probability. Perhaps the most interesting example is related to smoothed analysis on connected graphs, where even a graph with bad properties, such as bad expansion, can still be exactly recovered by adding small perturbations (edges coming from an Erdős–Rényi model with small probability).

## Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1716609-IIS.

## References

Abbe, E., Bandeira, A. S., Bracher, A. & Singer, A. (2014), 'Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery', IEEE Transactions on Network Science and Engineering 1(1), 10–22.

Abbe, E., Bandeira, A. S. & Hall, G. (2016), 'Exact recovery in the stochastic block model', IEEE Transactions on Information Theory 62(1), 471–487.

Altun, Y. & Hofmann, T. (2003), 'Large margin methods for label sequence learning', European Conference on Speech Communication and Technology, pp. 145–152.

Barahona, F. (1982), 'On the computational complexity of Ising spin glass models', Journal of Physics A: Mathematical and General 15(10), 3241.

Bello, K. & Honorio, J. (2018), 'Learning latent variable structured prediction models with Gaussian perturbations', NeurIPS.

Boyd, S. & Vandenberghe, L. (2004), Convex Optimization, Cambridge University Press.

Boykov, Y. & Veksler, O. (2006), Graph cuts in vision and graphics: Theories and applications, in 'Handbook of Mathematical Models in Computer Vision', Springer, pp. 79–96.

Chandrasekaran, V. & Jordan, M. I. (2013), 'Computational and statistical tradeoffs via convex relaxation', Proceedings of the National Academy of Sciences, p. 201302293.

Chandrasekaran, V., Srebro, N. & Harsha, P. (2008), Complexity of inference in graphical models, in 'Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence', AUAI Press, pp. 70–78.

Cheeger, J. (1969), A lower bound for the smallest eigenvalue of the Laplacian, in 'Proceedings of the Princeton Conference in Honor of Professor S. Bochner'.

Chen, Y., Kamath, G., Suh, C. & Tse, D. (2016), Community recovery in graphs with locality, in 'International Conference on Machine Learning', pp. 689–698.

Daumé, H., Langford, J. & Marcu, D. (2009), 'Search-based structured prediction', Machine Learning 75(3), 297–325.

Foster, D., Sridharan, K. & Reichman, D. (2018), Inference in sparse graphs with pairwise measurements and side information, in 'International Conference on Artificial Intelligence and Statistics', pp. 1810–1818.

Globerson, A., Roughgarden, T., Sontag, D. & Yildirim, C. (2015), How hard is inference for structured prediction?, in 'International Conference on Machine Learning', pp. 2181–2190.

Hoory, S., Linial, N. & Wigderson, A. (2006), 'Expander graphs and their applications', Bulletin of the American Mathematical Society 43(4), 439–561.
Koo, T., Rush, A. M., Collins, M., Jaakkola, T. & Sontag, D. (2010), Dual decomposition for parsing with non-projective head automata, in 'Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing', Association for Computational Linguistics, pp. 1288–1298.

Krivelevich, M., Reichman, D. & Samotij, W. (2015), 'Smoothed analysis on connected graphs', SIAM Journal on Discrete Mathematics 29(3), 1654–1669.

Kulesza, A. & Pereira, F. (2007), 'Structured learning with approximate inference', Neural Information Processing Systems 20, 785–792.

Lafferty, J., McCallum, A. & Pereira, F. C. (2001), 'Conditional random fields: Probabilistic models for segmenting and labeling sequence data'.

Schraudolph, N. N. & Kamenetsky, D. (2009), Efficient exact inference in planar Ising models, in 'Advances in Neural Information Processing Systems', pp. 1417–1424.

Sontag, D., Choe, D. K. & Li, Y. (2012), 'Efficiently searching for frustrated cycles in MAP inference', arXiv preprint arXiv:1210.4902.

Taskar, B., Guestrin, C. & Koller, D. (2003), 'Max-margin Markov networks', Neural Information Processing Systems 16, 25–32.

Tropp, J. A. (2012), 'User-friendly tail bounds for sums of random matrices', Foundations of Computational Mathematics 12(4), 389–434.

Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. (2005), 'Large margin methods for structured and interdependent output variables', Journal of Machine Learning Research 6(Sep), 1453–1484.