# Adaptive Belief Propagation

Georgios Papachristoudis (GEOPAPA@MIT.EDU), John W. Fisher III (FISHER@CSAIL.MIT.EDU)
CSAIL, MIT, Cambridge, MA 02139, USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Graphical models are widely used in inference problems. In practice, one may construct a single large-scale model to explain a phenomenon of interest, which may be utilized in a variety of settings. The latent variables of interest, which can differ in each setting, may represent only a small subset of all variables. The marginals of the variables of interest may change after the addition of measurements at different time points. In such adaptive settings, naïve algorithms, such as standard belief propagation (BP), may perform many unnecessary computations by propagating messages over the entire graph. Here, we formulate an efficient inference procedure, termed adaptive BP (AdaBP), suitable for adaptive inference settings. We show that it gives exact results for trees in discrete and Gaussian Markov Random Fields (MRFs), and provide an extension to Gaussian loopy graphs. We also provide extensions to finding the most likely sequence of the entire latent graph. Lastly, we compare the proposed method to standard BP and to that of (Sümer et al., 2011), which tackles the same problem. We show in synthetic and real experiments that it outperforms standard BP by orders of magnitude, and we explore the settings in which it is advantageous over (Sümer et al., 2011).

## 1. Introduction

Many estimation problems can be cast as inference in graphical models, where nodes represent variables of interest and edges between them indicate dependence relations. Naïve inference may have exponential complexity in the number of variables. Message passing algorithms, such as BP (Pearl, 1982), reduce the complexity significantly. Despite the fact that BP performs exact inference
only on trees, it is often applied to loopy graphs (for which it is approximate) due to its computational efficiency.

We consider the problem of inference in large-scale models. Such models, which arise, for example, in complex spatio-temporal phenomena, may be utilized in multiple settings. It is often the case that only a subset of latent variables is of interest for different applications, and this subset may vary from instance to instance (Rosales & Jaakkola, 2005; Chechetka & Guestrin, 2010; Wick & McCallum, 2011). Additionally, the set of available measurements may vary with use or become available at different points in time; the latter is common in any sequential estimation problem. In such situations, general-purpose inference algorithms such as BP may perform many unnecessary computations when only a small subset of marginals is desired. The complexity of such approaches becomes prohibitive as the size of the graph increases, e.g., due to constant re-evaluation of messages.

Several examples fall into this category of problems. Patient monitoring provides one practical example: large-scale systems may monitor the health status of many patients, yet different physicians limit their interest to patients under their immediate care. Temperature-monitoring sensors provide data over time and space, but sensitive areas (e.g., a server room) may require more careful examination for a timely response in case of abnormal behavior. Lastly, in computational biology, the effects of mutations are explored (computational mutagenesis), with each putative mutation resulting in a very similar problem. This motivates methods for problems where measurements are added incrementally and the interest is in a subset of node marginals at a given time point, or in the MAP sequence of the full latent graph.
This is the problem of adaptive inference, where the goal is to take advantage of previously computed quantities instead of performing inference from scratch. In these cases, standard BP results in many redundant computations. Consequently, we develop an adaptive inference approach which avoids redundant computations and whose average-case performance shows significantly lower complexity compared to BP. The main idea is to send messages only between the node where a measurement has been obtained (w_ℓ) and the node whose marginal is of interest (v_ℓ).¹ The correctness of this approach is guaranteed by propagating messages between consecutive measurement nodes w_{ℓ−1}, w_ℓ at every iteration, as shown in Fig. 1. As a result, we only send the messages necessary to guarantee that the incoming messages to the node of interest v_ℓ are correct. We call this minimal messaging schedule adaptive BP. We show that it gives exact results on trees (as standard BP does) and provide an extension for Gaussian loopy graphs that still guarantees exactness in the evaluation of marginals. The proposed method requires a preprocessing step of O(N log N) time, where N is the number of latent nodes. In the worst case, when the relative distance between consecutive measurement nodes is approximately the tree diameter and the diameter is on the order of N (highly unbalanced tree), the performance is comparable to, yet still faster than, standard BP. However, for height-balanced trees, worst-case performance results in O(log N) messages per update, as compared to O(N) for standard BP. In the best case, when the distance between consecutive nodes is very small, the node marginal is obtained in constant time per iteration. We provide an extension of the method for MAP inference and for Gaussian loopy MRFs and show how it can be used to suggest nearly optimal measurement schedules.
We compare the proposed method to (Sümer et al., 2011) and examine the settings under which one approach may have advantages over the other. Lastly, we empirically demonstrate the performance of our method on a variety of synthetic datasets, as well as on two real applications.

Related Work. Sümer et al. consider the same problem in the context of factor graphs (Sümer et al., 2011), utilizing the factor elimination algorithm to evaluate node marginals (Darwiche & Hopkins, 2001). They construct a balanced representation of the elimination tree in O(|X|^{3tw} N) time, which allows for the computation of a node marginal in O(|X|^{2tw} log N), where N is the number of nodes, tw is the elimination tree width (the size of the largest clique in the chordal graph minus one) and |X| the alphabet size. However, the preprocessing step becomes prohibitive as |X| and tw grow large, thus making this method inappropriate for dense loopy graphs. For trees, the width of the elimination tree is one and the complexity of updating the model reduces to O(|X|^3 log N), as compared to O(|X| N) for standard BP. Note that they address the discrete case only. As we show later, the computational complexity is impacted significantly by not taking into account the relative distances between consecutive nodes of interest. (Wick & McCallum, 2011) consider the focused inference problem, where they propose a prioritized scheme of the Metropolis-Hastings algorithm that focuses on sampling variables from a set of interest. (Chechetka & Guestrin, 2010) examine the same problem. They create a prioritized message schedule weighted towards messages to which the set of interest is most sensitive. Their approach is limited to discrete graphs, the set of interest is fixed, and the preprocessing time depends on the number of edges and neighbors of the nodes.

¹ We will refer to w_ℓ as the measurement node for brevity.
In contrast, our proposed method extends to loopy Gaussian models, has a reduced preprocessing time, and allows for varying sets of interest.

## 2. Problem Statement

Consider a Markov Random Field (MRF) represented by a graph G = (V, E) of N latent variables, X = {X_1, ..., X_N}, whose direct dependencies are represented by the edge set E. Let us denote the neighbors of latent node X_k by N(k) and assume each latent node X_k is linked to m_k measurements {Y_{k,1:m_k}}. The set {Y_{k,1:m_k}} will be called the observation set and denoted by S_k. In addition, each X_k takes values in the alphabet X. A feedback vertex set (FVS) F is a set of nodes whose removal results in a cycle-free graph T = V \ F (a forest of trees). Obviously, F = ∅ in the case of trees. We denote by |F| = K the size of the FVS. We also call all nodes in T that are neighbors of an FVS node anchors and denote them by A; that is, A = {i | i ∈ T, i ∈ N(p), p ∈ F}. For the purpose of our analysis, we focus on discrete MRFs, but the proposed method generalizes straightforwardly to Gaussian MRFs (Weiss & Freeman, 2001). Lastly, a common assumption is that measurements are conditionally independent given X. We focus on pairwise MRFs, since it can be shown that MRFs with larger cliques can be reduced to pairwise ones (Wainwright et al., 2005). We consider three types of potentials: node potentials of latent variables, φ_k^{(0)}(x_k); pairwise potentials between latent and observed variables, χ_{kℓ}(x_k, y_ℓ); and pairwise potentials between latent variables, ψ_{ij}(x_i, x_j). In the Gaussian case, we assume that the variables follow a multivariate Gaussian with parameters h, J, where h is the full potential vector and J the full precision (information) matrix. Absence of an edge between two nodes i, j implies that J_{ij} = 0 and vice versa. We are interested in problems where one measurement is added at a time and only one or a few marginals are of interest at any point. The total number of available measurements is M = Σ_{k=1}^N m_k.
Lastly, we are given a measurement plan (order) w = {w_1, ..., w_M} = {w_{1:M}}, which provides the order in which measurements are taken from each set. That is, a measurement is obtained from set S_{w_1}, then from S_{w_2}, and so on. We call the marginal order, v = {v_{1:M}}, the sequence of latent nodes whose marginal is of interest at each step.

Belief Propagation. Belief propagation is a message passing algorithm where two messages are propagated on each edge (i, j), one in each direction. A message from node i to node j essentially contains all the information from the subtree rooted at i. In its serial version, an arbitrary node is chosen as a root, messages are passed from the leaves to the root, and then back to the leaves. In the discrete case, messages and the marginal of interest are computed as

$$m_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i) \quad (1)$$

$$p_{X_i}(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i), \quad (2)$$

while in the Gaussian case as

$$h_{i \to j} = -J_{ji} J_{i \setminus j}^{-1} h_{i \setminus j}, \qquad J_{i \to j} = -J_{ji} J_{i \setminus j}^{-1} J_{ij} \quad (3)$$

$$\hat{h}_i = h_i + \sum_{k \in N(i)} h_{k \to i}, \qquad \hat{J}_i = J_{ii} + \sum_{k \in N(i)} J_{k \to i}, \quad (4)$$

where $h_{i \setminus j} = h_i + \sum_{k \in N(i) \setminus j} h_{k \to i}$ and $J_{i \setminus j} = J_{ii} + \sum_{k \in N(i) \setminus j} J_{k \to i}$. Belief propagation is a dynamic programming algorithm that takes advantage of the tree structure to avoid recomputing recurring quantities. The complexity of sending a message in the discrete case is O(|X|²), where |X| is the alphabet size; it is O(d³) in the Gaussian case, where d is the dimension of a latent node. The overall complexity of the serial version is O(N|X|²).

Updating node potentials. Observed nodes are essentially absorbed into the node potential of the latent node they link to. If a measurement Y_{w_ℓ,u} = y_u linking to node X_{w_ℓ} is added at iteration ℓ, it updates the potential of node X_{w_ℓ} as

$$\phi_{w_\ell}^{(\ell)}(x_{w_\ell}) = \phi_{w_\ell}^{(\ell-1)}(x_{w_\ell})\, \chi_{w_\ell u}(x_{w_\ell}, y_u),$$

where $\phi_{w_\ell}^{(\ell-1)}(\cdot)$ is the node potential before the incorporation of measurement y_u. Updating the node potential requires O(|X|) time. In the Gaussian case, incorporating a measurement at node w_ℓ is equally straightforward.
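The serial sum-product schedule of Eqs. (1)-(2) can be sketched in a few lines. The following is a minimal, unnormalized-message sketch assuming tabular potentials stored as NumPy arrays (the function and argument names are ours, not part of the paper):

```python
import numpy as np

def serial_bp(adj, node_pot, edge_pot):
    """Serial sum-product BP on a tree (sketch of Eqs. 1-2).

    adj      : dict node -> list of neighbours
    node_pot : dict node -> (|X|,) array, phi_i
    edge_pot : dict (i, j) -> (|X|, |X|) array, psi_ij indexed [x_i, x_j]
               (edge_pot[(j, i)] must equal edge_pot[(i, j)].T)
    Returns the normalized marginals p_{X_i}.
    """
    root = next(iter(adj))
    # BFS ordering gives a leaves-to-root ("upward") edge sequence
    order, parent = [root], {root: None}
    for u in order:
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                order.append(v)
    upward = [(v, parent[v]) for v in reversed(order) if parent[v] is not None]
    schedule = upward + [(j, i) for (i, j) in reversed(upward)]  # then root-to-leaves

    msgs = {}
    def msg(i, j):
        # m_{i->j}(x_j) = sum_{x_i} phi_i psi_ij prod_{k in N(i)\j} m_{k->i}
        prod = node_pot[i].copy()
        for k in adj[i]:
            if k != j:
                prod *= msgs[(k, i)]
        return edge_pot[(i, j)].T @ prod

    for (i, j) in schedule:
        msgs[(i, j)] = msg(i, j)

    marg = {}
    for i in adj:
        b = node_pot[i].copy()
        for k in adj[i]:
            b *= msgs[(k, i)]  # Eq. (2): product over ALL neighbours
        marg[i] = b / b.sum()
    return marg
```

In practice one would normalize each message to avoid under/overflow; it is omitted here since normalization cancels in Eq. (2).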
It becomes clear that all the information about observed nodes is easily absorbed into the latent node potentials. Thus, we will henceforth only consider the graph of latent nodes.

## 3. Method Description

For the purposes of analysis, we delay the discussion of general Gaussian MRFs until Sec. 5. We consider trees here and later show an extension to Gaussian loopy graphs. As a reminder, we obtain one measurement at every step and are interested in finding the marginal at a given node. We show in the supplement how this can be extended to multiple measurements or nodes of interest at a time. The measurement order w = {w_1, ..., w_M} is the order in which measurements are obtained, while the marginal order v = {v_1, ..., v_M} determines the marginals of interest at any time. Again, the key idea is to propagate messages along the paths (w_{ℓ−1}, w_ℓ) and (w_ℓ, v_ℓ), for every ℓ. The path between any two nodes can be determined trivially if their lowest common ancestor (lca) is known. The problem of finding the lca of two nodes is well studied and is related to the famous Range Minimum Query (RMQ) problem, which returns the index of the minimum element between two specified indices of an array A. The algorithm discussed in (Harel & Tarjan, 1984) requires building a structure M of size N × L, where L = ⌊log₂ N⌋ + 1, which returns the index of the minimum array element between two specified indices of A in constant time. It is extremely well suited to problems with a large number of queries, R, where R ≫ N, since it is linear in R. It turns out that the LCA problem can be reduced to the RMQ problem, as shown below (Czumaj et al., 2007; Fischer & Heun, 2007).

Lowest Common Ancestor. For a specified root, each node is labeled in a breadth-first manner. That is, the root is assigned label 1, and all other nodes are labeled accordingly in a top-down, left-to-right approach (cf. Fig. 1). As we mentioned above, the index of the minimum element in the subarray A[i ... j] is provided in constant time by building the RMQ structure. Now, suppose we recover the Euler tour E of the tree starting from the root. As a reminder, the Euler tour of a strongly connected, directed graph G is a cycle that traverses each edge of G exactly once, although it may visit a node more than once (Cormen et al., 2009). Since we are dealing with undirected graphs here, we assume for the purposes of analysis that each undirected edge is equivalent to two directed edges of opposing direction. The number of edges entering or exiting a node is called the in-degree or out-degree, respectively. Since by construction each node has equal out- and in-degree, the Euler tour is always a cycle; that is, it starts and ends on the same node (here, the root). If we denote by H the vector which stores the index of the first occurrence of each node in E, the lca of nodes w_{ℓ−1}, w_ℓ lies somewhere in E[H_{w_{ℓ−1}}, ..., H_{w_ℓ}], due to the way the Euler tour is constructed (in a depth-first manner). Since the nodes are labeled in a breadth-first manner, the lca of w_{ℓ−1}, w_ℓ is the node with the smallest label, and hence the smallest depth, in the range E[H_{w_{ℓ−1}}, ..., H_{w_ℓ}]. It becomes apparent that we need to introduce a vector De which stores the depths of the corresponding nodes in the Euler tour. For example, the depth of the first node in the Euler tour is [De]₁ = 0, because it is the root. Since the lca of w_{ℓ−1}, w_ℓ is the node with the smallest depth in E[H_{w_{ℓ−1}}, ..., H_{w_ℓ}], the index of the minimum element of the subarray De[H_{w_{ℓ−1}}, ..., H_{w_ℓ}] gives us lca(w_{ℓ−1}, w_ℓ). It remains to build a matrix M that provides answers to queries of the type arg min De[H_{w_{ℓ−1}}, ..., H_{w_ℓ}] in constant time.
The size of this matrix is N × L, where L = ⌊log₂ N⌋ + 1, while element [M]_{i,j} represents the index of the minimum element of the subarray of De that starts at i and has length 2^{j−1}:

$$[M]_{i,j} = \begin{cases} [M]_{i,j-1}, & [De]_{[M]_{i,j-1}} \le [De]_{[M]_{r,j-1}} \\ [M]_{r,j-1}, & \text{otherwise,} \end{cases}$$

where i = 1, ..., N, j = 2, ..., L, r = min{i + 2^{j−2}, N} and [M]_{i,1} = i. The absolute index of the minimum value of De[i, ..., j] is recovered in constant time as

$$\mathrm{RMQ}_{De}(i,j) = \begin{cases} [M]_{i,k+1}, & [De]_{[M]_{i,k+1}} \le [De]_{[M]_{s,k+1}} \\ [M]_{s,k+1}, & \text{otherwise,} \end{cases}$$

where k = ⌊log₂(j − i + 1)⌋ and s = j − 2^k + 1. The lca of w_{ℓ−1} and w_ℓ is simply

$$\mathrm{lca}(w_{\ell-1}, w_\ell) = \begin{cases} E_{\mathrm{RMQ}_{De}(H_{w_{\ell-1}},\, H_{w_\ell})}, & H_{w_{\ell-1}} < H_{w_\ell} \\ E_{\mathrm{RMQ}_{De}(H_{w_\ell},\, H_{w_{\ell-1}})}, & \text{otherwise.} \end{cases} \quad (5)$$

Adaptive BP. With careful inspection, we observe that after the incorporation of a measurement at node X_{w_ℓ}, evaluating the messages along the unique path from node w_ℓ to node v_ℓ is sufficient for determining node v_ℓ's marginal. This is the key point of the adaptive BP algorithm. The above procedure is guaranteed to give the correct marginals along this path as long as all the incoming messages to node w_ℓ are correct. This holds if we additionally propagate messages from w_{ℓ−1} to w_ℓ at every iteration. The algorithm is described as follows. During initialization, we initialize all node potentials and propagate messages along the entire graph in both directions. At this point, as a new measurement arrives from set S_{w_1}, the messages from w_1 to v_1 are computed. This way, the marginals of the nodes on the path that connects w_1, v_1 (incl. w_1, v_1) are correctly updated. Then, we propagate messages from w_1 to w_2, update the node potential of X_{w_2} and send messages from w_2 to v_2. We continue with this procedure for each ℓ. If w_ℓ = w_{ℓ−1}, no messages are propagated from w_{ℓ−1} to w_ℓ, while if w_ℓ = v_{ℓ−1}, only the node potential of X_{w_ℓ} is updated. Obviously, the path from node w_{ℓ−1} to w_ℓ is directly related to lca(w_{ℓ−1}, w_ℓ); similarly for the pair (w_ℓ, v_ℓ).
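The Euler-tour/RMQ preprocessing above can be sketched compactly. The following is an illustrative implementation (0-indexed, with names of our choosing): it records the Euler tour E, the depth array De, the first occurrences H, builds the sparse table level by level, and answers lca and distance queries in constant time after O(N log N) preprocessing:

```python
def build_lca(adj, root):
    """Preprocess a tree for O(1) lca queries via the RMQ reduction.
    adj: dict node -> list of neighbours. Returns (lca, dist) closures."""
    E, De, H, depth = [root], [0], {root: 0}, {root: 0}
    stack = [(root, None, iter(adj[root]))]
    while stack:                      # iterative depth-first Euler tour
        u, par, it = stack[-1]
        child = next((v for v in it if v != par), None)
        if child is None:
            stack.pop()
            if stack:                 # record the return to the parent
                p = stack[-1][0]
                E.append(p); De.append(depth[p])
            continue
        depth[child] = depth[u] + 1
        H[child] = len(E)             # first occurrence of child in E
        E.append(child); De.append(depth[child])
        stack.append((child, u, iter(adj[child])))

    # sparse table: M[j][i] = index of the minimum of De[i : i + 2^j]
    n = len(De)
    M = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, cur = M[-1], []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + (1 << (j - 1))]
            cur.append(a if De[a] <= De[b] else b)
        M.append(cur)
        j += 1

    def lca(u, v):
        i, j2 = sorted((H[u], H[v]))
        k = (j2 - i + 1).bit_length() - 1      # floor(log2(range length))
        a, b = M[k][i], M[k][j2 - (1 << k) + 1]
        return E[a if De[a] <= De[b] else b]

    def dist(u, v):                            # D_u + D_v - 2 D_lca(u,v)
        return depth[u] + depth[v] - 2 * depth[lca(u, v)]

    return lca, dist
```

Each query inspects two overlapping power-of-two blocks of De, so no loop is needed at query time.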
Therefore, at every iteration, we need to determine the lcas of these two pairs, which is accomplished in constant time via the reduction to the RMQ problem. Once we find the lca of the pair (w_{ℓ−1}, w_ℓ), we can trivially determine the directed path from w_{ℓ−1} to w_ℓ by traversing from w_{ℓ−1} up to lca(w_{ℓ−1}, w_ℓ) and then down to w_ℓ. We denote the messages on this path by M(w_{ℓ−1} → w_ℓ). Similarly, we denote the messages of the directed path from w_ℓ to v_ℓ by M(w_ℓ → v_ℓ). Note that both of the above schedules contain only the single-direction messages from one node to another. The update is done in the same manner as in the serial version of BP; that is, we propagate messages from w_{ℓ−1} to lca(w_{ℓ−1}, w_ℓ) and then down to w_ℓ. The procedure is the same for the pair (w_ℓ, v_ℓ). A flow of the algorithm and a detailed description are provided in Fig. 1 and Alg. 1, respectively.

Figure 1. (a) Messages from w_{ℓ−1} to w_ℓ. (b) Messages from w_ℓ to v_ℓ. The bold node in black represents the current measurement node w_ℓ, while the bold node in gray represents the previous measurement node w_{ℓ−1}. The node marked with a question mark represents the node of interest v_ℓ. (a) In the first phase of an iteration, we propagate messages from w_{ℓ−1} to w_ℓ (depicted in purple). (b) In the second phase, we propagate messages from w_ℓ to v_ℓ (depicted in black) after we have updated the node potential at w_ℓ, which has changed due to the addition of a new measurement.

Messages are updated as

$$m_{i \to j}(x_j) = \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i), \quad \forall (i,j) \in M(w_{\ell-1} \to w_\ell) \cup M(w_\ell \to v_\ell),$$

while the marginal of the node of interest v_ℓ is computed from Eq. (2). If the latent graph is a chain, there is no need to find the lca: we simply propagate from w_{ℓ−1} to w_ℓ, update the node potential of w_ℓ and propagate messages to v_ℓ.

Proposition 1. Alg. 1 correctly updates the messages in path M(w_{ℓ−1} → w_ℓ), ∀ℓ. (Proofs of this and the subsequent corollaries are in the supplement.)

Corollary 1. Alg. 1 correctly updates the messages in path M(w_ℓ → v_ℓ), ∀ℓ.

Corollary 2.
Alg. 1 provides the exact marginals of all nodes in path M(w_ℓ → v_ℓ), ∀ℓ.

Algorithm 1 ADAPTIVE BP
- Preprocessing: Determine the Euler tour E, the depths De of the elements in the Euler tour, the vector H which stores the index of the first occurrence of node i in E, and the matrix M which stores the index of the minimum value of the subarray of De starting at i and having length 2^{j−1}.
- Initialization: Initialize the node and pairwise potentials and the messages.
- Iteration: for ℓ = 1, ..., M do
  - Find lca(w_{ℓ−1}, w_ℓ) from Eq. (5).
  - Determine the schedule M(w_{ℓ−1} → w_ℓ).
  - Compute the messages m_{i→j}(x_j) in M(w_{ℓ−1} → w_ℓ).
  - Update the node potential at X_{w_ℓ}.
  - Find lca(w_ℓ, v_ℓ) from Eq. (5).
  - Determine the schedule M(w_ℓ → v_ℓ).
  - Compute the messages m_{i→j}(x_j) in M(w_ℓ → v_ℓ).
  - Compute the marginal of interest p_{X_{v_ℓ}}(x_{v_ℓ}).
- end for

Complexity. If the depth of each node (D) is not known in advance, it can be retrieved in O(N) time in a depth-first pass. Similarly, the Euler tour is retrievable in linear time, and the same holds for the vectors De and H. Lastly, the creation of the matrix M takes O(N log₂ N) time and space. Therefore, the overall complexity of preprocessing is O(N log₂ N). For adaptive BP, we only need to send messages along the directed paths M(w_{ℓ−1} → w_ℓ) and M(w_ℓ → v_ℓ). The number of messages to be sent at step ℓ is dist(w_{ℓ−1}, w_ℓ) + dist(w_ℓ, v_ℓ).² The overall complexity is O(Σ_{ℓ=1}^M (dist(w_{ℓ−1}, w_ℓ) + dist(w_ℓ, v_ℓ)) |X|²). Compare this with standard BP, where 2(N − 1) messages are sent at each iteration, resulting in an overall complexity of O(mN²|X|²), assuming that the number of measurements from each set is the same, m_k = m, ∀k. As we see, the complexity of adaptive BP directly depends on the context of the measurement and marginal orders, while standard BP has a fixed cost per iteration. We now analyze the worst-, best- and average-case complexity of adaptive BP for balanced and unbalanced trees.
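The per-iteration schedules of Alg. 1 reduce to extracting two directed paths through an lca. A minimal sketch, given a rooted representation of the tree (the helper names are ours): the empty path falls out naturally when w_ℓ = w_{ℓ−1}, matching the special case noted above.

```python
def directed_path(u, v, parent, depth):
    """Directed edges of the unique tree path u -> v, routed through lca(u, v).
    Sketch of the schedule M(u -> v); parent/depth come from a rooted tree."""
    up, down = [], []
    a, b = u, v
    while depth[a] > depth[b]:                 # climb from u towards the lca
        up.append((a, parent[a])); a = parent[a]
    while depth[b] > depth[a]:                 # climb from v, noting downward edges
        down.append((parent[b], b)); b = parent[b]
    while a != b:                              # climb both sides until they meet
        up.append((a, parent[a])); a = parent[a]
        down.append((parent[b], b)); b = parent[b]
    return up + down[::-1]

def adabp_schedule(w, v, parent, depth):
    """Messages sent at AdaBP step l: M(w_{l-1} -> w_l), then M(w_l -> v_l)."""
    sched = []
    for l in range(len(w)):
        step = []
        if l > 0:
            step += directed_path(w[l - 1], w[l], parent, depth)
        # (the node potential at w[l] would be updated here)
        step += directed_path(w[l], v[l], parent, depth)
        sched.append(step)
    return sched
```

The number of edges returned for step ℓ is exactly dist(w_{ℓ−1}, w_ℓ) + dist(w_ℓ, v_ℓ), the quantity appearing in the complexity bound above.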
In the worst case, when the tree is highly unbalanced (tree diameter on the order of N) and the relative distances between (w_{ℓ−1}, w_ℓ), (w_ℓ, v_ℓ) are comparable to the diameter for all ℓ, we need to transmit O(N) messages at every iteration. In this case, the number of messages to be sent is of the same order as in standard BP. If, instead, the latent graph is a balanced tree with each node having approximately q children, O(log_q N) messages are propagated at every iteration in the worst case. In the best-case scenario, if w_{ℓ−1}, w_ℓ, v_ℓ are close to each other (e.g., parent-child or siblings) for every ℓ, then only one or two messages are propagated at every iteration, which reduces the overall complexity to just O(mN|X|²). As expected, when there is a small distance between the pairs of nodes (w_{ℓ−1}, w_ℓ), (w_ℓ, v_ℓ), the complexity is substantially reduced. Complexity depends only on the relative distance between consecutive terms. Structure only comes into consideration in the worst case, when the relative distances between (w_{ℓ−1}, w_ℓ) and (w_ℓ, v_ℓ) are consistently comparable to the tree diameter.

² The distance between nodes w, v is the length of the path connecting them and equals dist(w, v) = D_v + D_w − 2D_{lca(w,v)}.

## 4. Extension to Max-Product

In the case of max-product, we replace the sum with a max and introduce a new type of message, called delta messages, that will be used for the recovery of the MAP sequence. A delta message indicates the value of the source node that corresponds to the MAP sequence of the subtree rooted at the source node (excl. the branch containing the target node) for a specific value of the target node. In order to recover the MAP sequence, we need to propagate delta messages from w_{ℓ−1} to w_ℓ and then backtrack from w_ℓ down to the leaves (considering w_ℓ as the root). In general, obtaining the MAP sequence is a linear operation in the number of nodes. However, local changes in node potentials might induce only small changes in the MAP sequence.
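To make the role of delta messages concrete, the following sketch runs max-product on a chain: the forward pass keeps argmax backpointers (the delta messages, here for a chain rather than a general tree), and backtracking from the root recovers the MAP sequence. The names and the chain restriction are ours, for illustration only:

```python
import numpy as np

def chain_map(node_pot, edge_pot):
    """Max-product on a chain with delta (argmax) backpointers.

    node_pot : list of (|X|,) arrays, phi_t
    edge_pot : list of (|X|, |X|) arrays, psi_{t,t+1} indexed [x_t, x_{t+1}]
    Returns the MAP assignment as a list of states.
    """
    n = len(node_pot)
    msg = np.ones_like(node_pot[0])
    delta = []                       # delta[t][x_{t+1}] = best x_t given x_{t+1}
    for t in range(n - 1):
        scores = (node_pot[t] * msg)[:, None] * edge_pot[t]   # (x_t, x_{t+1})
        delta.append(scores.argmax(axis=0))
        msg = scores.max(axis=0)
    # backtrack from the last node; in AdaMP only subtrees whose maximizing
    # value changed (dirty delta messages) would need to be revisited
    x = [int((node_pot[-1] * msg).argmax())]
    for t in range(n - 2, -1, -1):
        x.append(int(delta[t][x[-1]]))
    return x[::-1]
```

On a tree, the same backpointers are kept per directed edge, and only the ones on M(w_{ℓ−1} → w_ℓ) change at iteration ℓ, which is what the selective backtracking below exploits.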
We should note that the only delta messages pointing towards the root w_ℓ that change are the ones on the path M(w_{ℓ−1} → w_ℓ), which are correctly updated during iteration ℓ. A visualization of the algorithm is provided in the supplement. This observation can help us recover the MAP sequence in a more efficient way. We create a sparse indicator matrix, whose rows represent the source and whose columns represent the target of a delta message. We assign value 1 to any delta message that became dirty (changed) in the most recent iteration, that is, every message in M(w_{ℓ−1} → w_ℓ). Therefore, when we backtrack from w_ℓ down to the leaves, we must consider the effect that these changed messages have on the MAP sequence. Nevertheless, if a node's maximizing value remains the same (compared with the previous iteration), then the MAP subsequences of the subtrees rooted at the neighbors of this node will also remain the same, since the delta messages from these neighbors to the node stayed intact. Because of that, there is no need to backtrack further down into the subtrees of a node's neighbors if this node's maximizing value did not change between consecutive iterations.

## 5. Extension to Gaussian Loopy MRFs

Adaptive BP can be extended to Gaussian loopy graphs by using the Feedback Message Passing (FMP) algorithm of (Liu et al., 2012). Their algorithm provides a way to evaluate the exact means and variances in loopy Gaussian MRFs. In the first step, an FVS F is determined by one of the existing algorithms. Then, the exact means and variances are obtained in two rounds. In the first round, BP runs on {h_T, J_T}, where h_T, J_T correspond to the part of the full potential vector and the block of the full information matrix that contain only nodes in T. We also run BP |F| = K more times with parameters {h^p, J_T}, p ∈ F, where h^p = J_{T,p}. In other words, h^p is the part of the column of J that corresponds to node p, restricted to the rows that correspond to nodes in T.
This provides us with partial means and variances μ̂^T_i, Σ̂^T_{ii}, i ∈ T, as well as feedback gains g^p_i, i ∈ T, p ∈ F, that will be used in the second round of the method. In the second round, the means and variances on the FVS F are evaluated by inverting Ĵ_F, and the exact means and variances are retrieved by running BP one more time and adding correction terms to the estimated variances from the first round, Σ̂^T_{ii} (see Alg. 2 for more details).

Algorithm 2 FEEDBACK MESSAGE PASSING (FMP)
1. Construct K potential vectors: h^p = J_{T,p}, ∀p ∈ F.
2. Run BP K + 1 times on T with parameters {h_T, J_T}, {h^p, J_T}, p ∈ F, which produces messages h^T_{i→j}, h^p_{i→j}, J^T_{i→j} and marginals μ̂^T_i, g^p_i, Σ̂^T_{ii}.
3. Obtain the size-K graph with updated parameters ĥ_F, Ĵ_F as
$$[\hat{J}_F]_{pq} = J_{pq} - \sum_{i \in N(p) \cap T} J_{pi}\, g^q_i, \quad \forall p, q \in F \quad (6)$$
$$[\hat{h}_F]_p = h_p - \sum_{i \in N(p) \cap T} J_{pi}\, \hat{\mu}^T_i, \quad \forall p \in F \quad (7)$$
and solve for Σ_F = Ĵ_F^{-1} and μ_F = Σ_F ĥ_F.
4. Revise the potential vector in T as h_i = h_i − Σ_{j ∈ N(i) ∩ F} J_{ij} [μ_F]_j, ∀i ∈ T, and obtain the exact means by running BP one more time on the revised potential vector (the corresponding messages are denoted by h̃^T_{i→j}).
5. Correct the variances with
$$\Sigma_{ii} = \hat{\Sigma}^T_{ii} + \sum_{p, q \in F} g^p_i\, [\Sigma_F]_{pq}\, g^q_i, \quad \forall i \in T. \quad (8)$$

After briefly outlining FMP, we move on to describe the extension of AdaBP to Gaussian loopy graphs. As a reminder, once we determine the FVS F, the remaining graph T is a tree. Let us denote by w^T_ℓ the node from T where a measurement has been obtained most recently:

$$w^T_\ell = \begin{cases} w_\ell, & \text{if } w_\ell \in T \\ w^T_{\ell-1}, & \text{otherwise.} \end{cases}$$

Let us further denote by T^w_ℓ and T^v_ℓ the subtrees of T rooted at w^T_ℓ and v_ℓ, respectively, with the nodes in A as their leaves. Depending on the size of the anchor set A and the allocation of its nodes inside T, the subtrees T^w_ℓ, T^v_ℓ can be much smaller than T: |T^w_ℓ|, |T^v_ℓ| ≪ |T|. As we see from steps 4 and 5 of Alg. 2, the evaluation of the marginal at v_ℓ requires knowledge of μ_F, Σ_F, which in turn require knowledge of the partial means μ̂^T_i and feedback gains g^p_i at the anchors A, p ∈ F. The quantities μ̂^T_i, g^p_i, i ∈ A will be correct as long as the messages between w_ℓ and A are correct, which is guaranteed by sending messages between consecutive measurement nodes, as we did in the tree case. The difference in the loopy case is that, since some measurement nodes might belong to the FVS, we should always propagate messages from the most recent measurement node in T (that is, w^T_ℓ) to the next measurement node w_ℓ ∈ T to ensure consistency. If w_ℓ ∉ T, propagation is not necessary. For the evaluation of v_ℓ's marginal, we further need to propagate from w^T_ℓ to all nodes in A, which ensures the correctness of all incoming messages to the nodes in A. After that, the potential vector ĥ_F and information matrix Ĵ_F are updated correctly from Eqs. (6), (7), which leads to the correct mean and covariance μ_F, Σ_F of the FVS F. If v_ℓ ∈ F, we retrieve the marginal from μ_F, Σ_F as [μ_F]_{v_ℓ}, [Σ_F]_{v_ℓ v_ℓ}. Otherwise, we revise the potential vectors according to Step 4 of Alg. 2 and propagate the messages h̃^T_{i→j} and h^p_{i→j} from w^T_ℓ to v_ℓ. Tables 1, 2, 3 summarize the messaging protocol at every iteration ℓ. A more detailed discussion is provided in the supplement.

Table 1. Messages between w_{ℓ−1}, w_ℓ (w_{ℓ−1} ∈ {F, T}): send J^T_{i→j}, h^T_{i→j}, h^p_{i→j} in M(w^T_{ℓ−1} → w_ℓ).

Table 2. First-round messages between w_ℓ, v_ℓ (w_ℓ ∈ {F, T}): send J^T_{i→j}, h^T_{i→j}, h^p_{i→j} in M(w^T_ℓ → A).

Table 3. Second-round messages between w_ℓ, v_ℓ (w_ℓ ∈ {F, T}): send h̃^T_{i→j} in M(A → v_ℓ) and h^p_{i→j} in M(w^T_ℓ → v_ℓ).

In terms of complexity, we first need to determine the FVS. Even though finding the minimum FVS is NP-complete, there exist approximation algorithms that find an FVS of size comparable to the optimal one. For example, (Bafna et al., 1999) provide a 2-approximation which runs in O(min{|E| log N, N²}) time.
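The K + 1 BP runs over T compute, in effect, products with J_T^{-1}, so Eqs. (6)-(8) amount to a Schur complement of J onto F. The sketch below checks this with dense linear algebra in place of the tree BP sweeps, on a toy 5-node loopy model of our own choosing (in the actual algorithm each `solve` over the T block would instead be a BP sweep over the tree):

```python
import numpy as np

# Toy loopy Gaussian MRF: a 4-cycle 0-1-2-3 plus node 4 hanging off node 1.
# Removing node 0 (the FVS, F = {0}) leaves a tree T = {1, 2, 3, 4}.
J = np.array([[3.0, 0.6, 0.0, 0.5, 0.0],
              [0.6, 3.0, 0.4, 0.0, 0.3],
              [0.0, 0.4, 3.0, 0.7, 0.0],
              [0.5, 0.0, 0.7, 3.0, 0.0],
              [0.0, 0.3, 0.0, 0.0, 3.0]])
h = np.array([1.0, -0.5, 2.0, 0.0, 1.5])
F, T = [0], [1, 2, 3, 4]

JT, hT = J[np.ix_(T, T)], h[T]
mu_hat = np.linalg.solve(JT, hT)            # partial means (round 1 of FMP)
g = np.linalg.solve(JT, J[np.ix_(T, F)])    # feedback gains g^p, one column per p

# Eqs. (6)-(7): updated FVS parameters (the Schur complement of J onto F)
J_F_hat = J[np.ix_(F, F)] - J[np.ix_(F, T)] @ g
h_F_hat = h[F] - J[np.ix_(F, T)] @ mu_hat
Sigma_F = np.linalg.inv(J_F_hat)
mu_F = Sigma_F @ h_F_hat

# Step 4: revised potentials on T give the exact means in T
mu_T = np.linalg.solve(JT, hT - J[np.ix_(T, F)] @ mu_F)

# Eq. (8): correct the tree variances with the feedback gains
Sigma_T_hat = np.diag(np.linalg.inv(JT))
Sigma_ii = Sigma_T_hat + np.einsum('ip,pq,iq->i', g, Sigma_F, g)
```

Comparing `mu_F`, `mu_T` and `Sigma_ii` against the directly inverted model confirms that the corrections of steps 3-5 recover the exact marginals.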
At every iteration, we need to send (K + 2) dist(w^T_{ℓ−1}, w_ℓ) messages between w^T_{ℓ−1} and w_ℓ, if w_ℓ ∈ T, and (K + 2)(|T^w_ℓ| − 1) messages between w^T_ℓ and the nodes in A. If, in addition, v_ℓ ∈ T, the propagation of (|T^v_ℓ| − 1) h̃^T_{i→j} messages between the anchors A and v_ℓ is necessary, plus K dist(w^T_ℓ, v_ℓ) h^p_{i→j} messages from w^T_ℓ to v_ℓ. Therefore, we send O(K(dist(w^T_{ℓ−1}, w_ℓ) + dist(w^T_ℓ, v_ℓ) + |T^w_ℓ|) + |T^v_ℓ|) messages per iteration. Compare this to the O(K|T|) messages per iteration of standard FMP. To understand the difference in complexity, let us assume for the sake of exposition that |T^w_ℓ| ≫ dist(w^T_{ℓ−1}, w_ℓ), dist(w^T_ℓ, v_ℓ), |T^v_ℓ|. This means that the complexity of adaptive BP is O(K|T^w_ℓ|), which results in a speedup on the order of O(|T|/|T^w_ℓ|), since it always holds that |T^w_ℓ| ≤ |T|. Therefore, adaptive BP is consistently faster than standard FMP.

## 6. Experiments

Henceforth, we refer to the proposed algorithm as AdaBP, the method of (Sümer et al., 2011) as RCTreeBP, and standard BP as BP. We use a publicly available version of RCTreeBP. In addition, when we use the term "consecutive elements", we mean consecutive measurement elements w_{ℓ−1}, w_ℓ and concurrent measurement and marginal elements w_ℓ, v_ℓ.

Figure 2. Comparison of the total running times of AdaBP against standard BP (gray) and RCTreeBP (black) over different alphabet sizes, |X| ∈ {2, 10}. (a) The distance between consecutive elements E[dist(w_{ℓ−1}, w_ℓ)] is unconstrained. (b) E[dist(w_{ℓ−1}, w_ℓ)] ≤ |X| log N. For average distance E[dist(w_{ℓ−1}, w_ℓ)] smaller than |X| log N, AdaBP is 1.3-4.7× faster than RCTreeBP.
Recall that updates per iteration in RCTreeBP have complexity O(|X|^3 log N) (for trees), while the complexity is O(|X|^2 (dist(w_{ℓ-1}, w_ℓ) + dist(w_ℓ, v_ℓ))) for AdaBP. Our experiments demonstrate that AdaBP is consistently orders of magnitude faster than standard BP (except in the worst case), and outperforms RCTreeBP when the average distance between consecutive elements is less than |X| log N (see Fig. 2(b)). Conversely, if the tree diameter is much greater than |X| log N and the average distance between consecutive elements is comparable to the tree diameter, AdaBP yields worse performance than RCTreeBP.

We consider the following synthetic experiment, where we construct unbalanced trees of sizes N ∈ {10, 10^2, 10^3, 10^4}. We repeat the above procedure R = 10 times for each N, randomly constructing a new tree each time. For each tree, we randomly generate different w orders of size N, and for simplicity of analysis we set v = w, so that only the distance between consecutive measurement nodes affects the computation. Figs. 2(a) and 2(b) compare the ratios of running times of AdaBP against standard BP and RCTreeBP (different rows correspond to different alphabet sizes). In all cases, both AdaBP and RCTreeBP significantly outperform standard BP. Fig. 2(a) considers the case of randomly generated w. When there is no restriction on the distance between consecutive elements, AdaBP and RCTreeBP are comparable. However, for average distance between consecutive elements less than |X| log N, AdaBP is 1.3-4.7 times faster than RCTreeBP.

Figs. 3(a) and 3(b) consider the worst- and best-case performance of AdaBP, respectively. In the former, we generate several different instances of a Markov chain of varying sizes and construct the measurement and marginal orders, w and v, such that there is at least 2N/3 distance between consecutive elements. In the latter case, we consider different instances of a star graph (tree diameter: 2) of varying sizes and randomly create measurement and marginal orders (which by construction do not have consecutive elements more than 2 nodes apart). As expected, in Fig. 3(a), RCTreeBP outperforms AdaBP for worst-case w (that is, when there is large distance between consecutive elements), yet AdaBP still outperforms BP by a factor of 2-4. However, in Fig. 3(b) we see that AdaBP is 4-49 times faster than RCTreeBP and up to a thousand times faster than BP.

Figure 3. (a) Worst case: E[dist(w_{ℓ-1}, w_ℓ)] is O(N). The distance between consecutive elements is on the order of N; AdaBP is comparable to standard BP (still being 2-4 times faster) and orders of magnitude slower than RCTreeBP. (b) Best case: E[dist(w_{ℓ-1}, w_ℓ)] = 2. Consecutive elements are very close to each other; only a constant number of updates is required per step for AdaBP.

Next, we consider application of AdaMP (MP denotes max-product) to biological data. Specifically, we explore the effects of pointwise mutations in DNA sequences on the birth or disappearance of CpG islands. CpG islands are regions of DNA with a high percentage of cytosine (C) nucleotides occurring next to guanine (G) nucleotides, and are believed to be responsible for upstream gene regulation. Usually, CpG island detection is modeled as an HMM problem where the hidden nodes are binary variables which indicate the presence (or absence) of a CpG region and the observed variables correspond to the observed DNA sequence comprised of the four nucleotides {A, T, C, G}. The goal is to find the MAP sequence (CpG regions) that best explains the observed data (DNA sequence). In computational mutagenesis, changes in the location of CpG islands due to mutations in the DNA sequence are of interest (Acar et al., 2009).
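To make the HMM formulation concrete, here is a minimal max-product (Viterbi) decoder for a binary CpG model of the kind described above. This is an illustration only: it is standard Viterbi, not the paper's AdaMP (which updates the MAP sequence incrementally after each mutation), and the probabilities are invented placeholders, not trained values.

```python
# Minimal Viterbi (max-product) decoder for a binary CpG HMM.
# Hidden states: 0 = background, 1 = CpG island. Emissions: {A, T, C, G}.
# All probabilities are made-up placeholders for illustration.
import math

STATES = [0, 1]
NUC = {'A': 0, 'T': 1, 'C': 2, 'G': 3}

init  = [math.log(0.9), math.log(0.1)]
trans = [[math.log(0.99), math.log(0.01)],   # background -> {bg, island}
         [math.log(0.02), math.log(0.98)]]   # island -> {bg, island}
emit  = [[math.log(0.30), math.log(0.30), math.log(0.20), math.log(0.20)],
         [math.log(0.15), math.log(0.15), math.log(0.35), math.log(0.35)]]

def viterbi(seq):
    """Return the MAP hidden-state sequence for a DNA string."""
    obs = [NUC[c] for c in seq]
    delta = [init[s] + emit[s][obs[0]] for s in STATES]
    back = []
    for o in obs[1:]:
        ptr, new = [], []
        for s in STATES:
            # Best predecessor state for ending in state s at this step.
            best = max(STATES, key=lambda p: delta[p] + trans[p][s])
            ptr.append(best)
            new.append(delta[best] + trans[best][s] + emit[s][o])
        back.append(ptr)
        delta = new
    # Backtrack from the best final state.
    path = [max(STATES, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Under these placeholder parameters, a CG-rich stretch decodes to island states and an AT-rich stretch to background states.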
We compare AdaMP and RCTreeMP on varying-size stretches (10^2-10^5 bp) of human chromosome 21 obtained from the NCBI database. We train the parameters of the HMM with one of the standard CpG prediction tools, CpG Island Searcher (Takai & Jones, 2002). We perform a mutation every other nucleotide for each DNA stretch and compare the running times of both methods under different criteria in Fig. 4. In this experiment, v_ℓ = w_ℓ for all ℓ. Fig. 4(a) shows the speedup of AdaMP over RCTreeMP for varying sizes of DNA sequence. For medium to large sequences, AdaMP exhibits better performance; however, for very large sequences of size 10^5, the computational cost of determining the MAP sequence is nearly linear in the graph size (even though the cost of updating the delta messages remains remarkably low). In contrast, RCTreeMP depends only on the number of variables that changed since the previous iteration.

Figure 4. (a) Speedups of AdaMP over RCTreeMP for varying-size stretches of chr 21 (10^2-10^5 bp); the left y-axis shows the speedup over RCTreeMP, while the right y-axis shows the actual running times in sec (represented as lines). (b) Ratios of update times of AdaMP over RCTreeMP for different values of dist(w_{ℓ-1}, w_ℓ) (x-axis: dist(w_{ℓ-1}, w_ℓ), y-axis: speedup). The four log-log plots correspond to four different DNA stretches of 10^2, 10^3, 10^4, 10^5 bp size, respectively. For smaller distances, AdaMP outperforms, but for distances closer to the graph size N, RCTreeMP is preferable. The red line indicates a ratio of 1. (c) Both methods are not very sensitive to changes in the MAP sequence between consecutive iterations (x-axis: # of bp that differ between consecutive MAP sequences; legend: RCTreeMP, ρ = 0.09; AdaMP, ρ = 0.14).
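The distance-dependent trade-off visible in Fig. 4(b) follows directly from the per-iteration costs quoted at the start of this section: roughly |X|^2 (dist(w_{ℓ-1}, w_ℓ) + dist(w_ℓ, v_ℓ)) for AdaBP/AdaMP versus |X|^3 log N for RCTreeBP/RCTreeMP. A back-of-the-envelope predictor (our naming, constants dropped) looks like this:

```python
# Back-of-the-envelope cost model (constants dropped, our naming): predicts
# which method should be faster given the alphabet size |X|, the graph size N,
# and the average distances between consecutive measurement/marginal elements.
import math

def adabp_iter_cost(X, dist_ww, dist_wv):
    """AdaBP per-iteration cost: |X|^2 (dist(w_{l-1}, w_l) + dist(w_l, v_l))."""
    return X**2 * (dist_ww + dist_wv)

def rctree_iter_cost(X, N):
    """RCTreeBP per-iteration cost: |X|^3 log N."""
    return X**3 * math.log(N)

def prefer_adabp(X, N, avg_dist_ww, avg_dist_wv=0):
    # With v = w (as in the DNA experiment), dist(w_l, v_l) = 0 and AdaBP
    # should win whenever the average distance is below |X| log N.
    return adabp_iter_cost(X, avg_dist_ww, avg_dist_wv) < rctree_iter_cost(X, N)
```

For instance, with |X| = 4 and N = 10^5, the crossover sits near |X| log N ≈ 46, so schedules whose consecutive elements are a handful of nodes apart favor AdaBP, while schedules that jump across the chain favor RCTreeBP.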
Fig. 4(b) examines the relationship between performance and the distance between consecutive elements for DNA stretches of varying size (10^2-10^5 bp). As expected, AdaMP is very sensitive to the distance between consecutive elements dist(w_{ℓ-1}, w_ℓ). On the contrary, RCTreeMP depends only on the graph size N. AdaMP is preferred for measurement schedules with low average dist(w_{ℓ-1}, w_ℓ) (points above the red line), while RCTreeMP is preferred when the average distance is comparable to the graph size (points below the red line). Lastly, Fig. 4(c) shows that both methods are not very sensitive to changes in the MAP sequence between consecutive iterations.

As a second experiment, we analyzed temperature measurements collected from 53 wireless sensors at 30-sec intervals from the Intel Berkeley Research lab. We modeled the latent temperatures at the various locations of the lab as a grid graph. We assume that measurements obtained from a sensor are a noisy representation of the temperatures in its close vicinity. We further assume that temperatures evolve over time following linear dynamics, X_t = A X_{t-1} + V_{t-1}, where V_{t-1} ~ N(0, Q) and X_t represents the temperatures of the lab at time t. We learn the parameters A and Q by training on the data between Feb 28 and Mar 7, 2004 under a Normal-inverse-Wishart model. One of the primary goals in this setting is to estimate the covariance of the latent variables after the incorporation of measurements. Since the problem is modeled as a Gaussian HMM, we can use Kalman filter/smoothing updates. We use measurements in a 6-hour window on Feb 28, 2004 in a random order and compare the update times of AdaBP against standard Kalman filter/smoothing updates (RCTreeBP is not included for comparison here, since it is not applicable to Gaussian models). We see in Fig. 5(a) that AdaBP is consistently (1-42 times) faster than Kalman filtering/smoothing. Also, in Fig. 5(b), we observe the direct dependence of AdaBP on the distance between consecutive elements (w_{ℓ-1}, w_ℓ), (w_ℓ, v_ℓ), which makes it more appropriate for problems with small average distance.

Figure 5. (a) Speedup over the Kalman filter (KF). AdaBP is 1-42 times faster than standard Kalman filtering/smoothing techniques. (b) Running time per iteration of AdaBP and KF as a function of the distance between consecutive elements. AdaBP is much more sensitive to dist(w_{ℓ-1}, w_ℓ) and, as the figure suggests, it is much faster than KF when dist(w_{ℓ-1}, w_ℓ) is small. Dotted plots represent deviation due to different runs of the experiment.

7. Discussion

We presented a new algorithm, AdaBP, which is particularly suited to sequential inference problems where there is little or no knowledge of the measurement schedule in advance. In addition, when we can design the measurement order, we show in the supplement how to propose a nearly optimal schedule by casting schedule design as a shortest Hamiltonian path problem. We compared the method to standard BP and RCTreeBP. In the case of trees, standard BP incurs a prohibitive cost (O(N) messages per iteration), while AdaBP sends only the necessary messages between consecutive elements. We provided an extensive analysis of the algorithmic complexity with respect to the measurement schedule w and marginal schedule v. We showed that when consecutive measurement and marginal elements w_{ℓ-1}, w_ℓ, v_ℓ are close to each other, the complexity per iteration is of order O(1) and hence the overall complexity is O(N) (provided the sizes of w, v are close to N). In addition, advance knowledge of the measurement nodes allows for a nearly minimal schedule design. Lastly, we showed extensions of the algorithm to Gaussian loopy graphs and to the most likely sequence problem (which applies to the full latent graph).
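On the schedule-design point above: the supplement casts optimal schedule design as a shortest Hamiltonian path problem over the tree metric. As a rough stand-in (explicitly not the paper's construction), a greedy nearest-neighbor ordering already tends to keep consecutive measurement elements close:

```python
# Illustrative greedy nearest-neighbor heuristic for ordering measurement
# nodes on a tree so that consecutive elements tend to be close. This is a
# common approximation for shortest-Hamiltonian-path-style orderings and is
# NOT the paper's schedule-design method (see its supplement for that).
from collections import deque

def tree_dists(adj, src):
    """BFS distances from src in an unweighted tree given as adjacency lists."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def greedy_schedule(adj, nodes):
    """Order measurement nodes, always visiting the nearest unvisited one."""
    remaining = set(nodes[1:])
    order = [nodes[0]]
    while remaining:
        d = tree_dists(adj, order[-1])
        nxt = min(remaining, key=lambda v: d[v])
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

For example, on a length-5 chain 0-1-2-3-4 with measurement nodes {4, 0, 2}, starting from node 4 the heuristic visits node 2 (distance 2) before node 0 (distance 4), keeping the average consecutive distance small.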
An implementation of this algorithm is publicly available at https://github.com/geopapa11/adabp

Acknowledgments

The authors would like to thank the reviewers for their valuable comments and suggestions. This work was partially supported by the Army Research Office (ARO) Multidisciplinary Research Initiative (MURI) program (Award number W911NF-11-1-0391), NSF/DNDO Collaborative Research ARI-LA (Award ECCS-1348328), and by the Department of Energy (NA-22) Consortium for Verification Technology.

References

Acar, U. A., Ihler, A. T., Mettu, R. R., and Sümer, O. Adaptive Updates for MAP Configurations with Applications to Bioinformatics. In IEEE/SP 15th Workshop on Statistical Signal Processing (SSP), August 2009.

Bafna, V., Berman, P., and Fujito, T. A 2-Approximation Algorithm for the Undirected Feedback Vertex Set Problem. SIAM Journal on Discrete Mathematics, 12(3):289-297, Sep 1999. ISSN 0895-4801.

Chechetka, A. and Guestrin, C. Focused Belief Propagation for Query-Specific Inference. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2010.

Cormen, T. H., Stein, C., Rivest, R. L., and Leiserson, C. E. Introduction to Algorithms. McGraw-Hill Higher Education, 3rd edition, 2009.

Czumaj, A., Kowaluk, M., and Lingas, A. Faster algorithms for finding lowest common ancestors in directed acyclic graphs. Theoretical Computer Science, 380(1-2):37-46, July 2007.

Darwiche, A. and Hopkins, M. Using recursive decomposition to construct elimination orders, jointrees, and dtrees. In Trends in Artificial Intelligence, Lecture Notes in AI, pp. 180-191. Springer-Verlag, 2001.

Fischer, J. and Heun, V. A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proceedings of the 1st International Conference on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, ESCAPE '07, pp. 459-470. Springer-Verlag, 2007.

Harel, D. and Tarjan, R. E.
Fast Algorithms for Finding Nearest Common Ancestors. SIAM Journal on Computing, 13(2):338-355, May 1984.

Liu, Y., Chandrasekaran, V., Anandkumar, A., and Willsky, A. S. Feedback Message Passing for Inference in Gaussian Graphical Models. IEEE Transactions on Signal Processing, 60(8):4135-4150, Aug 2012.

Pearl, J. Reverend Bayes on inference engines: A distributed hierarchical approach. In Proceedings of the American Association of Artificial Intelligence National Conference (AAAI), pp. 133-136, 1982.

Rosales, R. and Jaakkola, T. S. Focused Inference. In Proceedings of the 10th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 317-324, January 2005.

Sümer, O., Acar, U. A., Ihler, A. T., and Mettu, R. R. Adaptive exact inference in graphical models. Journal of Machine Learning Research, 12:3147-3186, Nov 2011.

Takai, D. and Jones, P. A. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proceedings of the National Academy of Sciences (PNAS), 99(6):3740-3745, March 2002.

Wainwright, M. J., Jaakkola, T. S., and Willsky, A. S. MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697-3717, November 2005.

Weiss, Y. and Freeman, W. T. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13:2173-2200, 2001.

Wick, M. L. and McCallum, A. Query-Aware MCMC. In Advances in Neural Information Processing Systems 24, pp. 2564-2572, 2011.