Interpretable Graph-Based Semi-Supervised Learning via Flows

Raif M. Rustamov, James T. Klosowski
AT&T Labs Research

In this paper, we consider the interpretability of the foundational Laplacian-based semi-supervised learning approaches on graphs. We introduce a novel flow-based learning framework that subsumes the foundational approaches and additionally provides a detailed, transparent, and easily understood expression of the learning process in terms of graph flows. As a result, one can visualize and interactively explore the precise subgraph along which the information from labeled nodes flows to an unlabeled node of interest. Surprisingly, the proposed framework avoids trading accuracy for interpretability; in fact, it leads to improved prediction accuracy, which is supported both by theoretical considerations and empirical results. The flow-based framework guarantees the maximum principle by construction and can handle directed graphs out of the box.

1 Introduction
Classification and regression problems on networks and data clouds can often benefit from leveraging the underlying connectivity structure. One way of taking advantage of connectivity is provided by graph-based semi-supervised learning approaches, whereby the labels, or values, known on a subset of nodes are propagated to the rest of the graph. Laplacian-based approaches such as Harmonic Functions (Zhu, Ghahramani, and Lafferty 2003) and Laplacian Regularization (Belkin, Niyogi, and Sindhwani 2006) epitomize this class of methods. Although a variety of improvements and extensions have been proposed (Zhu 2008; Belkin, Niyogi, and Sindhwani 2006; Zhou and Belkin 2011; Wu et al. 2012; Solomon et al. 2014), the interpretability of these learning algorithms has not received much attention and remains limited to the analysis of the obtained prediction weights. In order to promote accountability and trust, it is desirable to have a more transparent representation of the prediction process that can be visualized, interactively examined, and thoroughly understood.

Despite the label propagation intuition behind these algorithms, devising interpretable versions of Harmonic Functions (HF) or Laplacian Regularization (LR) is challenging for a number of reasons. First, since these algorithms operate on graphs in a global manner, any interactive examination of the prediction process would require visualizing the underlying graph, which becomes too complex for even moderately sized graphs. Second, we do not have the luxury of trading prediction accuracy for interpretability: HF/LR have been superseded by newer methods, and we cannot afford falling too far behind the current state of the art in terms of prediction accuracy. Finally, HF and LR possess useful properties such as linearity and the maximum principle that are worth preserving.

In this paper, we introduce a novel flow-based semi-supervised learning framework that subsumes HF and LR as special cases, and overcomes all of these challenges. The key idea is to set up a flow optimization problem for each unlabeled node, whereby the flow sources are the labeled nodes, and there is a single flow destination: the unlabeled node under consideration. The source nodes can produce as much out-flow as needed; the destination node requires a unit in-flow.
Under these constraints, the flow optimizing a certain objective is computed, and the amount of flow drawn from each of the labeled nodes gives the prediction weights. The objective function contains an $\ell_1$-norm-like term, whose strength allows controlling the sparsity of the flow and, as a result, its spread over the graph. When this term is dropped, we prove that this scheme can be made equivalent to either the HF or LR method.

This approach was chosen for several reasons. First, the language of flows provides a detailed, transparent, and easily understood expression of the learning process, which facilitates accountability and trust. Second, the sparsity term results in flows that concentrate on smaller subgraphs; additionally, the flows induce directionality on these subgraphs. Smaller size and directedness allow using more intuitive graph layouts (Gansner and North 2000). As a result, one can visualize and interactively explore the precise subgraph along which the information from labeled nodes flows to an unlabeled node of interest. Third, the sparsity term injects locality into the prediction weights, which helps to avoid the flat, unstable solutions observed with pure HF/LR in high-dimensional settings (Nadler, Srebro, and Zhou 2009). Thus, not only do we avoid trading accuracy for interpretability, but in fact we gain in terms of prediction accuracy. Fourth, by construction, the flow-based framework is linear and results in solutions that obey the maximum principle, guaranteeing that the predicted values stay within the range of the provided training values. Finally, directed graphs are handled out of the box, and different weights for forward and backward versions of an edge are allowed.

The main contribution of this paper is the proposed flow-based framework (Section 2). We investigate the theoretical properties of the resulting prediction scheme (Section 3) and introduce its extensions (Section 4). After providing computational algorithms that effectively make use of the shared structure in the involved flow optimization problems (Section 5), we present an empirical evaluation both on synthetic and real data (Section 6).

2 Flow-based Framework
We consider the transductive classification/regression problem formulated on a graph $G = (V, E)$ with node-set $V = \{1, 2, ..., n\}$ and edge-set $E = \{1, 2, ..., m\}$. A function $f(\cdot)$ known on a subset of labeled nodes $1, 2, ..., n_l$ needs to be propagated to the unlabeled nodes $n_l + 1, n_l + 2, ..., n$. We concentrate on approaches where the predicted values depend linearly on the values at labeled nodes. Such linearity immediately implies that the predicted value at an unlabeled node $s$ is given by
$$f(s) = \sum_{i=1}^{n_l} w_i(s) f(i), \qquad (1)$$
where for each $i = 1, 2, ..., n_l$, the weight $w_i(\cdot)$ captures the contribution of the labeled node $i$.

In this section, we assume that the underlying graph is undirected, and that the edges $e \in E$ of the graph are decorated with dissimilarities $d_e > 0$, whose precise form will be discussed later. For an unlabeled node $s$, our goal is to compute the weights $w_i(s)$, $i = 1, 2, ..., n_l$. To this end, we set up a flow optimization problem whereby the flow sources are the labeled nodes, and there is a single flow destination: the node $s$. The source nodes can produce as much out-flow as needed; the sink node $s$ requires a unit in-flow.
Under these constraints, the flow optimizing a certain objective function is computed, and the amount of flow drawn from each of the source nodes $i = 1, 2, ..., n_l$ gives us the desired prediction weights $w_i(s)$.

With these preliminaries in mind, we next write out the flow optimization problem formally. We first arbitrarily orient each edge of the undirected graph, and let $A$ be the $n \times m$ signed incidence matrix of the resulting directed graph: for edge $e$, going from node $i$ to node $j$, we have $A_{ie} = +1$ and $A_{je} = -1$. This matrix will be useful for formulating the flow conservation constraint. Note that flow conservation holds at the unlabeled nodes, so we will partition $A$ into two blocks. Rows $1, ..., n_l$ of $A$ constitute the $n_l \times m$ matrix $A_l$, and the remaining rows make up $A_u$; these sub-matrices correspond to labeled and unlabeled nodes respectively. The right-hand side for the conservation constraints is captured by the $(n - n_l) \times 1$ column-vector $b_s$ whose entries correspond to unlabeled nodes. All entries of $b_s$ are zero except the $s$-th entry, which is set to $-1$, capturing the fact that there is a unit in-flow at the sink node $s$. For each edge $e$, let the sought flow along that edge be $x_e$, and let $x$ be the column-vector with $e$-th entry equal to $x_e$. Our flow optimization problem is formulated as
$$\min_{x \in \mathbb{R}^m} \; \frac{1}{2} \sum_{e=1}^{m} d_e \left( x_e^2 + \lambda |x_e| \right) \quad \text{subject to} \quad A_u x = b_s, \qquad (2)$$
where $\lambda \geq 0$ is a trade-off parameter. The prediction weight $w_i(s)$ is given by the flow amount drawn from the labeled node $i$. More precisely, the weight vector is computed as $w(s) = A_l x$ for the optimal $x$; a small computational sketch of this construction is given below. Note that the optimization problem is strictly convex and, as a result, has a unique solution. Furthermore, the optimization problem must be solved separately for each unlabeled node $s$, i.e. for all different vectors $b_s$, and the predicted value at node $s$ is computed via $f(s) = \sum_{i=1}^{n_l} w_i(s) f(i) = w(s)^T f_l$.

Flow Subgraphs
Our optimization problem resembles that of elastic nets (Hastie, Tibshirani, and Friedman 2009), and the $\ell_1$-norm-like first-order term makes the solution sparse. For a given value of $\lambda$, taking the optimal flow's support (all edges that have non-zero flow values, together with the nodes incident to these edges), we obtain a subgraph $G_s(\lambda)$ along which the flows get propagated to the node $s$. This subgraph together with the flow values on the edges constitutes an expressive summary of the learning process, and can be analyzed or visualized for further analysis. Potentially, further insights can be obtained by considering $G_s(\lambda)$ for different values of $\lambda$, whereby one gets a regularization path of flow subgraphs.

In addition, the optimal flow induces directionality on $G_s(\lambda)$ as follows. While the underlying graph was oriented arbitrarily, we get a preferred orientation on $G_s(\lambda)$ by retaining the directionality of the edges with positive flows, and flipping the edges with negative flows; this process completely removes arbitrariness since the edges of $G_s(\lambda)$ have non-zero flow values. In addition, this process makes the optimal flow positive on $G_s(\lambda)$; it vanishes on the rest of the underlying graph.

Discussion
To understand the main features of the proposed framework, it is instructive to look at the tension between the quadratic and first-order terms in the objective function. While the first-order term tries to concentrate flows along the shortest paths, the quadratic term tries to spread the flow broadly over the graph.
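As promised above, here is a minimal computational sketch of solving problem (2) for a single unlabeled node. This is not the authors' implementation: the solver choice (cvxpy) and the function name are ours, and the incidence blocks $A_l$, $A_u$, the dissimilarity vector $d$, and the sink index are assumed to be given.

```python
# Hedged sketch: solve the per-node flow problem of Eq. (2) and read off the
# prediction weights w(s) = A_l x.  A_l, A_u are the labeled/unlabeled blocks of
# the signed incidence matrix, and d holds the edge dissimilarities d_e.
import numpy as np
import cvxpy as cp

def flow_weights(A_l, A_u, d, sink_idx, lam=0.1):
    b_s = np.zeros(A_u.shape[0])
    b_s[sink_idx] = -1.0                      # unit in-flow at the sink node s

    x = cp.Variable(A_l.shape[1])             # one flow variable per edge
    objective = 0.5 * cp.sum(cp.multiply(d, cp.square(x) + lam * cp.abs(x)))
    cp.Problem(cp.Minimize(objective), [A_u @ x == b_s]).solve()
    return A_l @ x.value                      # w_i(s): flow drawn from labeled node i
```

By Proposition 4 below, the returned weights are non-negative and sum to one, which gives a quick sanity check for such a construction. We now return to the tension between the quadratic and first-order terms.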
This is formalized in Section 3 by considering two limiting cases of the parameter $\lambda$ and showing that our scheme provably reduces to HF when $\lambda = 0$, and to the 1-nearest-neighbor prediction when $\lambda \to \infty$. Thanks to the former limiting case, our scheme inherits from HF the advantage of leveraging the community structure of the underlying graph. The latter case injects locality and keeps the spread of the flow under control.

Limiting the spread of the flow is crucially important for interactive exploration. The flow subgraphs $G_s(\lambda)$ get smaller as the parameter $\lambda > 0$ increases. As a result, these subgraphs, which serve as the summary of the learning process, can be more easily visualized and interactively explored. Another helpful factor is that, with the induced orientation, $G_s(\lambda)$ is a directed acyclic graph (DAG); see Section 3 for a proof. The resulting topological ordering gives an overall direction and hierarchical structure to the flow subgraph, rendering it more accessible to a human companion due to the conceptual resemblance with commonly used flowchart diagrams.

Limiting the spread of the flow seemingly inhibits fully exploiting the connectivity of the underlying graph. Does this lead to a less effective method? Quite surprisingly, the complete opposite is true. HF and LR have been known to suffer from flat, unstable solutions in high-dimensional settings (Nadler, Srebro, and Zhou 2009). An insightful perspective on this phenomenon was provided in (von Luxburg, Radl, and Hein 2010; 2014), which showed that random walks of the type used in Laplacian-based methods spread too thinly over the graph, get lost in space, and, as a result, carry minimal useful information. Therefore, limiting the spread within our framework can be seen as an antidote to this problem. Indeed, as empirically confirmed in Section 6, in contrast to HF/LR which suffer from almost uniform weights, flow-based weights with $\lambda > 0$ concentrate on a sparse subset of labeled nodes and help to avoid the flat, unstable solutions.

3 Theoretical Properties
In this section we prove a number of properties of the proposed framework. We first study its behavior at the limiting cases of $\lambda = 0$ and $\lambda \to \infty$. Next, we prove that for all values of the parameter $\lambda$, the subgraphs $G_s(\lambda)$ supporting the flow are acyclic and that the maximum principle holds.

Limiting Behavior
Introduce the column-vector $d$ consisting of the dissimilarities $d_e$, and the diagonal matrix $D = \mathrm{diag}(d)$. It is easy to see that the matrix $L = A D^{-1} A^T$ is the un-normalized Laplacian of the underlying graph with edge weights (similarities) given by $1/d_e$. As usual, the edge weights enter the Laplacian with a negative sign, i.e. for each edge $e = (i, j)$ we have $L_{ij} = -1/d_{ij}$.

Proposition 1. When $\lambda = 0$, the flow-based prediction scheme is equivalent to the Harmonic Functions approach with the Laplacian $L = A D^{-1} A^T$.

Proof. The Harmonic Functions method uses the Laplacian matrix $L$ and constructs predictions by optimizing $\min_{f \in \mathbb{R}^n} \sum_{(i,j) \in E} (-L_{ij})(f_i - f_j)^2$ subject to reproducing the values at labeled nodes, $f_i = f(i)$ for $i \in \{1, 2, ..., n_l\}$. By considering the partitions of the Laplacian along labeled and unlabeled nodes, we can write the solution of the HF method as $f_u = -L_{uu}^{-1} L_{ul} f_l$; compare to Eq. (5) in (Zhu, Ghahramani, and Lafferty 2003). When $\lambda = 0$, the flow optimization problem Eq. (2) is quadratic with linear constraints, and so can be carried out in closed form (see e.g. Section 4.2.5 of (Boyd et al. 2011)).
The optimal flow vector is given by $x = D^{-1} A_u^T (A_u D^{-1} A_u^T)^{-1} b_s$, and the weights are computed as $w(s) = A_l x = A_l D^{-1} A_u^T (A_u D^{-1} A_u^T)^{-1} b_s$. When we plug these weights into the prediction formula (1), we obtain the predicted value at $s$ as $f(s) = w(s)^T f_l = b_s^T (A_u D^{-1} A_u^T)^{-1} A_u D^{-1} A_l^T f_l$. Since $b_s$ is zero except at the position corresponding to $s$, where it is $-1$ (this is the reason for the negative sign below), we can put together the formulas for all separate $s$ into a single expression $f_u = -(A_u D^{-1} A_u^T)^{-1} A_u D^{-1} A_l^T f_l$. By using the Laplacian $L = A D^{-1} A^T$ and considering its partitions along labeled and unlabeled nodes, we can re-write the formula as $f_u = -L_{uu}^{-1} L_{ul} f_l$, giving the same solution as the HF method.

Remark. The converse is true as well: for any Laplacian $L$ built using non-negative edge weights (recall that off-diagonal entries of such Laplacians are non-positive), the same predictions as HF can be obtained via our approach at $\lambda = 0$ with appropriate costs $d_e$. When the Laplacian matrix $L$ is symmetric, this easily follows from Proposition 1 by simply setting $d_{ij} = -1/L_{ij}$ for all $(i, j) \in E$. However, when $L$ is not symmetric (e.g. after Markov normalization), we can still match the HF prediction results by setting $d_{ij} = -2/(L_{ij} + L_{ji})$ for all $(i, j) \in E$. The main observation is that while $L$ may be asymmetric, it can always be symmetrized without inducing any change in the optimization objective of the HF method. Indeed, the objective $\sum_{(i,j) \in E} (-L_{ij})(f_i - f_j)^2$ is invariant to setting $L_{ij} \leftarrow (L_{ij} + L_{ji})/2$. Now using Proposition 1, we see that by letting $d_{ij} = -2/(L_{ij} + L_{ji})$, our framework with $\lambda = 0$ reproduces the same predictions as the HF method.

Proposition 2. When one formally sets $\lambda = \infty$, the flow-based prediction scheme is equivalent to the 1-nearest-neighbor prediction.

Proof. In this setting, the first-order term dominates the cost, and the flow converges to a unit flow along the shortest path (with respect to the costs $d_e$) from the closest labeled node, say $i^*(p)$, to node $p$. We can then easily see that the resulting prediction weights are all zero, except $w_{i^*(p)}(p) = 1$. Thus, the prediction scheme (1) becomes equivalent to the 1-nearest-neighbor prediction.

Acyclicity of Flow Subgraphs
Recall that the optimal solution induces a preferred orientation on the subgraph $G_s(\lambda)$; this orientation renders the optimal flow positive on $G_s(\lambda)$ and zero on the rest of the underlying graph.

Proposition 3. For all $\lambda \geq 0$, the oriented flow subgraph $G_s(\lambda)$ is acyclic.

Proof. Suppose that $G_s(\lambda)$ has a cycle of the form $k_1 \to \cdots \to k_r \to k_1$. The optimal flow $x$ is positive along all of the edges of $G_s(\lambda)$, including the edges in this cycle; let $x_0 > 0$ be the smallest of the flow values on the cycle. Let $x_{\mathrm{sub}}$ be the flow vector corresponding to the flow along this cycle with constant flow value of $x_0$. Note that $x - x_{\mathrm{sub}}$ is a feasible flow, and it has strictly lower cost than $x$ because the optimization objective is diminished by a decrease in the components of $x$. This contradiction proves the acyclicity of $G_s(\lambda)$.

Maximum Principle
It is well-known that harmonic functions satisfy the maximum principle. Here we provide its derivation in terms of the flows, showing that the maximum principle holds in our framework for all settings of the parameter $\lambda$.

Proposition 4. For all $\lambda \geq 0$ and for all $s$, we have $w_i(s) \geq 0$ for all $i$, and $\sum_{i=1}^{n_l} w_i(s) = 1$.

Proof. Non-negativity of the weights holds because a flow having a negative out-flow at a labeled node can be made less costly by removing the corresponding sub-flow. The formal proof is similar to the proof of Proposition 3.
First, make the flow non-negative by re-orienting the underlying graph (which was oriented arbitrarily). Now suppose that the optimal solution $x$ results in $w_i(s) < 0$ for some $i$, meaning that the labeled node $i$ has an in-flow of magnitude $|w_i(s)|$. Since all of the flow originates from labeled nodes, there must exist $j \in \{1, ..., n_l\} \setminus \{i\}$ and a flow-path $j \to k_1 \to \cdots \to k_r \to i$ with positive flow values along the path; let $x_0 > 0$ be the smallest of these flow values. Let $x_{\mathrm{sub}}$ be the flow vector corresponding to the flow along this path with constant flow value of $x_0$. Note that $x - x_{\mathrm{sub}}$ is a feasible flow, and it has strictly lower cost than $x$ because the optimization objective is diminished by a decrease in the components of $x$. This contradiction proves the non-negativity of the weights.

The weights, or equivalently the out-flows from the labeled nodes, add up to 1 because the destination node absorbs a total of a unit flow. More formally, note that by definition, $\mathbf{1}^T A = 0$. Splitting this along labeled and unlabeled nodes, we get $\mathbf{1}^T A = \mathbf{1}_l^T A_l + \mathbf{1}_u^T A_u = 0$. Right-multiplying by $x$ gives $\mathbf{1}_l^T A_l x + \mathbf{1}_u^T A_u x = \mathbf{1}_l^T w(s) + \mathbf{1}_u^T b_s = \mathbf{1}_l^T w(s) - 1 = 0$, or $\sum_{i=1}^{n_l} w_i(s) = 1$ as desired.

This proposition together with Eq. (1) guarantees that the function $f(\cdot)$ obeys the maximum principle.

4 Extensions

LR and Noisy Labels
Laplacian Regularization has the advantage of allowing training with noisy labels. Here, we discuss the modifications needed to reproduce LR via our framework in the limit $\lambda = 0$. This strategy allows incorporating noisy training labels into the flow framework for all $\lambda \geq 0$. Consider the objective used by LR: $\mu^{-1} \sum_{i=1}^{n_l} (f_i - \hat{f}_i)^2 + \sum_{(i,j) \in E} (-L_{ij})(f_i - f_j)^2$, where $\hat{f}_i$ are the provided noisy labels/values and $\mu$ is the strength of the Laplacian regularization. In this formulation, we need to learn $f_i$ for both labeled and unlabeled nodes. The main observation is that the soft labeling terms $\mu^{-1}(f_i - \hat{f}_i)^2$ can be absorbed into the Laplacian by modifying the data graph. For each labeled node $i \in \{1, 2, ..., n_l\}$, introduce a new anchor node $i_a$ with a single edge connecting $i_a$ to node $i$ with the weight of $\mu^{-1}/2$ (halving offsets the doubling in the regularizer sum). Consider HF learning on the modified graph, where the labeled nodes are the anchor nodes only, i.e. optimize the HF objective on the modified graph with the hard constraints $f_{i_a} = \hat{f}_i$; clearly, this is equivalent to LR. Given the relationship between HF and the flow formulation, this modification results in a flow formulation of LR at the limiting case of $\lambda = 0$.

Directed Graphs
Being based on flows, our framework can treat directed graphs very naturally. Indeed, since the flow can only go along the edge direction, we have a new constraint $x_e \geq 0$, giving the optimization problem:
$$\min_{x \in \mathbb{R}^m,\; x \geq 0} \; \frac{1}{2} \sum_{e=1}^{m} d_e \left( x_e^2 + \lambda x_e \right) \quad \text{subject to} \quad A_u x = b_s. \qquad (3)$$
Even when $\lambda = 0$, this scheme is novel: due to the non-negativity constraint, the equivalence to HF no longer holds. The naturalness of our formulation is remarkable when compared to the existing approaches to directed graphs that use different graph normalizations (Zhou, Schölkopf, and Hofmann 2004; Zhou, Huang, and Schölkopf 2005), co-linkage analysis (Wang, Ding, and Huang 2010), or asymmetric dissimilarity measures (Subramanya and Bilmes 2011). This formulation can also handle graphs with asymmetric weights. Namely, one may have different costs for going forward and backward: the cost of the edge $i \to j$ can differ from that of $j \to i$. If needed, such cost differentiation can be applied to undirected graphs by doubling each edge into forward and backward versions.
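Continuing the earlier sketch (again a hedged illustration rather than the authors' code), the directed variant (3) only changes the sign constraint on the flow variable and replaces $|x_e|$ by $x_e$:

```python
# Hedged sketch of the directed formulation, Eq. (3): flow may only follow the
# edge directions, so x_e >= 0 and the sparsity term becomes lambda * x_e.
import numpy as np
import cvxpy as cp

def flow_weights_directed(A_l, A_u, d, sink_idx, lam=0.1):
    b_s = np.zeros(A_u.shape[0])
    b_s[sink_idx] = -1.0                               # unit in-flow at the sink
    x = cp.Variable(A_l.shape[1], nonneg=True)         # x_e >= 0
    objective = 0.5 * cp.sum(cp.multiply(d, cp.square(x) + lam * x))
    cp.Problem(cp.Minimize(objective), [A_u @ x == b_s]).solve()
    return A_l @ x.value
```

Asymmetric costs are then handled by simply giving the forward and backward copies of an edge different entries in the cost vector $d$.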
In addition, it may be useful to add opposing edges (with higher costs) into directed graphs to make sure that the flow problem is feasible even if the underlying graph is not strongly connected.

5 Computation
The flow optimization problems discussed in Section 4 can be solved by existing general convex or convex quadratic solvers. However, general-purpose solvers cannot use the shared structure of the problem, namely that everything except the vector $b_s$ is fixed. Here, we propose solvers based on the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011) that allow caching the Cholesky factorization of a relevant matrix and reusing it in all of the iterations for all unlabeled nodes. We concentrate on the undirected version because of space limitations.

In ADMM form, the problem (2) can be written as
$$\min_{x, z \in \mathbb{R}^m} \; g(x) + \frac{1}{2} \sum_{e \in E} d_e x_e^2 + \frac{\lambda}{2} \sum_{e \in E} d_e |z_e| \quad \text{subject to} \quad x - z = 0,$$
where $g(x)$ is the $0/\infty$-valued indicator function of the set $\{x \,|\, A_u x = b_s\}$. Stacking the dissimilarities $d_e$ into a column-vector $d$, and defining the diagonal matrix $D = \mathrm{diag}(d)$, the ADMM algorithm then consists of the iterations:
$$x^{k+1} := \arg\min_x \; g(x) + \tfrac{1}{2} x^T D x + \rho \, \| x - z^k + u^k \|_2^2$$
$$z^{k+1} := S_{d\lambda/(4\rho)}(x^{k+1} + u^k)$$
$$u^{k+1} := u^k + x^{k+1} - z^{k+1}$$
Here, $S_a(b) = (b - a)_+ - (-b - a)_+$ is the component-wise soft-thresholding function. The $x$-iteration step can be computed in closed form as the solution of an equality-constrained quadratic program, cf. Section 4.2.5 of (Boyd et al. 2011). Letting $M = \mathrm{diag}(1/(2\rho + d))$, we first solve the linear system $A_u M A_u^T y = 2\rho A_u M (z^k - u^k) - b_s$ for $y$, and then let $x^{k+1} := 2\rho M (z^k - u^k) - M A_u^T y$. The most expensive step is the involved linear solve. Fortunately, the matrix $A_u M A_u^T$ is both sparse (same fill as the graph Laplacian) and positive-definite. Thus, we compute its Cholesky factorization and use it throughout all iterations and for all unlabeled nodes, since $A_u M A_u^T$ contains only fixed quantities. We initialize the iterations by setting $z^0 = u^0 = 0$, and declare convergence when both $\max |x^{k+1} - z^{k+1}|$ and $\max |z^{k+1} - z^k|$ fall below a pre-set threshold. As explained in Section 3.3 of (Boyd et al. 2011), these quantities are related to primal and dual feasibilities. Upon convergence, the prediction weights are computed using the formula $w(s) = A_l z$; thanks to soft-thresholding, using $z$ instead of $x$ avoids having a multitude of negligible non-zero entries.

Figure 1: $f(\cdot)$ for a mixture of two Gaussians in $\mathbb{R}^{10}$ for a number of methods: (a) Harmonic; (b) p-Voltages, $p = 1.1$; (c) Iterated, $m = 4$; (d) Flow, $\lambda = 0.05$.

6 Experiments
In this section, we validate the proposed framework and then showcase its interpretability aspect.

6.1 Validation
We experimentally validate a number of claims made about the flow-based approach: 1) limiting the spread of the flow via the sparsity term results in improved behavior in high-dimensional settings; 2) this improved behavior holds for real-world data sets and leads to increased predictive accuracy over HF; 3) our approach does not fall far behind the state of the art in terms of accuracy. We also exemplify the directed/asymmetric formulation on a synthetic example.
We compare the flow framework to the original HF formulation and also to two improvements of HF designed to overcome the problems encountered in high-dimensional settings: p-Voltages, the method advocated by Alamgir and von Luxburg (Alamgir and von Luxburg 2011) (they call it q-Laplacian regularization) and further studied in (Bridle and Zhu 2013; Alaoui et al. 2016); and Iterated Laplacians, a state-of-the-art method proposed in (Zhou and Belkin 2011).

For data clouds we construct the underlying graph as a weighted 20-nearest-neighbor graph. The edge weights are computed using a Gaussian RBF with $\sigma$ set as one-third of the mean distance between a point and its tenth nearest neighbor (Chapelle, Schölkopf, and Zien 2006); a construction sketch is given below. The normalized graph Laplacian $L$ is used for the HF and Iterated Laplacian methods. The edge costs for the flow approach are set by $d_{ij} = -2/(L_{ij} + L_{ji})$ as described in Section 3. The weights for p-Voltages are computed as $d_{ij}^{-1/(p-1)}$; see (Alamgir and von Luxburg 2011; Bridle and Zhu 2013).

Mixture of Two Gaussians
Consider a mixture of two Gaussians in 10-dimensional Euclidean space, constructed as follows. We sample 500 points from each of the two Gaussians with $\mu_1 = \mu_2 = 0$ and $\Sigma_1 = \Sigma_2 = I$. The points from the two Gaussians are respectively shifted by $-1.5$ and $+1.5$ units along the first dimension. A single labeled point is used for each Gaussian; the labels are $-1$ for the left Gaussian and $+1$ for the right one. According to (Alamgir and von Luxburg 2011; Alaoui et al. 2016), an appropriate value of $p$ for p-Voltages is $(10 + 1)/10 = 1.1$.

In Figure 1, we plot the estimated functions $f$ for the various methods. The sampled points are projected along the first dimension, giving the x-axis in these plots; the y-axis shows the estimator $f$. Figure 1 (a) confirms the flatness of the HF solution as noted in (Nadler, Srebro, and Zhou 2009). In classification problems flatness is undesirable, as the solution can be easily shifted by a few random labeled samples to favor one or the other class. For regression problems, the estimator $f$ is unsuitable as it fails to be smooth, which can be seen from the discrepancy between the values at labeled nodes (shown as red triangles) and their surroundings. The p-Voltages solution, Figure 1 (b), avoids the instability issue, but is not completely satisfactory in terms of smoothness. The iterated Laplacian and flow methods, Figure 1 (c, d), suffer neither from flatness nor instability. Note that in contrast to the iterated Laplacian method, the flow formulation satisfies the maximum principle, and $f$ stays in the range $[-1, 1]$ established by the provided labels.

Benchmark Data Sets
Next we test the proposed method on the high-dimensional image and text datasets used in (Zhou and Belkin 2011), including MNIST 3vs8, MNIST 4vs9, aut-avn, ccat, gcat, pcmac, and real-sim. For each of these, we use a balanced subset of 1000 samples. In each run we use $n_l = 50$ labeled samples; in addition, we withhold 50 samples for validation. The misclassification rate is computed on the remaining 900 samples. The results are averaged over 20 runs.
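The graph construction described at the beginning of this subsection can be sketched as follows. This is a hedged illustration: the exact RBF normalization and the symmetrization rule are our assumptions, and the function name is ours.

```python
# Hedged sketch: weighted 20-NN graph with Gaussian RBF similarities, where sigma
# is one-third of the mean distance between a point and its 10th nearest neighbor.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, k=20):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    dist, idx = nbrs.kneighbors(X)
    sigma = dist[:, 10].mean() / 3.0                     # 1/3 of mean 10th-NN distance
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j, dij in zip(idx[i, 1:], dist[i, 1:]):
            w = np.exp(-dij ** 2 / (2.0 * sigma ** 2))   # assumed form of the Gaussian RBF
            W[i, j] = W[j, i] = max(W[i, j], w)          # symmetrize the kNN graph
    return W
```

The similarities $W_{ij}$ then yield the normalized Laplacian for HF/Iterated Laplacians and, via the relation in Section 3, the edge costs $d_{ij}$ for the flow approach.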
For p-Voltages, the value of $p$ for each run is chosen from $\{1.0625, 1.125, 1.25, 1.5, 2\}$ using the best-performing value on the validation samples; note that $p = 2$ corresponds to HF. For the iterated Laplacian method, $m$ is chosen from $\{1, 2, 4, 8, 16\}$, with $m = 1$ being equivalent to HF. For the flow approach, the value of $\lambda$ is chosen from $\{0, 0.025, 0.05, 0.1, 0.2\}$, where $\lambda = 0$ is equivalent to HF.

Table 1: Misclassification rates (%) and standard deviations for different methods.
Dataset     | Harmonic    | p-Voltages | Iterated   | Flow
MNIST 3vs8  | 8.5 ± 2.4   | 7.8 ± 1.8  | 6.1 ± 2.3  | 6.2 ± 1.6
MNIST 4vs9  | 22.3 ± 8.2  | 13.3 ± 2.8 | 8.5 ± 2.0  | 9.6 ± 2.7
AUT-AVN     | 27.9 ± 13.7 | 19.0 ± 5.1 | 11.5 ± 2.0 | 14.6 ± 2.7
CCAT        | 33.3 ± 8.5  | 25.4 ± 3.5 | 21.5 ± 3.4 | 22.3 ± 3.2
GCAT        | 20.0 ± 11.6 | 13.6 ± 3.7 | 9.2 ± 2.6  | 10.0 ± 1.7
PCMAC       | 30.6 ± 11.2 | 21.4 ± 3.2 | 14.1 ± 1.6 | 18.2 ± 3.3
REAL-SIM    | 31.5 ± 12.1 | 22.4 ± 3.6 | 15.3 ± 2.8 | 17.6 ± 3.3

Table 2: Misclassification rates (%) and standard deviations for different settings of the parameter λ.
Dataset     | Harmonic    | λ = 0.025  | λ = 0.05   | λ = 0.1    | λ = 0.2
MNIST 3vs8  | 8.5 ± 2.4   | 5.8 ± 1.6  | 5.9 ± 1.5  | 6.2 ± 1.4  | 6.7 ± 1.4
MNIST 4vs9  | 22.3 ± 8.2  | 9.4 ± 2.8  | 9.3 ± 2.7  | 9.7 ± 2.7  | 10.3 ± 2.7
AUT-AVN     | 27.9 ± 13.7 | 14.0 ± 2.6 | 14.8 ± 2.4 | 16.3 ± 2.3 | 18.4 ± 2.4
CCAT        | 33.3 ± 8.5  | 21.9 ± 3.0 | 22.5 ± 2.8 | 23.2 ± 2.5 | 24.5 ± 2.5
GCAT        | 20.0 ± 11.6 | 9.9 ± 1.6  | 10.8 ± 1.6 | 12.4 ± 1.8 | 14.1 ± 1.7
PCMAC       | 30.6 ± 11.2 | 17.2 ± 2.9 | 18.0 ± 2.5 | 19.5 ± 2.2 | 21.0 ± 2.2
REAL-SIM    | 31.5 ± 12.1 | 17.1 ± 3.4 | 17.7 ± 3.0 | 19.1 ± 2.7 | 21.1 ± 2.4

The results summarized in Table 1 show that the flow approach outperforms HF and p-Voltages, even with the added benefit of interpretability (which the other methods do not have). The flow approach does have slightly lower accuracy than the state-of-the-art iterated Laplacian method, but we believe that this difference is a tolerable tradeoff for applications where interpretability is required.

Next, we compare HF and the flow approach, but this time instead of selecting $\lambda$ by validation we use fixed values. The results presented in Table 2 demonstrate that the flow approach, for each of the considered values, consistently outperforms the base case of HF (i.e. $\lambda = 0$). Note that larger values of $\lambda$ lead to an increased misclassification rate, but the change is not drastic. This is important for interactive exploration, because larger $\lambda$ values produce smaller flow subgraphs $G_s(\lambda)$ that are easier for the human companion to grasp.

Finally, we show that the improvement over HF observed in these experiments relates to the flatness issue and, therefore, is brought about by limiting the spread of the flow. By analyzing the prediction weights in Eq. (1), we can compare HF to the flow-based approach in terms of the flatness of solutions. To focus our discussion, let us concentrate on a single run of the MNIST 3vs8 classification problem. Since the weights for the flow-based and HF approaches lend themselves to a probabilistic interpretation (by Proposition 4, they are non-negative and add up to 1), we can look at the entropies of the weights, $H(s) = -\sum_i w_i(s) \log w_i(s)$, for every unlabeled node $s = n_l + 1, ..., n$. For a given unlabeled node $s$, the maximum entropy is achieved for uniform weights: $w_i(s) = 1/n_l$ for all $i$.

Figure 2: Weight statistics for MNIST 3vs8; panel (a): Harmonic.

Figure 3: $f(\cdot)$ for (a) symmetric and (b) asymmetric edge costs; node colors encode $f(\cdot)$ in bins from $[-1, -.75)$ up to $[.75, 1)$.

Figure 2 shows the histograms of entropies, together with a red vertical line corresponding to the maximum possible entropy.
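A small sketch of this entropy diagnostic (our illustration; the weight-matrix layout is an assumption, not the authors' code):

```python
# Hedged sketch: per-node entropy of the prediction weights.  W is assumed to be a
# (n - n_l) x n_l array whose rows are the weights w_i(s) (non-negative, summing to 1).
import numpy as np

def weight_entropies(W, eps=1e-12):
    P = np.clip(W, eps, None)                 # avoid log(0) for exactly-zero weights
    return -(P * np.log(P)).sum(axis=1)       # H(s) = -sum_i w_i(s) log w_i(s)

# For uniform weights w_i(s) = 1/n_l the entropy attains its maximum, log(n_l).
```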
In contrast to the flow-based approach, the entropies from the HF method are clustered closely around the red line, demonstrating that there is little variation in the HF weights, which is a manifestation of the flatness issue.

Directed/Asymmetric Graphs
In this synthetic example, we demonstrate the change in the estimated $f(\cdot)$ induced by the use of asymmetric edge costs in the directed formulation given by Eq. (3). The underlying graph represents the road network of Minnesota, with edges showing the major roads and vertices being their intersections. As in the previous example, we pick two nodes (shown as triangles), label them with $\pm 1$, and depict the resulting estimator $f(\cdot)$ for the remaining unlabeled nodes using the shown color-coding. In Figure 3 (a), the graph is undirected, i.e. the edge costs are symmetric. In Figure 3 (b), we consider the directed graph obtained by doubling each edge into forward and backward versions. This time the costs for edges going from south to north ($\mathrm{lat}_i < \mathrm{lat}_j$) are multiplied by four. The latter setting makes flows originating at the top labeled node cheaper to travel towards the south, and therefore shifts the decision boundary to the south.

6.2 Interpretability
As stated previously, one of the core benefits of the proposed framework is how it provides a detailed, transparent, and easily understood expression of the learning process. To exemplify this aspect of the proposed framework, we focus our discussion on a fixed data point $s$ from MNIST 3vs8 and study the corresponding flow subgraphs $G_s(\lambda)$ for $\lambda = 0.2$, $0.1$, and $0.05$; see Figure 4. The digit image for the unlabeled node $s$ is outlined in red; the labeled node images are outlined in blue. The flows along the edges are in percents, and have been rounded to avoid clutter.

As expected, the size of the subgraph depends on the parameter $\lambda$. Recall that for each $\lambda > 0$, the flow optimization automatically selects a subset of edges carrying non-zero flow; this is akin to variable selection in sparse regression (Hastie, Tibshirani, and Friedman 2009). The figure confirms that the larger the strength of the sparsity penalty, the fewer edges get selected, and the smaller is the resulting subgraph. The amount of reduction in size is substantial: the underlying graph for this experiment has 1K nodes and ~15K edges.

As claimed, the acyclicity of the flow subgraphs $G_s(\lambda)$ leads to more intuitive layouts. We used the dot filter from Graphviz (Gansner et al. 1993; Gansner and North 2000), which is targeted towards visualizing hierarchical directed graphs. Since our flow subgraphs are DAGs, the dot filter gave satisfactory layouts without any manual tweaking. Indeed, all of the visualizations depicted in Figure 4 provide an overall sense of directionality, here from top to bottom: due to acyclicity there are no edges going "backward". This makes it easy to trace the flow from the sources, the labeled nodes, to the sink, the unlabeled node $s$.

Of course, the visualizations in Figure 4 benefit from the fact that the data points are digits, which allows including their images in lieu of abstract graph nodes. The same approach could be used for general image data sets as well. For other data types such as text, the nodes could depict some visual summary of the data point, and then provide a more detailed summary upon a mouse hover.
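To illustrate the visualization pipeline, here is a hedged sketch (the function name and output conventions are ours) that turns an optimal flow into a Graphviz DOT description suitable for the dot layout; it retains only the support of the flow and applies the induced orientation:

```python
# Hedged sketch: export a flow subgraph G_s(lambda) to Graphviz DOT.  `edges` maps
# an (arbitrarily oriented) edge (i, j) to its optimal flow value x_e; edges with
# negative flow are flipped so that all displayed flows are positive.
def flow_subgraph_to_dot(edges, tol=1e-6):
    lines = ["digraph Gs {", "  rankdir=TB;"]          # top-to-bottom hierarchical layout
    for (i, j), flow in edges.items():
        if abs(flow) <= tol:
            continue                                   # edge is not in the support of the flow
        src, dst = (i, j) if flow > 0 else (j, i)      # induced orientation
        lines.append(f'  {src} -> {dst} [label="{100 * abs(flow):.0f}%"];')
    lines.append("}")
    return "\n".join(lines)
```

Because the subgraph is a DAG (Proposition 3), the dot filter produces a layered, flowchart-like drawing without manual tweaking.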
Figure 4: Flow subgraphs $G_s(\lambda)$ showing the information propagation from the labeled nodes (outlined in blue) to the unlabeled node $s$ (outlined in red): (a) Flow, $\lambda = 0.2$, 16 nodes and 22 edges; (b) Flow, $\lambda = 0.1$, 38 nodes and 55 edges; (c) Flow, $\lambda = 0.05$, 69 nodes and 105 edges. The flow values on the edges are in percents, and may fail to add up due to rounding.

7 Discussion and Future Work
We have presented a novel framework for graph-based semi-supervised learning that provides a transparent and easily understood expression of the learning process via the language of flows. The nature of interpretability in this work differs from approaches that use feature vectors directly, which provide feature-level interpretability revealing how much each feature contributes to the classification result. Our framework provides relational interpretability that exposes how each of the data points, labeled and unlabeled, contributes to the classification of a test example.

Our work is inspired by (Alamgir and von Luxburg 2011), which pioneered the use of the flow interpretation of the standard resistance distances (cf. (Bollobas 1998)) to introduce a novel class of resistance distances that avoid the pitfalls of the standard resistance distance by concentrating the flow on fewer paths. They also proved an interesting phase transition behavior that led to specific suggestions for Laplacian-based semi-supervised learning. However, their proposal for semi-supervised learning is not based on flows, but rather on an appropriate setting of the parameter in the p-Voltages method (which they call q-Laplacian regularization); it was further studied in (Bridle and Zhu 2013; Alaoui et al. 2016). Although it is already apparent from the experimental results that our proposal is distinct from p-Voltages, we stress that there is a fundamental theoretical difference: the p-Voltages method is non-linear and so cannot be expressed via Eq. (1), except when $p = 2$. At a deeper level, these methods regularize the estimated function via its value discrepancies at adjacent nodes, and, therefore, do not directly provide a detailed understanding of how the values propagate along the graph.

One immediate direction for future work is to obtain a flow formulation for the iterated Laplacian method, which gave the best accuracy on the benchmark data sets. This may seem straightforward to do, as the iterated Laplacian method basically replaces the Laplacian operator $L$ in the regularizer by its power $L^m$. However, the matrix $L^m$ contains both positive and negative off-diagonal entries, and so the corresponding edge costs are no longer positive, which renders the flow interpretation problematic. An indirect manifestation of this issue is the lack of a maximum principle for the iterated Laplacian method. Another complicating factor is that the stencils of $L^m$ grow very fast with $m$; e.g. $L^4$ is nearly a full matrix in the benchmark examples of Section 6. Additional avenues for future study include algorithmic improvements and applications in other areas.
The locality of the resulting weights can be used to devise faster algorithms that operate directly on an appropriate neighborhood of the given unlabeled node. Studying the sparsity parameter $\lambda$ for phase transition behavior can provide guidance for its optimal choice. In another direction, since the flow-based weights are non-negative and add up to 1, they are suitable for semi-supervised learning of probability distributions (Solomon et al. 2014). Finally, the flow-based weights can be useful in different areas, such as descriptor-based graph matching or shape correspondence in computer vision and graphics.

Acknowledgments
We thank Eleftherios Koutsofios for providing expertise on graph visualization and helping with Graphviz. We are grateful to Simon Urbanek for computational support, and Cheuk Yiu Ip and Lauro Lins for help with generating visualizations in R.

References
Alamgir, M., and von Luxburg, U. 2011. Phase transition in the family of p-resistances. In Advances in Neural Information Processing Systems 24, 379-387.
Alaoui, A. E.; Cheng, X.; Ramdas, A.; Wainwright, M. J.; and Jordan, M. I. 2016. Asymptotic behavior of ℓp-based Laplacian regularization in semi-supervised learning. In COLT, 879-906.
Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR 7:2399-2434.
Bollobas, B. 1998. Modern Graph Theory. Graduate Texts in Mathematics. Heidelberg: Springer.
Boyd, S. P.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1-122.
Bridle, N., and Zhu, X. 2013. p-voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. Eleventh Workshop on Mining and Learning with Graphs (MLG 2013).
Chapelle, O.; Schölkopf, B.; and Zien, A., eds. 2006. Semi-Supervised Learning. Cambridge, MA: MIT Press.
Gansner, E. R., and North, S. C. 2000. An open graph visualization system and its applications to software engineering. Softw. Pract. Exper. 30(11):1203-1233.
Gansner, E. R.; Koutsofios, E.; North, S. C.; and Vo, K.-P. 1993. A technique for drawing directed graphs. IEEE Trans. Softw. Eng. 19(3):214-230.
Hastie, T. J.; Tibshirani, R. J.; and Friedman, J. H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York: Springer.
Nadler, B.; Srebro, N.; and Zhou, X. 2009. Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In Advances in Neural Information Processing Systems 22, 1330-1338.
Solomon, J.; Rustamov, R.; Guibas, L.; and Butscher, A. 2014. Wasserstein propagation for semi-supervised learning. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 306-314.
Subramanya, A., and Bilmes, J. 2011. Semi-supervised learning with measure propagation. JMLR 12:3311-3370.
von Luxburg, U.; Radl, A.; and Hein, M. 2010. Getting lost in space: Large sample analysis of the resistance distance. In Lafferty, J. D.; Williams, C. K. I.; Shawe-Taylor, J.; Zemel, R. S.; and Culotta, A., eds., Advances in Neural Information Processing Systems 23, 2622-2630.
von Luxburg, U.; Radl, A.; and Hein, M. 2014. Hitting and commute times in large random neighborhood graphs. Journal of Machine Learning Research 15(1):1751-1798.
Wang, H.; Ding, C.; and Huang, H. 2010.
Directed graph learning via high-order co-linkage analysis. In ECML PKDD '10, 451-466.
Wu, X.; Li, Z.; So, A. M.; Wright, J.; and Chang, S. 2012. Learning with partially absorbing random walks. In Advances in Neural Information Processing Systems 25, 3086-3094.
Zhou, X., and Belkin, M. 2011. Semi-supervised learning by higher order regularization. ICML 15:892-900.
Zhou, D.; Huang, J.; and Schölkopf, B. 2005. Learning from labeled and unlabeled data on a directed graph. In ICML, 1036-1043.
Zhou, D.; Schölkopf, B.; and Hofmann, T. 2004. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems 17, 1633-1640.
Zhu, X.; Ghahramani, Z.; and Lafferty, J. D. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 912-919.
Zhu, X. 2008. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.