# statistical_analysis_of_metric_graph_reconstruction__aa4776a2.pdf Journal of Machine Learning Research 15 (2014) 3425-3446 Submitted 5/13; Revised 6/14; Published 10/14 Statistical Analysis of Metric Graph Reconstruction Fabrizio Lecci lecci@cmu.edu Alessandro Rinaldo arinaldo@cmu.edu Larry Wasserman larry@cmu.edu Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213, USA Editor: Matthias Hein A metric graph is a 1-dimensional stratified metric space consisting of vertices and edges or loops glued together. Metric graphs can be naturally used to represent and model data that take the form of noisy filamentary structures, such as street maps, neurons, networks of rivers and galaxies. We consider the statistical problem of reconstructing the topology of a metric graph embedded in RD from a random sample. We derive lower and upper bounds on the minimax risk for the noiseless case and tubular noise case. The upper bound is based on the reconstruction algorithm given in Aanjaneya et al. (2012). Keywords: metric graph, filament, reconstruction, manifold learning, minimax estimation 1. Introduction We are concerned with the problem of estimating the topology of filamentary data structure. Data sets consisting of points roughly aligned along intersecting or branching filamentary paths embedded in 2 or higher dimensional spaces have become an increasingly common type of data in a variety of scientific areas. For instance, road reconstruction based on GPS traces, localization of earthquakes faults, galaxy reconstruction are all instances of a more general problem of estimating basic topological features of an underlying filamentary structure. The recent paper by Aanjaneya et al. (2012), upon which our work is based, contains further applications, as well as numerous references. To provide a more concrete example, consider Figure 1. The left hand side displays raw data portraying a neuron from the hippocampus of a rat (Guly as et al., 1999). The data were obtained from Neuro Morpho.Org (Ascoli et al., 2007). The right hand side of the figure shows the output of the metric graph reconstruction obtained using the algorithm analyzed in this paper, originally proposed by Aanjaneya et al. (2012). The reconstruction, which takes the form of a graph, captures perfectly all the topological features of the neuron, namely, the relationship between the edges and vertices, the number of branching points and the degree of each node. Metric graphs provide the natural geometric framework for representing intersecting filamentary structures. A metric graph embedded in a D-dimensional Euclidean space (D 2) is a 1-dimensional stratified metric space. It consists of a finite number of points (0dimensional strata) and curves (1-dimensional strata) of finite length, where the boundary c 2014 Fabrizio Lecci, Alessandro Rinaldo and Larry Wasserman. Lecci, Rinaldo and Wasserman of each curve is given by a pair (of not-necessarily distinct) vertices (see the next section for a formal definition of a metric graph). In this paper we study the problem of reconstructing the topology of metric graphs from possibly noisy data, from a statistical point of view. Specifically, we assume that we have a sample of points from a distribution supported on a metric graph or in a small neighborhood and we are interested in recovering the topology of the corresponding metric graph. To this end, we use the metric graph reconstruction algorithm given in Aanjaneya et al. (2012). Furthermore, in our theoretical analysis we characterize explicitly the minimal sample size required for perfect topological reconstruction as a direct function of parameters defining the shape of the metric graph, introduced in Section 2. This leads to an upper bound on the risk of topological reconstruction. Finally, we obtain a lower bound on the risk of topological reconstruction, which, in the noiseless case, almost matches the derived upper bound, indicating that the algorithm of Aanjaneya et al. (2012) behaves nearly optimally. Outline. In Section 2 we formally define metric graphs, the statistical models we will consider and the assumptions we will use throughout. We will also describe several geometric quantities that are central to our analysis. Section 3 contains detailed analysis of the performance of algorithm of Aanjaneya et al. (2012) for metric graph reconstruction, under modified settings and assumptions. In Section 4 we derive lower and upper bounds for the minimax risk of metric graph reconstruction problem. In Section 5 we conclude with some final comments. Related Work. The work most closely related to ours is Aanjaneya et al. (2012) which was, in fact, the motivation for our work. From the theoretical side, we replace the key assumption in Aanjaneya et al. (2012) of the sample being a (ε, R)-approximation to the underlying metric graph, by the milder assumption of the sample being dense in a neighborhood of the metric graph. Approximation and reconstruction of metric graphs has also been considered in Chazal and Sun (2013) and Ge et al. (2011). Metric graph reconstruction is related to the problem of estimating stratified spaces (basically, intersecting manifolds). Stratified spaces have been studied by a number of authors such as Bendich et al. (2010, 2012). A spectral method for estimating intersecting structures is given in Arias-Castro et al. (2011). There are a variety of algorithms for specific problems, for example, see Ahmed and Wenk (2012); Chen et al. (2010) for the reconstruction of road networks. Finally, Chernov and Kurlin (2013) derived an alternative algorithm that uses ideas from homology. 2. Background and Assumptions The assumptions in Aanjaneya et al. (2012) lead to a reconstruction process that is aimed at capturing the intrinsic structure of the data and is somewhat oblivious to its extrinsic embedding. The authors assume that the sample comes with a metric that is close to the intrinsic metric of the underlying graph, by imposing a limit on the Gromov-Hausdorff distance between the two metrics. By considering data embedded in the Euclidean space and focusing on the topological aspect, we show that the notion of dense sample is sufficient to guarantee a correct reconstruction. Statistical Analysis of Metric Graph Reconstruction Figure 1: Left: Neuron cr22e from the hippocampus of a rat; Neuro Morpho.Org (Ascoli et al., 2007). Right: A metric graph reconstruction of the neuron. In this section we provide background on metric graph spaces and describe the assumptions and the geometric parameters that we will be using throughout. Informally, a metric graph is a collection of vertices and edges glued together in some fashion. Here we state the formal definitions of path metric space and metric graph. For more details see Aanjaneya et al. (2012) and Kuchment (2004). Definition 1 A metric space (G, d G) is a path metric space if the distance between any pair of points is equal to the infimum of the lengths of the continuous curves joining them. A metric graph is a path metric space (G, d G) that is homeomorphic to a 1-dimensional stratified space. A vertex of G is a 0-dimensional stratum of G and an edge of G is a 1-dimensional stratum of G. We will consider metric graphs embedded in RD. Note that, if one ignores the metric structure, namely the length of edges and loops, the shape or topology of a metric graph (G, d G) is encoded by a graph, whose vertices and edges correspond to vertices and edges of G. Since we allow for two vertices to be connected by more than one edge we are actually dealing with pseudographs. We recall that an undirected pseudograph (V, E) is a set of vertices V , a multiset E of unordered pairs of (not necessarily distinct) vertices. To a given pseudograph we can associate a function f : E V V , which, when applied to an edge e E, simply extracts the vertices to which e is adjacent. Thus, if e1, e2 E are such that f(e1) = f(e2), then e1 and e2 are parallel edges. Similarly, if e E is such that f(e) = {v, v} for some v V , then e is a loop. For each pair (u, v) V V , let ν(u, v) = |f 1({u, v})| if {u, v} E and 0 otherwise. In particular, ν(u, v) is the number of edges between u and v (or loops if u = v). Lecci, Rinaldo and Wasserman We say that a metric graph reconstruction algorithm perfectly recovers the topology of G if outputs a pseudograph isomorphic to the pseudograph representing the topology of G. We now define some key quantities regarding the structure of a metric graph. We start with the definition of reach. Let M be a 1-dimensional manifold embedded in RD. Let Tu M denote the 1-dimensional tangent space to M and let T u M be the (D 1)-dimensional normal space. Definition 2 Define the fiber of size a at u M to be La(u, M) = T u M T B(u, a), where B(u, a) is the D-dimensional ball of radius a centered at u. If M has boundary {v1, v2}, the fiber of size a at vi is defined as the limit of La(u, M), as u approaches vi in M\{v1, v2}. The reach of M is the largest number τ such that the fibers Lτ(u, M) never intersect. The reach sets a limit on the curvature of a manifold. A manifold with large reach does not come too close to be self-intersecting. For example the reach of an arc of a circle is equal to its radius. The quantity 1/τ is called the condition number in Niyogi et al. (2008). For more details see also Federer (1959); Chazal and Lieutier (2006); Genovese et al. (2012a). Each edge of a metric graph (G, d G) can be seen as a 1-dimensional manifold with boundary. Let the local reach of metric graph G be the minimum reach associated to an edge of G. When 2 edges intersect at a vertex v they create an angle, where the angle between two intersecting curves is formally defined as follows. Suppose that e1 and e2 intersect at x. Let B(x, ϵ) be the D-dimensional ball of radius ϵ centered at x. Let ℓ1(ϵ) be the line segment joining the two points x and B(x, ϵ) T e1. Let ℓ2(ϵ) be the line segment joining the two points x and B(x, ϵ) T e2. Let αϵ(e1, e2) be the angle between ℓ1(ϵ) and ℓ2(ϵ). The angle between e1 and e2 is α(e1, e2) = limϵ 0 αϵ(e1, e2). We assume that, for each pair of intersecting edges e1 and e2, the angle α(e1, e2) is well-defined. To control points far away in the graph distance, but close in the embedding space, we define AG = {(x, x ) G G : d G(x, x ) min(b, τα)}, where b is the shortest edge of G, τ is the local condition number and α is the smallest angle formed by two edges of G. We define the global reach as the infimum of the Euclidean distances among pairs of point in AG, that is ξ = inf AG x x 2. Let (G, d G) be a metric graph and, for a constant σ 0, let Gσ = {y : infx G ||x y||2 σ} be the σ-tube around G. If σ = 0, then, trivially, Gσ = G. Notice that Gσ is a set of dimension D if σ > 0. We will use the assumption that the sample Y is sufficiently dense in Gσ with respect to the Euclidean metric, as formalized below. Definition 3 The sample Y = {y1, . . . , yn} Gσ RD is δ 2-dense in Gσ if for every x Gσ, there exists a y Y such that x y 2 < δ The problem of metric graph reconstruction consists of reconstructing a metric graph G given a dense sample {y1, . . . , yn} = Y Gσ endowed with a distance d Y, which could be the D-dimensional Euclidean distance or some more complicate notion of distance. If σ = 0 we say that the sample Y is noiseless, while if σ > 0, we say that Y is a noisy sample. Throughout our analysis we restrict the attention to metric graphs embedded in RD that satisfy the following assumptions: Statistical Analysis of Metric Graph Reconstruction A1 The graphs have finite total length and are free of nodes of degree 2 (though they may contain vertices of degree 1 or 3 and higher). A2 Each edge is a smooth embedded sub-manifold of dimension 1, of length at least b > 0 and with reach at least τ > 0. A3 Each pair of intersecting edges forms a well-defined angle of size at least α > 0. A4 The global reach is at least ξ > 0. Assumptions A1 and A2 allow us to consider each edge of a metric graph as a single smooth curve. A3 and A4 are additional regularity conditions on the separation between different edges. Assumptions similar to A1-A4 are common in the literature. For different regularity conditions that allow for corners within an edge see, for example, Chazal et al. (2009) and Chen et al. (2010). Let G be the set of metric graphs embedded in RD that satisfy assumptions A1, A2, A3 and A4, involving the parameters b, α, τ, ξ. We consider two noise models: Noiseless. We observe data Y1, . . . , Yn P, where P P, a collection of probability distributions supported over metric graphs (G, d G) in G having densities p with respect to the length of G bounded from below by a constant a > 0. Tubular Noise. We observe data Y1, . . . , Yn PG,σ where PG,σ is uniform on the σ-tube Gσ. In this case we consider the collection P = {PG,σ : G G}. We are interested in bounding the minimax risk Rn = inf b G sup P P P n b G G , (1) where the infimum is over all estimators b G of the topology of (G, d G), the supremum is over the class of distributions P for Y and b G G means that b G and G are not isomorphic. In Section 4 we will find lower and upper bounds for Rn in the noiseless case and the tubular noise case. We conclude this section by summarizing the many parameters and symbols involved in our analysis. See Table 1. 3. Performance Analysis for the Algorithm of Aanjaneya et al. (2012) In this section we study the performance of the metric graph reconstruction algorithm of Aanjaneya et al. (2012), under assumptions A1-A4 and with a choice of parameters adapted to our setting. In Section 4 we will use these results to derive bounds on the minimax rate for topology reconstruction. The metric graph reconstruction algorithm is presented in Algorithm 1. The algorithm takes a (possibly noisy) sample Y from a metric graph G and a distance d Y defined on Y and returns a graph b G that approximates G. The key idea is the following: Lecci, Rinaldo and Wasserman Symbol Meaning (G, d G) metric graph α smallest angle b shortest edge τ local reach ξ global reach G set of metric graphs embedded in RD, satisfying A1-A4 P set of distributions on G or Gσ Gσ σ tube around G Y sample, subset of Gσ δ Y is a δ/2-dense sample Table 1: Summary of the symbols used in our analysis. a shell of radius r is constructed around each point in the sample, which is labeled edge point if its shell contains 2 well separated clusters of sampled points and vertex point otherwise. Several steps of the algorithm require the construction of a Rips-Vietoris graph of parameter δ: Rδ(Sy) is a graph whose vertices are all the points of Sy and there is an edge between two points if the Euclidean distance between them is not larger than δ. At Step 11 some of the edge points that are close to vertices are re-labeled as vertex points. This expansion guarantees a precise borderline between clusters of vertex points and clusters of edge points. At steps 15-17 each of these clusters is associated to a vertex or to an edge of the reconstructed graph b G. We will analyze the algorithm considering the Euclidean Algorithm 1 Metric Graph Reconstruction Algorithm Input: sample Y, d Y, r, p11. 1: Labeling points as edge or vertex 2: for all y Y do 3: Sy B(y, r + δ)\B(y, r) 4: degr(y) Number of connected components of Rips-Vietoris graph Rδ(Sy) 5: if degr(y) = 2 then 6: Label y as a edge point 7: else 8: Label y as a preliminary vertex point. 9: end if 10: end for. 11: Label all points within Euclidean distance p11 from a preliminary vertex point as vertices. 12: Let E be the point of Y labeled as edge points. 13: Let V be the point of Y labeled as vertices. 14: Reconstructing the graph structure 15: Compute the connected components of the Rips-Vietoris graphs Rδ(E) and Rδ(V). 16: Let the connected components of Rδ(V) be the vertices of of the reconstructed graph b G. 17. Let there be an edge between vertices of b G if their corresponding connected components in Rδ(V) contain points at distance less than δ from the same connected component of Rδ(E). Output: b G. Statistical Analysis of Metric Graph Reconstruction distance on the sample Y, that is, d Y = 2. The inner radius of the shell at Step 3 and the width of the expansion at Step 11 are parameters the user has to specify. Before finding how dense a sample has to be in order to guarantee a correct reconstruction of a metric graph, we show that it is sufficient to study a particular metric graph embedded in R2, which represents the worst case. In other words, if the metric graph algorithm can reconstruct this particular planar graph, then it can reconstruct any other metric graph that satisfies A1-A4. 3.1 The Worst Case: a Metric Graph in R2 The worst case is the one for which it is hard to distinguish two edges that intersect at a vertex because they are too close in the embedding space. Figure 2 (top left) shows an edge e that intersects two edges e1, e2 with reach τ, forming an angle α at vertex x. In the plots, the embedding space is R3 (D = 3) and we show the projections of e, e1 and e2 on the (limit) plane formed by e1 and e2, passing through x. Figure 2: Even in the worst case, edges e1 and e2 must lie outside of the torii constructed on the fibers Lτ(x, e1) and Lτ(x, e2). We focus on edge e2. The fiber Lτ(x, e2) of size τ around x is a (D 1)-dimensional ball centered at x and orthogonal to e2. In R3, Lτ(x, e2) is a disk of radius τ, whose projection on the plane is the segment AB. By definition, for any y e2, the fiber Lτ(y, e2) can not intersect the fiber Lτ(x, e2), otherwise the assumption involving the reach τ would be violated. We represent this condition by considering a D-dimensional ball of radius τ, centered at each point of the boundary of Lτ(x, e2). Edge e2 must lie outside of these balls, Lecci, Rinaldo and Wasserman in a feasible region that we denote by R2, so that its fibers do not intersect Lτ(x, e2). In R3, this procedure forms a horn torus (a torus with no hole) around vertex x. See the top right plot of Figure 2. The same reasoning applies to edge e1, which must lie in the region R1, outside of the balls of radius τ centered at each point of the boundary of Lτ(x, e1). See the bottom left plot. At each given distance from vertex x, two points of e1 and e2 are as close as possible when they lie on the boundaries of R1 and R2, on the same (limit) plane formed by e1 and e2, passing through x. When e1 and e2 lie on this plane, on the boundaries of the two feasible regions, they are as close as possible in the embedding space. This worst case is represented in the bottom right plot of Figure 2. Note that e1 and e2 are simply arcs of circles of radius τ. Figure 3: Left: edges e1 and e2 with minimum reach τ forming the smallest angles α at vertex x. Right: same metric graph with a tube of radius σ around it. We will use basic trigonometric properties of the worst case. In Figure 3 (left), O and O are the centers of the circles associated to edges e1 and e2. It is easy to see that angle Obx O has width π α. It can be shown that x b OO = α/2, (2) T bx Q = α/4. (3) Let Y be a noisy sample of G. In other words Y is a subset of Gσ, the tube of radius σ 0 around the metric graph G. See Figure 3 (right). Let Q be the midpoint of segment OO and let T be the intersection point of OO and edge e1. For 0 σ QT = τ τ cos(α/2), the smallest angle formed by the inner faces of the tube around the metric graph is α = π arccos 2(τ σ)2 4τ 2 cos2(α/2) 2(τ σ)2 , (4) where we applied the cosine law to the triangle Os O and the fact that angle Obs O has width π α . Note that if σ = 0 then α = α. As in (3), it can be shown that Rbs Q = α /4. (5) Statistical Analysis of Metric Graph Reconstruction The few basic trigonometric equations described above will be used to determine under which conditions on b, α, τ, ξ, σ the metric graph reconstruction algorithm can reconstruct the worst case. 3.2 Analysis of Algorithm 1 with Euclidean Distance In this section we analyze Algorithm 1. The Euclidean distance is used at every step of the algorithm, which requires the specification of r, the inner radius of the shell, and p11, the parameter governing the expansion of Step 11. We set 2 + σ + τ sin(α/2) (τ σ) sin(α /2) + δ 2 sin(α /4) (6) and p11 = δ 2 + τ sin(α/2) (τ σ) sin(α /2) + r + δ sin(α /2) (7) This choice is justified in the proof of Proposition 4. Define f(b, α, τ, ξ, σ) := (τ σ) sin min(b,ατ) (α α )τ 2τ [τ sin(α/2) (τ σ) sin(α /2)] 1 + 2 sin(α /2) 2σ sin(α /2) 1 + 3[sin(α /2)] 1 + [sin(α /2) sin(α /4)] 1 , where α is given in 4. Note that f(b, α, τ, ξ, σ) is a decreasing function of σ. Proposition 4 If Y is δ 2-dense in Gσ and 0 < r + δ < ξ 2σ, (9) 0 < δ < f(b, α, τ, ξ, σ), (10) then the graph b G provided by Algorithm 1 (input: Y, 2, r, p11) is isomorphic to G. Proof Our objective is to use Algorithm 1 to reconstruct edges and vertices of a metric graph G embedded in RD. Condition (9) guarantees that points of G which are far apart in the metric graph distance d G, and close in the embedding space, do not interfere in the construction of the shells at Steps 3-4. Therefore we can restrict the attention to adjacent edges in a neighborhood of the vertex at which they intersect. In particular, since Algorithm 1 is based on the Euclidean distance between the edges, if we can distinguish two adjacent edges that are as close as possible in the embedding space, then we can distinguish any other pair of adjacent edges. As shown in Section 3.1, two adjacent edges e1 and e2, forming an angle of width α at vertex x, are as close as possible when they lie on the same plane, on the boundaries of the feasible regions determined by the condition on the reach (assumption A2). We will show that under conditions (9) and (10), Algorithm 1 can reconstruct this worst case. This will imply that the algorithm can reconstruct the topology of other vertices and edges in the D-dimensional space. Lecci, Rinaldo and Wasserman The rest of the proof involves condition (10). Since the sample is δ 2-dense in the tube, there is at least a point y Y inside the ball of radius δ 2 centered at any vertex x G. When using Algorithm 1 we want to be sure that y is labeled as a vertex, that is, the number of connected components of the shell around y is different than 2 (Steps 3-4). The worst case is depicted in Figure 4 (left), where x is the vertex of minimum angle α, formed by two edges, e1 and e2 of reach τ. First, we show that for the the value of r selected in (6), points close to an actual vertex are labeled as vertices at Steps 3-10 and points far from actual vertices are labeled as edges. The inner faces of the tube of radius σ around e1 and Figure 4: Left: edges e1 and e2 with minimum reach τ forming the smallest angles α at vertex x. Right: The distance F G 2 between the two connected components of the shell around an edge point y must be greater than δ. e2 form an angle of width α at vertex s, as described in Section 3.1. Let u and v be the two points on the faces of the tube such that they are equidistant from x and u v 2 = δ. Since at Step 4 we construct a δ-graph to determine the number of connected components of the shell Sy and we want y to be a vertex, we choose r, the inner radius of the shell Sy, so that if u, v Y then r max{d Y(y, u), d Y(y, v)}. This guarantees that t1, t2 Y with t1 around edge e1, t2 around edge e2 such that {t1, t2} Sy, we have d Y(t1, t2) δ, that is t1 and t2 belong to different connected components of the shell around y at Step 4. The distance between y and u is bounded by y x 2 + x s 2 + s u 2, where, using (2), x s 2 = x Q 2 s Q 2 = τ sin(α/2) (τ σ) sin(α /2) and using (5), s u 2 δ 2 sin(α /4). (11) Therefore we require that r, the inner radius of the shell of Step 4 satisfies 2 + x s 2 + δ 2 sin(α /4) (12) y x 2 + x s 2 + s u 2. Statistical Analysis of Metric Graph Reconstruction Another condition on r arises when we label edge points far from actual vertices. See Figure 4 (right). If y Y, then it should be labeled as an edge point. That is, at Step 4, the Rips graph Rδ(Sy ) on the shell Sy should have 2 connected components. Therefore the distance F G 2 between them must be greater than δ. We require that which implies F G 2 > δ when r is small enough, as implied by (10). Note that the value r = δ 2 + σ + x s 2 + δ 2 sin(α /4) satisfies both (12) and (13). The outer radius of the shell at Steps 3-4 has length r + δ. This guarantees that when the shell around an edge point intersects the tube around G there is at least a point y Y in each connected component of the shell, since Y is δ 2-dense in Gσ. In the last part of this proof we show that condition (10) is needed to guarantee that the sample is dense enough and the radius of the shells of Step 3 has the correct size, so that, even in the worst case, each vertex is associated to one set of sampled points at Steps 15-17 and these connected components are correctly linked by sets of sampled points labeled as edge points. Let z Gσ be the point around e2 where the segment of length r + δ, orthogonal to the face of the tube around edge e1, intersects the face of the tube around edge e2. See Figure 5. If this segment does not exist we simply consider the segment of length r + δ from s to a point z on e2. Figure 5: The shell around z is tangent to edge e2. Suppose z Y. Among the points that might be labeled as vertices at Step 6 because of their closeness to vertex x, z is the furthest from x, since the shell around z is tangent to the tube around e1. At Step 11, in order to control the labeling of the points in the tube between y and z we would like to label all the points in {y Y : y y 2 y z 2} as vertices. To simplify the calculation we use the following bound y z 2 y x 2 + x s 2 + s z 2, where, using (5), s z 2 r + δ sin(α /2). (14) Lecci, Rinaldo and Wasserman This justifies the choice of p11 = δ 2 + x s 2 + r + δ sin(α /2) y z 2. Thus, at Step 11 we label all the points in {y Y : y y 2 p11 and y is labeled as vertex at Step 6 } as vertices. If z is actually labeled as a vertex at Step 6, then through the expansion of Step 11, all the points at distance not greater than p11 from z are labeled as vertices. Finally we determine under which conditions there is at least a point in the tube around e2 labeled as an edge point after Step 11. Consider the worst case in which e1 and e2 are forming an angle of size α at both their extremes x and x . See Figure 6. Figure 6: Edges e1 and e2, forming an angle of size α at both their extremes x and x . All the points y Y such that y z 2 p11 or y z 2 p11 might be labeled as vertices. When we construct R(E)δ and R(V)δ at Step 15 the two sets of vertices around x and x must be disconnected and there must be at least an edge point between them. A sufficient condition is that the length of edge e2 is greater than 2(a1 + a2 + a3) + a4, where a1 is the length of the arc of e2 formed by the projections of lines Ox and Os on e2, a2 is the length of the arc of e2 formed by the projection of the chord of length s z 2, a3 is the length of the arc of e2 formed by the projection of the chord of length p11, a4 is the length of the arc of e2 formed by the projection of the chord of length δ. Note that, in Figure 6, e2 = 2τ arcsin x x 2 2τ = ατ but in general it might be shorter, so that e1 and e2 might not intersect in x . However, by assumptions A2, e2 must be longer than b. Thus we require min(b, ατ) > 2(a1 + a2 + a3) + a4. (15) By simple properties involving arcs and chords we have τ, a2 = 2τ arcsin s z 2 a3 = 2τ arcsin p11 2(τ σ) , a4 = 2τ arcsin δ 2(τ σ) Statistical Analysis of Metric Graph Reconstruction Since the arcsin is superadditive in [0, 1] we require the stronger condition min(b, ατ) (α α )τ > 2τ arcsin 2 s z 2 + 2p11 + δ which holds if sin min(b, ατ) (α α )τ > 2 r+δ sin(α /2) + 2p11 + δ The last condition is equivalent to (10). If this condition is satisfied then the graph is correctly reconstructed at Steps 15-17: every connected component of Rδ(V) corresponds to a vertex of G and every connected component of Rδ(E) corresponds to an edge of G. Example 1 A Neuron in Three-Dimensions. We return to the neuron example and we try to apply Propositions 4 to the 3D data of Figure 1, namely the neuron cr22e from the hippocampus of a rat (Guly as et al., 1999). The data were obtained from Neuro Morpho.Org (Ascoli et al., 2007). The total length of the graph is 1750.86µm. We assume the smallest edge has length 100µm, the smallest angle π/3, the local reach 30µm and ξ = 50µm. The conditions of Proposition 4 are satisfied for δ = 2.00µm. Algorithm 1 reconstructs the topology of the metric graph starting from a δ/2-dense sample. Figure 1b shows the reconstructed graph. 4. Minimax Analysis In this section we derive lower and upper bound for the minimax risk Rn = inf b G sup P P P n b G G , (16) where, as described in Section 2, the infimum is over all estimators b G of the metric graph G, the supremum is over the class of distributions P for Y and b G G means that b G and G are not isomorphic. 4.1 Lower Bounds To derive a lower bound on the minimax risk, we make repeated use of Le Cam s lemma. See, e.g., Yu (1997) and Chapter 2 of Tsybakov (2008). Recall that the total variation distance between two measures P and Q on the same probability space is defined by TV(P, Q) = sup A |P(A) Q(A)| where the supremum is over all measurable sets. It can be shown that TV(P, Q) = P(H) Q(H), where H = {y : p(y) q(y)} and p and q are the densities of P and Q with respect to any measure that dominates both P and Q. Lemma 5 (Le Cam) Let Q be a set of distributions. Let θ(Q) take values in a metric space with metric ρ. Let Q1, Q2 Q be any pair of distributions in Q. Let Y1, . . . , Yn be drawn iid from some Q Q and denote the corresponding product measure by Qn. Then inf bθ sup Q Q EQn h ρ(bθ, θ(Q)) i 1 8ρ(θ(Q1), θ(Q2))(1 TV(Q1, Q2))2n (17) where the infimum is over all the estimators of θ(Q). Lecci, Rinaldo and Wasserman Below we apply Le Cam s lemma using several pairs of distributions. Any pair Q1, Q2 is associated with a pair of metric graphs G , G G. We take θ(Q1) and θ(Q2) to be the classes of graphs that are isomorphic to G and G . We set ρ(θ(Q1), θ(Q2)) = 0 if G and G are isomorphic and ρ(θ(Q1), θ(Q2)) = 1 otherwise. Figure 7 shows several pairs of metric graphs that are used to derive lower bounds in the noiseless case and in the tubular noise case. In the noiseless case we ignore the σ-tubes around the metric graphs. Figure 7: Pairs of metric graphs used in the derivation of lower bounds in the noiseless case and in the tubular noise case. Recall that, in the noiseless case, we restrict the attention to probability distributions supported over metric graphs (G, d G) in G, having densities p with respect to the length of G bounded from below by a constant a > 0. Theorem 6 In the noiseless case (σ = 0), for b b0(a), α α0(a), ξ ξ0(a), τ τ0(a), where b0(a), α0(a), ξ0(a) and τ0(a) are constants which depend on a, a lower bound on the minimax risk for metric graph reconstruction is Rn exp 2a min{b, 2 sin(α/2), ξ, 2πτ}n . (18) Proof We consider the 4 parameters separately. See Figure 7, ignoring the red lines representing the tubular noise that is not considered in this theorem. Shortest edge b. Consider the metric graph G1 consisting of a single edge of length 1+b and metric graph G2 with an edge of length 1 and an orthogonal edge of length b glued in the middle. The density on G1 is constructed in the following way: on the set G1\G2 of length b we set p1(x) = a and the rest of the mass is evenly distributed over the remaining portion of G1. Similarly, for G2 we set p2(x) = a on G2\G1, which correspond to the orthogonal edge of length b. We evenly spread the remaining mass. The two densities differ Statistical Analysis of Metric Graph Reconstruction only on the sets G1\G2 and G2\G1. Therefore TV(p1, p2) ab and, by Le Cam s lemma, Rn 1 8(1 ab)2n 1 8e 2abn for all b b0(a), where b0(a) is a constant depending on a. Smallest angle α. Now consider the metric graphs G3 and G4. G3 consists of two edges of length 2 forming an angle α and a third edge of length 1+2 sin(α/2) glued to the first two. G4 is similar: an edge of length 2 sin(α/2) is added to complete the triangle, while the edge on the left has length 1. As in the previous case we set p3(x) = a on G3\G4, p4(x) = a on G4\G3 and spread evenly the rest of the mass. The total variation distance is TV(p3, p4) 2a sin α 2 and, by Le Cam s lemma, Rn 1 8(1 2a sin (α/2))2n 1 8e 4a sin(α/2)n for all α α0(a), where α0(a) is a constant depending on a. Global reach ξ. We defined the global reach as the shortest Euclidean distance between two points that are far apart in the graph distance. Figure 7 shows metric graph G5 formed by a single edge of length 1 and metric graph G6 consisting of two edges of length 0.5, ξ apart from each other. Again, we set p5(x) = a on G5\G6, p6(x) = a on G6\G5 and evenly spread the rest. We obtain TV(p5, p6) aξ and, by Le Cam s lemma, Rn 1 8(1 aξ)2n 1 for all ξ ξ0(a), where ξ0(a) is a constant depending on a. Local reach τ. The local reach τ is the smallest reach of the edges forming the metric graph. Consider metric graphs G7 and G8. G7 consists of a loop of radius τ attached to an edge of length 1 and metric graph G8 is a single edge of length 1 + 2πτ. As in the previous cases p7(x) = a on G7\G8 and p8(x) = a on G8\G7. It follows that TV(p7, p8) 2aπτ and, by Le Cam s lemma, Rn 1 8(1 2aπτ)2n 1 8e 4aπτn for all τ τ0(a), where τ0(a) is a constant depending on a. For the tubular noise case we assume that σ is small enough to guarantee that Rn < 1, that is, the problem is not hopeless. In particular, we require that σ satisfies conditions (9) and (10) of Proposition 4, which can be combined into the following condition 0 < min ξ 3σ τ sin(α/2) + (τ σ) sin(α /2) 3/2 + [2 sin(α /4)] 1 , f(b, α, τ, ξ, σ) . (19) Theorem 7 Assume that σ is positive and satisfies condition (19). In the tubular noise case, for b b0(D), α α0(D), ξ ξ0(D), τ τ0(D), where b0(D), α0(D), ξ0(D) and τ0(D) are constants which depend on the ambient dimension D, a lower bound on the minimax risk for metric graph reconstruction is 8 exp 2 min{CD,1b, CD,2 sin(α/2), CD,3ξ, CD,4τ}n , (20) for some constants CD,1, CD,2, CD,3, CD,4. Proof As in the proof oh Theorem 6 we consider the 4 parameters separately. We compare the pairs of graphs shown in Figure 7, including the tubular regions constructed around them, from which we get samples uniformly. Shortest edge b. Consider the metric graph G1 consisting of a single edge of length 1+b and metric graph G2 with an edge of length 1 and an orthogonal edge of length b glued in the middle. Since vol(G1) > vol(G2), the density q1 at a point in the tube around G1 is lower than the density q2 at a point around G2. From the definition of total variation Lecci, Rinaldo and Wasserman TV = q1(H) q2(H) where H is the set where q1 > q2, the shaded area in Figure 7. Note that q2(H) = 0 and TV (q1, q2) = q1(H) = vol(H) vol(G1) CD,1 bσD 1 (1 + b)σD 1 CD,1b. By Le Cam s lemma, Rn 1 8(1 CD,1b)2n 1 8e 2CD,1bn for all b b0(D), where b0(D) is a constant depending on D. Smallest angle α. Now consider the metric graphs G3 and G4. Since vol(G3) > vol(G4), the density q3 at a point in the tube around G3 is lower than the density q4 at a point around G4. TV = q3(H) q4(H) where H is the set where q3 > q4, the shaded area in the tube around G3. Note that q4(H) = 0 and TV (q3, q4) = q3(H) = vol(H) vol(G3) CD,2 sin(α/2)σD 1 (1 + sin(α/2))σD 1 CD,2 sin(α/2). By Le Cam s lemma, Rn 1 8(1 CD,2 sin(α/2))2n 1 8e 2CD,2 sin(α/2)n for all α α0(D), where α0(D) is a constant depending on D. Global reach ξ. Figure 7 shows metric graph G5 formed by a single edge of length 1 and metric graph G6 consisting of two edges of length 0.5, ξ apart from each other. Since vol(G5) > vol(G6), the density q5 at a point in the tube around G5 is lower than the density q6 at a point around G6. TV = q5(H) q6(H) where H is the set where q5 > q6, the shaded area in the tube around G5. Note that q6(H) = 0 and TV (q5, q6) = q5(H) = vol(H) vol(G5) CD,3 ξσD 1 σD 1 = CD,3ξ. By Le Cam s lemma, Rn 1 8(1 CD,3ξ)2n 1 8e 2CD,3ξn for all ξ ξ0(D), where ξ0(D) is a constant depending on D. Local reach τ. The local reach τ is the smallest reach of the edges forming the metric graph. Consider metric graphs G7 and G8 in Figure 7. Since vol(G7) > vol(G8), the density q7 at a point in the tube around G7 is lower than the density q8 at a point around G8. TV = q7(H) q8(H) where H is the set where q7 > q8, the shaded area in the tube around G7. Note that q8(H) = 0 and TV (q7, q8) = q7(H) = vol(H) vol(G7) CD,4 τσD 1 (1 + τ)σD 1 CD,4τ. By Le Cam s lemma, Rn 1 8(1 CD,4τ)2n 1 8e 2CD,4τn for all τ τ0(D), where ξ0(D) is a constant depending on D. Note that, up to constants, the lower bound obtained in the tubular noise case is identical to the lower bound of Proposition 6 for the noiseless case. 4.2 Upper Bounds In this section we use the analysis of the performance of Algorithm 1 to derive an upper bound on the minimax risk. We will use the strategy of Niyogi et al. (2008) to find the Statistical Analysis of Metric Graph Reconstruction sample size that guarantees a δ/2-dense sample with high probability. We will use the following two lemmas. Lemma 8 (5.1 in Niyogi et al. 2008) Let {Ai} for i = 1, . . . , l be a finite collection of measurable sets and let µ be a probability measure on Sl i=1 Ai such that for all 1 i l, we have µ(Ai) > γ. Let x = {x1, . . . , xn} be a set of n i.i.d. draws according to µ. Then if log l + log 1 we are guaranteed that with probability > 1 λ, the following is true: i, x Ai = . Recall that the ϵ-covering number C(ϵ) of a set S is the smallest number of Euclidean balls of radius ϵ required to cover the set. The ϵ-packing number P(ϵ) is the maximum number of sets of the form B(x, ϵ) S, where x S, that may be packed into S without overlap. Lemma 9 (5.2 in Niyogi et al. 2008) For every ϵ > 0, P(2ϵ) C(2ϵ) P(ϵ). Combining Lemma 8 and Proposition 4, we obtain an upper bound on Rn for the noiseless case. Theorem 10 In the noiseless case (σ = 0), an upper bound on the minimax risk Rn is given by Rn 8 length(G) δ exp a δ n 4 length(G) 2 min ξ 2 sin(α/4) 3 sin(α/4) + 1 , τ sin(α/2) sin(α/4) sin(α/2) sin(α/4) + 3 sin(α/4) + 1 sin min{b, ατ} Proof In the noiseless case, Proposition 4 implies that the graph G can be reconstructed from a δ/2-dense sample Y if δ < min ξ 2 sin(α/4) 3 sin(α/4) + 1 , f(b, α, τ, ξ, 0) . (22) The value of δ selected in (21) satisfies condition (22), which follows from conditions (9) and (10), with σ = 0. We look for the sample size n that guarantees a δ/2-dense sample with high probability. Following the strategy in Niyogi et al. (2008), we consider a cover of the metric graph G by balls of radius δ/4. Let {xi : 1 i l} be the centers of such balls that constitute a minimal cover. We can choose Aδ/4 i = Bδ/4(xi) G. Applying Lemma 8 we find that the sample size that guarantees a correct reconstruction with probability at least 1 λ is 1 γ log l + log 1 Lecci, Rinaldo and Wasserman γ min i a length(Aδ/4 i ) length(G) aδ 4 length(G) , and we bound the covering number l in terms of the packing number, using Lemma 9: l length(G) mini length(Aδ/8 i ) 8 length(G) Therefore, from (23), if n = 4 length (G) log 8 length(G) we have a δ/2-dense sample with probability at least 1 λ and, by Proposition 4, P( b G G) λ. Rearranging we have the result. Note that, in the noiseless case, the upper and lower bounds are tight up to polynomial factors in the parameters τ, b, ξ. There is a small gap with respect to α; closing this gap is an open problem. In the tubular noise case, we assume that σ is small enough, to guarantee that Algorithm 1 correctly reconstructs a metric graph starting from a δ/2-dense sample. Theorem 11 Assume that σ satisfies condition (19) and 0 < σ < min{3τ/16, δ/8}, where δ = C0 min ξ 3σ τ sin(α/2) (τ σ) sin(α /2) 3/2 + [2 sin(α /4)] 1 , f(b, α, τ, ξ, σ) , (25) for some 0 < C0 < 1. Under the tubular noise model, an upper bound on the minimax risk Rn is given by Rn 16length(G) δ exp C Dδ(τ 8σ)n τ length(G) where C D is a constant depending on the ambient dimension. Proof Proposition 4 implies that the graph G can be reconstructed from a δ/2-dense sample Y if δ < min ξ 3σ τ sin(α/2) (τ σ) sin(α /2) 3/2 + [2 sin(α /4)] 1 , f(b, α, τ, ξ, σ) , (26) which is satisfied by the value of δ selected in (25). We look for the sample size n that guarantees a δ/2-dense sample in Gσ with high probability. We consider a cover of the metric graph G by Euclidean balls of radius δ/8. Let {xi : 1 i l} be the centers of such balls that constitute a minimal cover. Note that D-dimensional balls of radius δ/8+σ δ/4 centered at the same x is constitute a cover of the tubular region Gσ. We define Aδ/8+σ i = Bδ/8+σ(xi) Gσ. Applying Lemma 8 we find that the sample size Statistical Analysis of Metric Graph Reconstruction that guarantees a δ/2-dense sample in Gσ (and a correct topological reconstruction of G) with probability at least 1 λ is log l + log 1 γ = min i vol(Aδ/8+σ i ) vol(Gσ) . (28) Define Aδ i = Bδ(xi) G. The covering number l is bounded in terms of the packing number, using Lemma 9, l length(G) mini length( Aδ/16 i ) 16 length(G) We construct a lower bound on γ by deriving an upper bound on the denominator of (28) and a lower bound on the numerator. Upper bound on vol(Gσ). Let Nσ be the σ-covering number of G and let Cσ be the set of centers of this cover. By Lemma 9, Nσ is bounded by the σ/2-packing number. A simple volume argument gives Nσ Clength(G)/σ, for some constant C. Note that 2σ Ddimensional balls around each of the centers in Cσ cover Gσ. Thus vol(Gσ) v DNσ(2σ)D CDlength(G)σD 1 for some constant CD depending on the ambient dimension. Lower bound on vol(Aδ/8+σ i ), for all i. Let PA(σ) be the σ-packing number of Aδ/8 i and let DA be the set of centers of this packing. Then vol(Aδ/8+σ i ) PA(σ)v DσD, because the union of σ balls around DA is contained in Aδ/8+σ i . Let CA(2σ) be the 2σ-covering number of Aδ/8 i and let CA = {z1, . . . , z CA(2σ)} be the set of centers of this cover. By Lemma 9, PA(σ) CA(2σ) length( Aδ/8 i ) maxzj CA length(B2σ(zj) Aδ/8 i ) δ/8 maxzj CA length(B2σ(zj) Aδ/8 i ) and, since 2σ < 3τ/8, by Corollary 1.3 in Chazal (2013), max zj CA length(B2σ(zj) Aδ/8 i ) C2 for some constant C2. Thus γ PA(σ)v DσD CDlength(G)σD 1 C D δ(τ 8σ) τlength(G) , where C D is a constant depending on the ambient dimension. Finally, from (27), if n = τ length (G) log 16 length(G) Lecci, Rinaldo and Wasserman then the sample is δ/2-dense with probability at least 1 λ and P( b G G) λ. Rearranging we obtain Rn exp C Dδ(τ 8σ)n τ length(G) + log 16length(G) 5. Discussion In this paper, we presented a statistical analysis of metric graph reconstruction. We derived sufficient conditions on random samples from a graph metric space that guarantee topological reconstruction and we derived lower and upper bounds on the minimax risk for this problem. Various improvements and theoretical extensions are possible. In Proposition 4 we have analyzed Algorithm 1 using the Euclidean distance at every step. It is possible to obtain a similar result using a different notion of distance, for example, the distance induced by a Rips-Vietoris graph constructed on the sample. While in our analysis we mainly relied on the assumption of a dense sample, Aanjaneya et al. (2012) used the more refined but stronger assumption of the sample being an approximation of the metric graph, which we recall: given positive numbers ε and R, we say that (Y, d Y) is an (ε, R)-approximation of the metric space (G, d G) if there exists a correspondence C G Y such that (x, y), (x , y ) C, min(d G(x, x ), d Y(y, y )) R = d G(x, x ) d Y(y, y ) ε. (30) As shown in Aanjaneya et al. (2012), the (ε, R)-approximation assumption is sufficient, for appropriate choice of the parameters ε and R, to recover not only the topology of a metric graph (G, d G), but also its metric d G with high accuracy. However, when compared to the dense sample assumption, it demands a larger sample complexity to achieve accurate topological reconstruction. A strategy similar to the one used in this paper could be used to determine the sample size that guarantees an (ε, R)-approximation of the underlying metric graph with high probability. This would guarantee a correct topological reconstruction, as well as an approximation of the metric d G. We are also investigating the idea of combining metric graph reconstruction with the subspace constrained mean-shift algorithm (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002; Genovese et al., 2012b) to provide similar guarantees. Our preliminary results indicate that this mixed strategy works very well under more general noise assumptions and with relatively low sample size. Acknowledgments Research supported by NSF CAREER Grant DMS 114967, Air Force Grant FA95500910373, NSF Grant DMS-0806009. The authors thank the referees for helpful comments and suggestions. Statistical Analysis of Metric Graph Reconstruction Mridul Aanjaneya, Fr ed eric Chazal, Daniel Chen, Marc Glisse, Leonidas Guibas, and Dmitriy Morozov. Metric graph reconstruction from noisy data. International Journal of Computational Geometry & Applications, 22(04):305 325, 2012. Mahmuda Ahmed and Carola Wenk. Probabilistic street-intersection reconstruction from gps trajectories: approaches and challenges. In Proceedings of the Third ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data, pages 34 37. ACM, 2012. Ery Arias-Castro, Guangliang Chen, and Gilad Lerman. Spectral clustering based on local linear approximations. Electronic Journal of Statistics, 5:1537 1587, 2011. Giorgio A Ascoli, Duncan E Donohue, and Maryam Halavi. Neuromorpho. org: a central resource for neuronal morphologies. The Journal of Neuroscience, 27(35):9247 9251, 2007. Paul Bendich, Sayan Mukherjee, and Bei Wang. Towards stratification learning through homology inference. ar Xiv preprint ar Xiv:1008.3572, 2010. Paul Bendich, Bei Wang, and Sayan Mukherjee. Local homology transfer and stratification learning. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1355 1370. SIAM, 2012. Fr ed eric Chazal. An upper bound for the volume of geodesic balls in submanifolds of Euclidean spaces. Technical report, INRIA, January 2013. Fr ed eric Chazal and Andr e Lieutier. Topology guaranteeing manifold reconstruction using distance function to noisy data. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, pages 112 118. ACM, 2006. Fr ed eric Chazal and Jian Sun. Gromov-Hausdorffapproximation of metric spaces with linear structure. ar Xiv preprint ar Xiv:1305.1172, 2013. Fr ed eric Chazal, David Cohen-Steiner, and Andr e Lieutier. A sampling theory for compact sets in Euclidean space. Discrete & Computational Geometry, 41(3):461 479, 2009. Daniel Chen, Leonidas J Guibas, John Hershberger, and Jian Sun. Road network reconstruction for organizing paths. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1309 1320. SIAM, 2010. Alexey Chernov and Vitaliy Kurlin. Reconstructing persistent graph structures from noisy images. Electronic Journal Image-A, 3(5):19 22, 2013. Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603 619, 2002. Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418 491, 1959. Lecci, Rinaldo and Wasserman Keinosuke Fukunaga and Larry Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. Information Theory, IEEE Transactions on, 21(1):32 40, 1975. Xiaoyin Ge, Issam I Safa, Mikhail Belkin, and Yusu Wang. Data skeletonization via Reeb graphs. In Advances in Neural Information Processing Systems, pages 837 845, 2011. Christopher R Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263 1291, 2012a. Christopher R Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Nonparametric ridge estimation. ar Xiv preprint ar Xiv:1212.5156, 2012b. Attila I Guly as, Manuel Meg ıas, Zsuzsa Emri, and Tam as F Freund. Total number and ratio of excitatory and inhibitory synapses converging onto single interneurons of different types in the ca1 area of the rat hippocampus. The Journal of Neuroscience, 19(22):10082 10097, 1999. Peter Kuchment. Quantum graphs: I. Some basic structures. Waves in Random Media, 14 (1):107 128, 2004. Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419 441, 2008. Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008. Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423 435. Springer, 1997.