# Crystallization Learning with the Delaunay Triangulation

Jiaqi Gu and Guosheng Yin

Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR. Correspondence to: Jiaqi Gu, Guosheng Yin.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

Based on the Delaunay triangulation, we propose the crystallization learning to estimate the conditional expectation function in the framework of nonparametric regression. By conducting the crystallization search for the Delaunay simplices closest to the target point in a hierarchical way, the crystallization learning estimates the conditional expectation of the response by fitting a local linear model to the data points of the constructed Delaunay simplices. Instead of conducting the Delaunay triangulation for the entire feature space, which would encounter enormous computational difficulty, our approach focuses only on the neighborhood of the target point and thus greatly expedites the estimation for high-dimensional cases. Because the volumes of Delaunay simplices are adaptive to the density of feature data points, our method selects neighbor data points uniformly in all directions and thus is more robust to the local geometric structure of the data than existing nonparametric regression methods. We develop the asymptotic properties of the crystallization learning and conduct numerical experiments on both synthetic and real data to demonstrate the advantages of our method in estimation of the conditional expectation function and prediction of the response.

## 1. Introduction

Consider a regression model,

$$y_i = \mu(x_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $x_i$ is a $d$-dimensional feature point in the Euclidean space $\mathbb{R}^d$ ($n > d$), $y_i$ is the observed response, $\mu(\cdot) = E(Y \mid \cdot)$ is the conditional expectation function of the response $Y$, and $\epsilon_1, \ldots, \epsilon_n \in \mathbb{R}$ are independent and identically distributed (i.i.d.) random errors with $E(\epsilon_i) = 0$ and $E(\epsilon_i^2) < \infty$. Nonparametric regression is a collection of methods for estimating the conditional expectation function $\mu(\cdot)$ without rigid assumptions on its shape. Recent decades have witnessed extensive research in the field of nonparametric regression, including nearest-neighbor regression (Nadaraya, 1964; Watson, 1964; Cover & Hart, 1967; Benedetti, 1977; Stone, 1977; Altman, 1992), kernel regression (Priestley & Chao, 1972; Härdle & Gasser, 1984; Hein, 2009) and local linear regression (Cleveland, 1979; Cleveland & Devlin, 1988; Fan & Gijbels, 2018). Although the consistency of these methods has been established under mild conditions, their finite-sample performance is sensitive to the local geometric structure of the observed feature points. Because these methods consider only the distances from the target point $z$ to the observed feature points $x_1, \ldots, x_n$ when selecting neighbor data points or assigning weights, the directions from $z$ to its neighbors may not be uniformly distributed, especially when $z$ is close to the boundary of the convex hull of the feature points or to jump points of the feature data density. As a result, the (weighted) mean of the neighbor data points may be far from the target point $z$, leading to large bias in estimating the conditional expectation $\mu(z)$.
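As a concrete reference for the setting in (1), the following minimal sketch simulates data from the model with a hypothetical mean function and fits one of the classical baselines listed above, k-nearest-neighbor regression; the choice of $\mu$, the noise level, the sample size and $k$ are illustrative assumptions rather than values used in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Model (1): y_i = mu(x_i) + eps_i, with a hypothetical smooth mean function mu.
def mu(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

n, d = 500, 2
X = rng.normal(size=(n, d))                 # feature points x_1, ..., x_n in R^d
y = mu(X) + rng.normal(scale=0.3, size=n)   # i.i.d. errors with mean zero and finite variance

# One classical baseline discussed above: k-NN regression with the Euclidean distance.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
z = np.array([[0.0, 1.0]])                  # target point z
print("k-NN estimate:", knn.predict(z)[0], " true mu(z):", mu(z)[0])
```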
By incorporating the Delaunay triangulation (Delaunay, 1934) into the framework of nonparametric regression, we propose the crystallization learning, which mimics the crystallization process in thermodynamics and circumvents the curse-of-dimensionality issue in the Delaunay triangulation. Based on the DELAUNAYSPARSE algorithm (Chang et al., 2020), which locally constructs the Delaunay simplex $S(z)$ containing the target point $z$, we develop the crystallization search for the Delaunay simplices closest to $S(z)$ and estimate $\mu(z)$ by fitting a local linear model to the data points of the obtained Delaunay simplices. Via experiments on synthetic and real data, our method is shown to outperform the existing ones in estimating the conditional expectation function and predicting the response.

## 2. Methodology

### 2.1. Delaunay Interpolation

Let $X$ be a set of $n$ feature points $x_1, \ldots, x_n$ in the Euclidean space $\mathbb{R}^d$ ($n > d$). A $d$-dimensional triangulation of $X$, $T(X)$, is a mesh of $d$-simplices $\{S_1, \ldots, S_m\}$ satisfying:

1. For $j = 1, \ldots, m$, the set of $d + 1$ vertices of simplex $S_j$, denoted as $\mathcal{V}(S_j)$, is a subset of $X$ and does not lie in any affine hyperplane of $\mathbb{R}^d$.
2. For any $j \neq k$, simplices $S_j$ and $S_k$ are disjoint except on their shared boundary $S_j \cap S_k$.
3. The union $S_1 \cup \cdots \cup S_m$ is the convex hull of $X$, $H(X)$.

Figure 1. (a) Graphical illustration of the empty-ball property of the Delaunay triangulation; (b) the Delaunay triangulation; (c) a random triangulation.

As the $d$-simplices $S_1, \ldots, S_m$ of the triangulation $T(X)$ fully cover the convex hull $H(X)$, for each internal point $z \in H(X)$ there exists a simplex $S(z) \in T(X)$ such that $z \in S(z)$. Let $i_1(z), \ldots, i_{d+1}(z)$ denote the indices corresponding to the data points of $S(z)$; then there exist $d + 1$ values $\gamma_1, \ldots, \gamma_{d+1} \in [0, 1]$ such that $\sum_{k=1}^{d+1} \gamma_k x_{i_k(z)} = z$ and $\sum_{k=1}^{d+1} \gamma_k = 1$.

Among all triangulations, the Delaunay triangulation has been widely used for multivariate interpolation (de Berg et al., 2008) due to its smoothness property. Let $B_j$ be the open ball whose boundary is the circumscribed $(d-1)$-sphere of $S_j$. The Delaunay triangulation of $X$, denoted as $DT(X)$, is a triangulation of $X$ such that $B_j \cap X = \emptyset$ for $j = 1, \ldots, m$. This is known as the empty-ball property, as shown in Figure 1(a). As the geometric dual of the Voronoi diagram under the $L_2$ norm, the Delaunay triangulation generates a mesh of simplices that are most regularized in shape. In a 2-dimensional space, $X \subset \mathbb{R}^2$, the Delaunay triangulation $DT(X)$ maximizes the minimum angle over all the triangles (2-simplices) $S_1, \ldots, S_m$ among all possible triangulations (Sibson, 1978), as shown in Figure 1(b) and (c). The Delaunay triangulation $DT(X)$ is unique under the assumption that $X$ is in general position (Delaunay, 1934).

Given the data $\{(x_i, y_i) : i = 1, \ldots, n\}$ from model (1), the Delaunay interpolation aims to estimate the conditional expectation $\mu(z)$ for all $z \in H(X)$. Generally, there are three steps in the Delaunay interpolation: (i) construct the Delaunay triangulation $DT(X)$; (ii) find the simplex $S(z) \in DT(X)$; and (iii) obtain the estimator $\hat{\mu}(z)$ by optimizing a target function. Most Delaunay interpolation methods share the first two steps, while the difference mainly lies in the target function. For example, with $\gamma_1, \ldots, \gamma_{d+1} \in [0, 1]$ such that $\sum_{k=1}^{d+1} \gamma_k x_{i_k(z)} = z$ and $\sum_{k=1}^{d+1} \gamma_k = 1$, the estimator of de Berg et al. (2008) is

$$\hat{\mu}(z) = \sum_{k=1}^{d+1} \gamma_k y_{i_k(z)}, \qquad (2)$$

which is the minimizer of the squared loss function $\sum_{i=1}^{n} \{y_i - g(x_i)\}^2$ among all continuous piecewise linear functions $g(x) = \sum_{j=1}^{m} 1\{x \in S_j\}(\alpha_j + \beta_j^{T} x)$. Liu & Yin (2020) introduce a regularization function to balance the model fitting and the smoothness of the estimator.
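In low dimensions, the three steps above can be carried out directly with a full triangulation. The sketch below uses `scipy.spatial.Delaunay` to build $DT(X)$, locate $S(z)$, recover the barycentric weights $\gamma_1, \ldots, \gamma_{d+1}$, and evaluate the piecewise linear interpolant in (2); the data and target point are arbitrary, and this classical construction is exactly what becomes infeasible in high dimensions, motivating the local approach described next.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))          # feature points x_1, ..., x_n
y = np.sin(X[:, 0]) + X[:, 1]        # responses from an arbitrary illustrative mu
z = np.array([0.2, -0.3])            # target point, assumed to lie in H(X)

tri = Delaunay(X)                    # step (i): full Delaunay triangulation DT(X)
j = int(tri.find_simplex(z))         # step (ii): index of the simplex S(z) containing z
assert j >= 0, "z lies outside the convex hull H(X)"

# Barycentric coordinates gamma_1, ..., gamma_{d+1} of z within S(z).
T = tri.transform[j]                 # affine map stored by scipy for this simplex
g = T[:d].dot(z - T[d])              # first d barycentric coordinates
gamma = np.append(g, 1.0 - g.sum())  # the d+1 weights sum to one
vertex_idx = tri.simplices[j]        # indices i_1(z), ..., i_{d+1}(z)

# Step (iii): the piecewise linear Delaunay interpolant, estimator (2).
mu_hat = float(np.dot(gamma, y[vertex_idx]))
print("Delaunay interpolation estimate at z:", mu_hat)
```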
However, all the aforementioned methods require a complete construction of $DT(X)$ for the entire feature space, whose size (i.e., $m$) grows exponentially with the dimension $d$. As a result, no existing algorithm is feasible when $d > 7$ due to the limitations of computation time/power and memory space (Chang et al., 2020). Alternatively, several methods have been proposed for medium- to high-dimensional Delaunay interpolation (Chang et al., 2018a;b; 2020). Instead of obtaining the complete $DT(X)$, these methods only construct the Delaunay simplex $S(z)$ at each target point $z$ locally, so that $\mu(z)$ can be estimated at a polynomial cost. For any point $z \in H(X)$, the DELAUNAYSPARSE algorithm (Chang et al., 2020) first obtains a seed Delaunay simplex $S_{\text{seed}}$ close to $z$. Based on $S_{\text{seed}}$, Chang et al. (2020) find $S(z)$ via the breadth-first search described in Algorithm 1 and compute the estimator in (2).

Algorithm 1 DELAUNAYSPARSE (Chang et al., 2020)
1: Input: feature points $X$, target point $z \in H(X)$ and the seed Delaunay simplex $S_{\text{seed}}$.
2: Let $S_{\text{current}} = S_{\text{seed}}$, $A_{\text{Frontier}} = \{S_{\text{seed}}\}$, $A_{\text{Explored}} = \emptyset$.
3: while $z \notin S_{\text{current}}$ do
4:   Compute the set of facets of $S_{\text{current}}$ which are visible to $z$, denoted as $F_z(S_{\text{current}})$.
5:   for each facet $F \in F_z(S_{\text{current}})$ do
6:     Grow a new Delaunay simplex $S_{\text{new}} \neq S_{\text{current}}$ on the facet $F$ if it exists.
7:     $A_{\text{Frontier}} \leftarrow A_{\text{Frontier}} \cup \{S_{\text{new}}\}$ if $S_{\text{new}}$ exists and $S_{\text{new}} \notin A_{\text{Explored}} \cup A_{\text{Frontier}}$.
8:   end for
9:   $A_{\text{Explored}} \leftarrow A_{\text{Explored}} \cup \{S_{\text{current}}\}$.
10:  $A_{\text{Frontier}} \leftarrow A_{\text{Frontier}} \setminus \{S_{\text{current}}\}$.
11:  $S_{\text{current}} \leftarrow$ the first simplex in $A_{\text{Frontier}}$.
12: end while
13: Output: simplex $S_{\text{current}}$.

Note: a facet $F$ of the simplex $S_{\text{current}}$ is visible to $z$ if there exists an internal point $z'$ of $S_{\text{current}}$ such that the line segment from $z'$ to $z$ intersects $F$ (Chang et al., 2020).

Although such an approach is computationally efficient because $\gamma_1, \ldots, \gamma_{d+1}$ are simultaneously calculated, it only utilizes the information of the $d + 1$ data points $\{(x_{i_k(z)}, y_{i_k(z)}) : k = 1, \ldots, d + 1\}$ in the estimation. This may lead to overfitting and poor estimation when the simplex $S(z)$ has a small volume and a poorly regularized shape.

### 2.2. Crystallization Search for Delaunay Simplices

As one component of $DT(X)$, $S(z)$ has $d + 1$ facets $F_1, \ldots, F_{d+1}$, each of which is either a facet of $H(X)$ or the shared boundary of $S(z)$ and one neighbor Delaunay simplex.

Definition 1 (Neighbor Delaunay simplices). Given a set of points $X$ and the Delaunay triangulation $DT(X) = \{S_1, \ldots, S_m\}$, simplices $S_j$ and $S_k$ are neighbors if and only if the intersection $S_j \cap S_k$ is a shared facet of $S_j$ and $S_k$.

Inspired by Algorithm 1 (Chang et al., 2020), which searches for $S(z)$ by recursively growing neighbor Delaunay simplices on the facets of the explored ones, we develop the crystallization search (Algorithm 2) to construct all the Delaunay simplices within the topological distance $L$ to $S(z)$, denoted as $N_L(z)$.

Algorithm 2 Crystallization search
1: Input: feature points $X$, target point $z \in H(X)$ and topological distance $L$.
2: Compute $S(z)$ via Algorithm 1.
3: Let $A_{\text{Frontier}} = \{(S(z), 0)\}$ and $N_L(z) = \emptyset$.
4: while $A_{\text{Frontier}} \neq \emptyset$ do
5:   $(S_{\text{current}}, L_{\text{current}}) \leftarrow$ the first element in $A_{\text{Frontier}}$.
6:   if $L_{\text{current}} < L$ then
7:     Compute all the facets of $S_{\text{current}}$, denoted as $F_1, \ldots, F_{d+1}$.
8:     for $j = 1, \ldots, d + 1$ do
9:       Grow a new Delaunay simplex $S_{\text{new}} \neq S_{\text{current}}$ on the facet $F_j$ if it exists.
10:      $A_{\text{Frontier}} \leftarrow A_{\text{Frontier}} \cup \{(S_{\text{new}}, L_{\text{current}} + 1)\}$ if $S_{\text{new}}$ exists and $S_{\text{new}} \notin N_L(z) \cup A_{\text{Frontier}}$.
11:    end for
12:  end if
13:  $N_L(z) \leftarrow N_L(z) \cup \{S_{\text{current}}\}$.
14:  $A_{\text{Frontier}} \leftarrow A_{\text{Frontier}} \setminus \{(S_{\text{current}}, L_{\text{current}})\}$.
15: end while
16: Output: the set of Delaunay simplices $N_L(z)$.
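The sketch below mirrors the breadth-first structure of Algorithm 2 on a precomputed low-dimensional triangulation, using the `neighbors` table of `scipy.spatial.Delaunay` in place of the local simplex-growing step; in the paper the simplices are instead grown on demand via DELAUNAYSPARSE, so this should be read only as a small-$d$ illustration of which simplices end up in $N_L(z)$.

```python
import numpy as np
from collections import deque
from scipy.spatial import Delaunay

def crystallization_search(tri, z, L):
    """Return the indices of the simplices in N_L(z): all Delaunay simplices within
    topological distance L of the simplex S(z) containing z. Illustration only;
    it walks a precomputed triangulation instead of growing simplices locally."""
    s0 = int(tri.find_simplex(z))
    if s0 < 0:
        raise ValueError("z lies outside the convex hull H(X)")
    N_L = set()
    frontier = deque([(s0, 0)])              # pairs (simplex index, topological distance)
    seen = {s0}
    while frontier:
        s, dist = frontier.popleft()
        N_L.add(s)
        if dist < L:
            for s_new in tri.neighbors[s]:   # one neighbor per facet; -1 marks a hull facet
                s_new = int(s_new)
                if s_new != -1 and s_new not in seen:
                    seen.add(s_new)
                    frontier.append((s_new, dist + 1))
    return N_L

# Example usage on arbitrary 2-D data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
tri = Delaunay(X)
simplex_ids = crystallization_search(tri, np.array([0.0, 1.0]), L=2)
V_zL = sorted(set(tri.simplices[list(simplex_ids)].ravel()))   # indices of the points in V_{z,L}
print(len(simplex_ids), "simplices in N_L(z);", len(V_zL), "neighbor data points")
```

Because the loop is a breadth-first traversal, each simplex is recorded at its minimal topological distance from $S(z)$, matching the role of the frontier set $A_{\text{Frontier}}$ in Algorithm 2.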
Figures 2 and 3 display the crystallization search of $N_L(z)$ with respect to a target point $z \in H(X)$ for $L = 0, 1, \ldots, 5$ in $\mathbb{R}^2$ and $\mathbb{R}^3$, respectively. When $L = 0$, only the simplex $S(z)$ is constructed. As $L$ increases, Delaunay simplices are constructed in a hierarchical way, such that new simplices grow on the facets of the explored ones sequentially. The whole process of the crystallization search is analogous to the crystallization process in thermodynamics, where the search for $S(z)$ in line 2 of Algorithm 2 plays the role of nucleation and the remaining steps correspond to crystal growth.

Figure 2. Crystallization search of $N_L(z)$ with respect to a target point $z \in H(X)$ for $L = 0, 1, 2$ (top row) and $L = 3, 4, 5$ (bottom row) in $\mathbb{R}^2$.

Figure 3. Crystallization search of $N_L(z)$ with respect to a target point $z \in H(X)$ for $L = 0, 1, 2$ (top row) and $L = 3, 4, 5$ (bottom row) in $\mathbb{R}^3$.

### 2.3. Crystallization Learning

Let $\mathcal{V}_{z,L} = \bigcup_{S \in N_L(z)} \mathcal{V}(S)$ denote the set of all the data points of the simplices in $N_L(z)$. Based on the set $N_L(z)$ of Delaunay simplices topologically closest to the target point $z$, we propose the crystallization learning to estimate $\mu(z)$ by fitting a local linear model, $\mu(z) = \alpha + \beta^{T} z$, to all the data points in $\mathcal{V}_{z,L}$ instead of only the $d + 1$ data points of $S(z)$. In $DT(X)$, a vertex shared by more simplices typically has a larger degree in the network formed by Delaunay edges and thus is more informative about the geometric structure of $N_L(z)$. We estimate $\alpha$ and $\beta$ via the weighted least squares approach,

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{x_i \in \mathcal{V}_{z,L}} w_{z,L}(x_i)\,(y_i - \alpha - \beta^{T} x_i)^2, \qquad (3)$$

where the weight function is

$$w_{z,L}(x_i) = \frac{\sum_{S \in N_L(z)} 1\{x_i \in \mathcal{V}(S)\}}{\|x_i - z\|_2^2 \,/\, \bar{m}_L(z)}, \quad \text{with} \quad \bar{m}_L(z) = \frac{\sum_{x_i \in \mathcal{V}_{z,L}} \|x_i - z\|_2^2}{\sum_{i=1}^{n} 1\{x_i \in \mathcal{V}_{z,L}\}}.$$

Given the estimators $\hat{\alpha}$ and $\hat{\beta}$, we have $\hat{\mu}(z) = \hat{\alpha} + \hat{\beta}^{T} z$. Similar to the work of Nadaraya (1964) and Watson (1964), our weight function places more weight on the data points closer to $z$ as well as on those shared by more simplices in $N_L(z)$. For all $x_i \notin \mathcal{V}_{z,L}$, the weights are set to be zero. In addition, our weight function is scale-invariant due to the normalization term $\bar{m}_L(z)$, i.e., multiplying the features by any constant would not change the weights. As a result, the estimated conditional expectation function $\hat{\mu}(\cdot)$ is only piecewise smooth but not piecewise linear in $H(X)$, as demonstrated by Theorem 1, with the proof given in the supplementary materials.

Theorem 1. Let $X$ be a set of $n$ feature points $x_1, \ldots, x_n$ in general position and let the responses $y_1, \ldots, y_n$ be generated from model (1). The estimated conditional expectation function under crystallization learning, $\hat{\mu}(\cdot)$, is smooth in $S_k$, for $k = 1, \ldots, m$.
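A minimal sketch of the fitting step in (3), assuming the neighbor points $\mathcal{V}_{z,L}$ and their simplex-membership counts have already been collected from a crystallization search (for example, from the `crystallization_search` sketch above); the small `eps` guard against a data point coinciding with $z$ is a numerical assumption, not part of the formula.

```python
import numpy as np

def crystallization_fit(X_nb, y_nb, counts, z, eps=1e-12):
    """Weighted local linear fit (3) at the target point z.

    X_nb   : (k, d) array of the neighbor feature points in V_{z,L}
    y_nb   : (k,) array of their responses
    counts : (k,) array, number of simplices in N_L(z) whose vertex set contains each point
    Returns the estimate mu_hat(z) = alpha_hat + beta_hat^T z.
    """
    sq_dist = np.sum((X_nb - z) ** 2, axis=1)        # ||x_i - z||_2^2
    m_bar = sq_dist.sum() / len(X_nb)                # normalization term bar{m}_L(z)
    w = counts * m_bar / np.maximum(sq_dist, eps)    # closer / more-shared points get larger weight

    # Weighted least squares for (alpha, beta) in the local model alpha + beta^T x.
    A = np.hstack([np.ones((len(X_nb), 1)), X_nb])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y_nb * sw, rcond=None)
    alpha_hat, beta_hat = coef[0], coef[1:]
    return float(alpha_hat + beta_hat @ z)
```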
### 2.4. Selection of L

Similar to many other machine learning methods, the statistical complexity and estimation performance of the crystallization learning are controlled by the hyperparameter $L$, the maximal topological distance from the generated neighbor Delaunay simplices to $S(z)$. Because a small $L$ leads to overfitting and a large $L$ makes $\hat{\mu}(\cdot)$ overly smooth, we propose adopting the leave-one-out cross validation (LOO-CV) to select $L$ with respect to the target point $z$ as follows (a code sketch follows the steps below).

1. Compute the Delaunay simplex $S(z)$ containing $z$ and the values of $\gamma_1, \ldots, \gamma_{d+1} \in [0, 1]$ such that $\sum_{k=1}^{d+1} \gamma_k x_{i_k(z)} = z$ and $\sum_{k=1}^{d+1} \gamma_k = 1$ via Algorithm 1, where $x_{i_1(z)}, \ldots, x_{i_{d+1}(z)}$ are the $d + 1$ data points of $S(z)$.
2. For each $x_{i_k(z)} \in S(z)$, apply the crystallization learning with different candidate values of $L$ on the leave-one-out data excluding $(x_{i_k(z)}, y_{i_k(z)})$ to estimate $\mu(x_{i_k(z)})$. Let $\hat{\mu}(x_{i_k(z)}; L)$ be the estimator corresponding to the observation $(x_{i_k(z)}, y_{i_k(z)})$ and candidate value $L$.
3. Select the optimal $\tilde{L}$ as
$$\tilde{L} = \arg\min_{L} \sum_{k=1}^{d+1} \gamma_k \log\{\hat{\mu}(x_{i_k(z)}; L) - y_{i_k(z)}\}^2.$$
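The selection rule can be written compactly against a hypothetical helper `crystallization_estimate(X, y, z, L)` that returns the crystallization-learning estimate of $\mu(z)$ with hyperparameter $L$ (for instance, a combination of the search and fitting sketches above); the helper name and signature are assumptions for illustration.

```python
import numpy as np

def select_L(X, y, gamma, vertex_idx, candidate_Ls, crystallization_estimate):
    """Leave-one-out selection of L at a target point z (Section 2.4).

    gamma      : (d+1,) barycentric weights of z within S(z)
    vertex_idx : (d+1,) indices i_1(z), ..., i_{d+1}(z) of the vertices of S(z)
    crystallization_estimate(X, y, z, L) -> estimate of mu(z); hypothetical helper.
    """
    scores = []
    for L in candidate_Ls:
        cv = 0.0
        for g_k, i_k in zip(gamma, vertex_idx):
            keep = np.ones(len(X), dtype=bool)
            keep[i_k] = False                               # leave (x_{i_k(z)}, y_{i_k(z)}) out
            mu_hat = crystallization_estimate(X[keep], y[keep], X[i_k], L)
            cv += g_k * np.log((mu_hat - y[i_k]) ** 2)      # weighted log squared error
        scores.append(cv)
    return candidate_Ls[int(np.argmin(scores))]
```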
### 2.5. Computational Complexity

As shown by Chang et al. (2020), the average computational complexity of Algorithm 1 is $O(d^2 n)$. However, as Algorithm 1 is only implemented once, the dominant cost of Algorithm 2 lies in growing simplices (lines 4-15). In Algorithm 2, the number of generated Delaunay simplices is $O(d^L)$ and the average computational complexity of growing a new Delaunay simplex on a facet of $S_{\text{current}}$ is $O(n)$ with the rank-1 update suggested by Chang et al. (2020). Thus, the average computational complexity of Algorithm 2 is $O(d^L n)$. Table 1 shows the average runtime in computing $N_L(z)$ under different configurations.

Table 1. Average runtime (in seconds) in computing $N_L(z)$ under different values of the maximal topological distance $L$, sample size $n$ and dimension $d$.

| L | n=500, d=6 | n=500, d=8 | n=500, d=10 | n=1000, d=6 | n=1000, d=8 | n=1000, d=10 | n=2000, d=6 | n=2000, d=8 | n=2000, d=10 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.05 | 0.09 | 0.14 | 0.06 | 0.11 | 0.18 | 0.07 | 0.14 | 0.23 |
| 3 | 0.23 | 0.51 | 0.98 | 0.28 | 0.63 | 1.22 | 0.34 | 0.80 | 1.56 |
| 4 | 0.82 | 2.26 | 5.20 | 1.02 | 2.80 | 6.51 | 1.21 | 3.55 | 8.27 |

## 3. Connection with Other Nonparametric Regression Methods

Similar to two popular nonparametric regression methods, the nearest-neighbor regression and the local linear regression, our crystallization learning consists of three steps in estimating the conditional expectation $\mu(z)$ at the target point $z$: (i) selecting data points from $X$ as the neighbors of $z$ according to a specific criterion; (ii) assigning weights to the selected neighbor data points; and (iii) fitting a local model to the selected neighbor data points. As the crystallization learning and existing methods mainly differ in the first two steps, we compare our method with the $k$-nearest-neighbor ($k$-NN) regression and the local linear regression in selecting the neighbor data points. Our crystallization search of neighbor data points is based on the Delaunay triangulation, which is the geometric dual of the Voronoi diagram under the $L_2$ norm. As a result, we use the Euclidean distance in $k$-NN and the Gaussian kernel in the local linear regression.

To find the $k$ nearest neighbor data points, the $k$-NN regression computes and sorts the Euclidean distances from the target point $z$ to all the data points in $X$. This process can be visualized in the left panel of Figure 4, where a circle centered at $z$ is shown. The radius keeps increasing until $k$ observed data points fall inside or on the circle, and these are returned as the $k$ nearest neighbor points. Because only the Euclidean distance is considered, it is likely that the directions from $z$ to the $k$ nearest neighbor points are not uniformly distributed, especially when $z$ is close to the boundary of $H(X)$ or to jump points of the feature data density. The same is true for the local linear regression, where more weight is assigned in the direction toward the sample mean.

Figure 4. Neighbor data points of the target point $z$ selected by the $k$-NN regression with $k = 5, 10, 15, 20$ (left panel) and the crystallization learning with $L = 0, 1, 2, 3$ (right panel).

In contrast, the crystallization search identifies neighbor data points $\mathcal{V}_{z,L}$ by constructing Delaunay simplices, whose volumes are adaptive to the density of the observed feature data. As a result, the distances from $z$ to neighbor data points in $\mathcal{V}_{z,L}$ differ between high-density and low-density directions. This can be seen from Figure 4, where we generate $x_1, \ldots, x_{100} \in \mathbb{R}^2$ from the density function

$$f(x) \propto \prod_{j=1}^{2} \{1 + 0.6\,\mathrm{sign}(x_j)\} \exp(-x_j^2/2), \qquad (4)$$

and use different methods to select the neighbor data points of $z = (0, 1)^{T}$. The density function $f(x)$ is discontinuous at $z$, with higher density on its right-hand side than on its left-hand side. From the left panel of Figure 4, we can see that for all values of $k$, $k$-NN identifies more neighbor points on the right-hand side of $z$ than on the left-hand side. However, this is not the case for the crystallization learning, as exhibited in the right panel of Figure 4. As $L$ increases, the crystallization learning searches neighbor data points uniformly in all directions, implying its adaptation to the local geometric structure of the data. This can also be observed in Figure 5, where kernel density estimates (KDEs) of the distributions of the directions from the target point $z$ to its neighbor data points are plotted. The neighbor data points selected by $k$-NN and the weights assigned by the local linear regression concentrate in the direction toward the sample mean of $X$, while the KDEs of the crystallization search are much closer to the uniform distribution. As a result, the (weighted) means of the neighbor data points under the crystallization search are closer to the target point $z$ than those of the existing methods, as shown in Figure 6.

Figure 5. Kernel density estimates of the distributions of the directions from the target point $z$ to its neighbor data points using different methods with different hyperparameter values. The arrow indicates the direction from the target point $z$ to the sample mean of $X$.

Figure 6. Paths of the (weighted) means of neighbor data points by different methods as the value of the hyperparameter increases under feature data density (4).
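To make the directional-imbalance point concrete, the sketch below draws points from density (4), sampling each coordinate as a half-normal magnitude whose sign is positive with probability 0.8 (which matches the $1 + 0.6\,\mathrm{sign}(x_j)$ tilt), and counts how many of the $k$ nearest neighbors of $z = (0, 1)^{T}$ fall on its high-density right-hand side; the seed and $k$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Sample from density (4): each coordinate is |N(0,1)| with a positive sign
# with probability 0.8 (the right of 0 is the high-density side), else negative.
signs = np.where(rng.random((n, 2)) < 0.8, 1.0, -1.0)
X = signs * np.abs(rng.normal(size=(n, 2)))

z = np.array([0.0, 1.0])
k = 10
nearest = np.argsort(np.sum((X - z) ** 2, axis=1))[:k]   # k-NN by Euclidean distance

directions = X[nearest] - z
right_side = int(np.sum(directions[:, 0] > 0))
print(f"{right_side} of the {k} nearest neighbors lie to the right of z")
print("mean neighbor offset from z:", directions.mean(axis=0))
```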
## 4. Asymptotic Theory

We first study the asymptotic geometric properties of the Delaunay triangulation $DT(X)$ under a general distribution of feature points and then prove the consistency of the crystallization learning in estimating $\mu(\cdot)$. All proofs are given in the supplementary materials.

### 4.1. Asymptotic Geometric Properties

Let $X$ be a set of $n$ i.i.d. feature data points $x_1, \ldots, x_n \in \mathbb{R}^d$ from a density $f(x)$ which is bounded away from zero and infinity on $\mathbb{R}^d$.

Lemma 1. For any target point $z \in \mathbb{R}^d$, we have $P(z \in H(X)) \to 1$ as $n \to \infty$.

By Lemma 1, the target point $z$ falls inside $H(X)$ with asymptotic probability one, and thus we only need to consider the inside-hull case.

Theorem 2. For any target point $z \in H(X)$ and any $\rho \in (0, 1)$, we have $T(z) = O_p(n^{-\rho/d})$, where $T(z) = \max\{\|x_i - z\|_2 : x_i \in \mathcal{V}_{z,L}\}$ is the maximal $L_2$ distance between $z$ and its neighbor feature data points.

Theorem 2 implies that all the feature points of $\mathcal{V}_{z,L}$ converge to $z$ in probability.

### 4.2. Consistency

Theorem 3. Assume the data $\{(x_i, y_i) : i = 1, \ldots, n\}$ are generated from model (1), where the conditional expectation function $\mu(\cdot)$ is differentiable on $\mathbb{R}^d$. The estimated conditional expectation function under crystallization learning, $\hat{\mu}(\cdot)$, satisfies $E\{\hat{\mu}(z) - \mu(z)\}^2 \to R_{\min}$ as $n \to \infty$, for all $z \in H(X)$, where $R_{\min} = \inf_{g} E\{Y - g(x)\}^2$ is the minimal value of the $L_2$ risk over all continuous functions $g: \mathbb{R}^d \to \mathbb{R}$.

## 5. Numerical Experiments

We conduct experiments on synthetic data (i) to illustrate the effectiveness of the crystallization learning in estimating the conditional expectation function $\mu(\cdot)$; (ii) to evaluate the estimation accuracy of our approach in comparison with existing nonparametric regression methods, including the $k$-NN regression using the Euclidean distance, the local linear regression using the Gaussian kernel, the multivariate kernel regression using the Gaussian kernel (Hein, 2009) and Gaussian process models; and (iii) to validate the proposed data-driven procedure for selection of the hyperparameter $L$. We also apply our method to real data to investigate its empirical performance.

### 5.1. Synthetic Data

In the experiments on synthetic data, we consider two scenarios to investigate the performance of our crystallization learning: (1) general internal points of $H(X)$, and (2) jump points of the feature data density. For each scenario, we simulate 100 training datasets $\{(x_i, y_i) : i = 1, \ldots, n\}$ and randomly generate the corresponding sets of target points $\{z_1, \ldots, z_{100}\}$ under different values of $n$ and $d$. We use the mean squared error (MSE) under method $M$,

$$\mathrm{MSE}_M = \frac{1}{100} \sum_{k=1}^{100} \{\hat{\mu}_M(z_k) - \mu(z_k)\}^2,$$

to evaluate the accuracy of the estimator $\hat{\mu}_M(\cdot)$ at the target points $z_1, \ldots, z_{100} \in H(X)$.

Scenario 1 (General internal points): For each dataset, $x_1, \ldots, x_n$ are independently sampled from the multivariate normal distribution $\mathrm{MVN}(0, I_d)$ with the identity covariance matrix $I_d$. The responses $y_1, \ldots, y_n$ are generated from model (1) with the additive conditional mean

$$\mu(x) = \sum_{j=1}^{d} c_j g_j(x_j), \qquad (5)$$

where $x = (x_1, \ldots, x_d)^{T}$, $c_1, \ldots, c_d \sim N(0, 1)$, $g_j(\cdot) = \sum_{l=1}^{10} b_{jl}\,\phi(\cdot\,; \nu_{jl}, \sigma^2_{jl})$, $b_{jl} \sim N(0, 1)$, $\nu_{jl} \sim N(0, 1)$, $\sigma^2_{jl} \sim \mathrm{Gamma}(1, 1)$, and $\phi(\cdot\,; \nu_{jl}, \sigma^2_{jl})$ is the density of a normal distribution $N(\nu_{jl}, \sigma^2_{jl})$, for $j = 1, \ldots, d$ and $l = 1, \ldots, 10$. For $k = 1, \ldots, 100$, the target point $z_k$ is generated as $z_k = \sum_{i=1}^{n} \omega_{ik} x_i$ with $(\omega_{1k}, \ldots, \omega_{nk}) \sim \mathrm{Dirichlet}(1, \ldots, 1)$.

Scenario 2 (Jump points of the feature data density): For each dataset, $x_1, \ldots, x_n$ are sampled from the density

$$f(x) = 2^{-d} \prod_{j=1}^{d} \{1 + 0.4\,\mathrm{sign}(x_j)\} \exp(-|x_j|),$$

which has jumps at the point set $\{x \in \mathbb{R}^d : \prod_{j=1}^{d} x_j = 0\}$. The responses $y_1, \ldots, y_n$ are generated from the same additive model in (5). For $k = 1, \ldots, 100$, the target point $z_k$ is generated as $z_k = (\sum_{i=1}^{n} \omega_{ik} x_i) \circ s_k$, where $(\omega_{1k}, \ldots, \omega_{nk}) \sim \mathrm{Dirichlet}(1, \ldots, 1)$, $\circ$ is the elementwise multiplication operator, $s_k = (s_{k1}, \ldots, s_{kd})^{T}$ and $s_{k1}, \ldots, s_{kd} \sim \mathrm{Bernoulli}(0.7)$.
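The following sketch generates one Scenario 1 training dataset and its target points according to the specification above; the noise standard deviation added on top of the additive mean (5) is an illustrative assumption, since the error scale is not restated here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, d = 500, 5

# Random additive mean (5): mu(x) = sum_j c_j g_j(x_j), with g_j a mixture of 10 normal densities.
c = rng.normal(size=d)                                   # c_j ~ N(0, 1)
b = rng.normal(size=(d, 10))                             # b_jl ~ N(0, 1)
nu = rng.normal(size=(d, 10))                            # nu_jl ~ N(0, 1)
sigma2 = rng.gamma(shape=1.0, scale=1.0, size=(d, 10))   # sigma^2_jl ~ Gamma(1, 1)

def mu(X):
    # g_j(t) = sum_l b_jl * phi(t; nu_jl, sigma^2_jl), evaluated coordinate-wise.
    total = np.zeros(len(X))
    for j in range(d):
        g_j = np.sum(b[j] * norm.pdf(X[:, [j]], loc=nu[j], scale=np.sqrt(sigma2[j])), axis=1)
        total += c[j] * g_j
    return total

# Scenario 1: features from MVN(0, I_d); responses from model (1) with mean (5).
X = rng.normal(size=(n, d))
y = mu(X) + rng.normal(scale=0.5, size=n)    # noise scale 0.5 is an assumption for illustration

# Target points: convex combinations of the rows of X with Dirichlet(1, ..., 1) weights.
W = rng.dirichlet(np.ones(n), size=100)      # 100 weight vectors
Z = W @ X                                    # z_k = sum_i omega_ik x_i
```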
In both scenarios, we apply the crystallization learning and the existing methods to estimate $\mu(z) = \sum_{j=1}^{d} c_j g_j(z_j)$ at the target points $z_1, \ldots, z_{100}$. We implement the crystallization learning with $L = 3$ for $d = 5, 10$ and $L = 2$ for $d = 20, 50$, and obtain $\hat{\mu}(z_1), \ldots, \hat{\mu}(z_{100})$. We implement the $k$-NN regression with $k = 5, 10, k^*$, where $k^*$ equals the size of $\mathcal{V}_{z,L}$, and the local linear regression and kernel regression with bandwidth $h = 1$. Table 2 presents the estimation results averaged over 100 simulations under the two scenarios with different values of $n$ and $d$.

Table 2. Averaged values of log(MSE) and standard deviations in parentheses using crystallization learning (CL) in comparison with $k$-NN ($k = 5, 10, k^*$, where $k^*$ equals the size of $\mathcal{V}_{z,L}$), local linear (LL) regression, kernel regression (KR) and Gaussian process (GP) regression in estimating $\mu(\cdot)$ under two scenarios, different sample sizes ($n$) and different dimensions of the feature space ($d$). The remaining columns report $\log(\mathrm{MSE}_M/\mathrm{MSE}_{\mathrm{CL}})$ for each competing method $M$.

Scenario 1 (General internal points):

| d | n | log(MSE_CL) | 5-NN | 10-NN | k*-NN | LL | KR | GP |
|---|---|---|---|---|---|---|---|---|
| 5 | 200 | -1.11 (0.21) | 0.23 (0.09) | 0.12 (0.09) | 0.33 (0.11) | 0.56 (0.11) | 0.57 (0.11) | 0.24 (0.18) |
| 5 | 500 | -2.13 (0.18) | 0.55 (0.13) | 0.37 (0.11) | 0.45 (0.13) | 0.91 (0.17) | 0.94 (0.17) | 0.76 (0.18) |
| 5 | 1000 | -2.04 (0.18) | 0.53 (0.13) | 0.42 (0.13) | 0.62 (0.12) | 1.18 (0.19) | 1.22 (0.19) | 0.41 (0.20) |
| 5 | 2000 | -2.21 (0.20) | 0.48 (0.14) | 0.38 (0.14) | 0.59 (0.16) | 1.06 (0.22) | 1.08 (0.21) | 0.81 (0.17) |
| 10 | 200 | -0.03 (0.16) | 0.28 (0.09) | 0.13 (0.07) | 0.14 (0.08) | 0.10 (0.07) | 0.12 (0.07) | -0.08 (0.14) |
| 10 | 500 | 0.01 (0.21) | 0.43 (0.13) | 0.31 (0.10) | 0.29 (0.11) | 0.47 (0.12) | 0.47 (0.12) | -0.01 (0.17) |
| 10 | 1000 | -0.50 (0.22) | 0.37 (0.14) | 0.30 (0.12) | 0.43 (0.10) | 0.54 (0.12) | 0.53 (0.12) | -0.09 (0.21) |
| 10 | 2000 | -0.67 (0.20) | 0.42 (0.13) | 0.33 (0.12) | 0.51 (0.11) | 0.59 (0.16) | 0.60 (0.16) | 0.10 (0.14) |
| 20 | 200 | 1.46 (0.14) | 0.14 (0.08) | -0.02 (0.06) | -0.01 (0.06) | -0.02 (0.03) | -0.04 (0.06) | 0.17 (0.15) |
| 20 | 500 | 1.09 (0.15) | 0.25 (0.10) | 0.11 (0.07) | -0.01 (0.07) | -0.07 (0.06) | -0.03 (0.06) | -0.18 (0.16) |
| 20 | 1000 | 0.92 (0.18) | 0.48 (0.11) | 0.36 (0.10) | 0.00 (0.11) | -0.10 (0.08) | -0.02 (0.08) | 0.22 (0.18) |
| 20 | 2000 | 0.73 (0.22) | 0.24 (0.15) | 0.24 (0.12) | 0.06 (0.11) | 0.18 (0.11) | 0.14 (0.11) | 0.15 (0.19) |
| 50 | 500 | 2.47 (0.14) | 0.08 (0.09) | -0.02 (0.07) | 0.02 (0.05) | -0.01 (0.03) | -0.08 (0.11) | 0.06 (0.19) |
| 50 | 1000 | 2.32 (0.17) | 0.08 (0.12) | -0.02 (0.10) | 0.04 (0.06) | -0.03 (0.03) | -0.13 (0.12) | -0.22 (0.18) |
| 50 | 2000 | 2.12 (0.17) | 0.17 (0.13) | 0.18 (0.10) | -0.01 (0.06) | 0.02 (0.04) | 0.00 (0.11) | -0.08 (0.19) |

Scenario 2 (Jump points of the feature data density):

| d | n | log(MSE_CL) | 5-NN | 10-NN | k*-NN | LL | KR | GP |
|---|---|---|---|---|---|---|---|---|
| 5 | 200 | -0.72 (0.17) | 0.34 (0.05) | 0.33 (0.04) | 0.51 (0.06) | 0.60 (0.07) | 0.70 (0.07) | 0.32 (0.10) |
| 5 | 500 | -1.46 (0.15) | 0.42 (0.05) | 0.31 (0.05) | 0.44 (0.06) | 0.92 (0.09) | 1.03 (0.09) | 0.59 (0.11) |
| 5 | 1000 | -1.94 (0.13) | 0.48 (0.06) | 0.21 (0.05) | 0.33 (0.07) | 0.99 (0.10) | 1.11 (0.10) | 0.92 (0.11) |
| 5 | 2000 | -1.87 (0.17) | 0.46 (0.05) | 0.26 (0.05) | 0.33 (0.06) | 1.43 (0.11) | 1.53 (0.11) | 1.10 (0.11) |
| 10 | 200 | 0.59 (0.12) | 0.08 (0.05) | 0.03 (0.04) | 0.17 (0.04) | 0.09 (0.03) | 0.13 (0.03) | 0.14 (0.09) |
| 10 | 500 | 0.44 (0.14) | 0.18 (0.04) | 0.08 (0.04) | 0.05 (0.04) | 0.09 (0.04) | 0.15 (0.04) | -0.07 (0.08) |
| 10 | 1000 | 0.27 (0.11) | 0.18 (0.05) | 0.11 (0.04) | 0.18 (0.04) | 0.29 (0.05) | 0.38 (0.05) | -0.11 (0.07) |
| 10 | 2000 | 0.02 (0.13) | 0.23 (0.04) | 0.11 (0.04) | 0.17 (0.04) | 0.43 (0.05) | 0.49 (0.05) | -0.12 (0.07) |
| 20 | 200 | 1.92 (0.12) | 0.08 (0.04) | 0.03 (0.03) | 0.02 (0.02) | -0.01 (0.01) | -0.04 (0.03) | 0.04 (0.07) |
| 20 | 500 | 1.77 (0.10) | 0.14 (0.05) | 0.01 (0.03) | -0.02 (0.03) | -0.01 (0.04) | -0.02 (0.02) | -0.07 (0.07) |
| 20 | 1000 | 1.68 (0.13) | 0.08 (0.05) | 0.02 (0.03) | -0.05 (0.03) | -0.04 (0.02) | -0.03 (0.03) | -0.09 (0.06) |
| 20 | 2000 | 1.50 (0.12) | 0.11 (0.05) | 0.06 (0.03) | 0.08 (0.03) | 0.02 (0.02) | 0.09 (0.03) | -0.11 (0.07) |
| 50 | 500 | 2.85 (0.09) | 0.16 (0.06) | 0.05 (0.04) | -0.01 (0.04) | 0.09 (0.03) | 0.14 (0.06) | -0.04 (0.08) |
| 50 | 1000 | 2.90 (0.09) | 0.20 (0.05) | 0.08 (0.04) | -0.03 (0.02) | 0.03 (0.02) | 0.19 (0.06) | -0.10 (0.07) |
| 50 | 2000 | 2.82 (0.10) | 0.15 (0.04) | 0.08 (0.03) | -0.01 (0.01) | -0.01 (0.01) | 0.10 (0.04) | -0.12 (0.07) |

For each $d$, the estimation accuracy of the crystallization learning improves as the sample size increases, indicating its consistency in estimating $\mu(\cdot)$ in the convex hull $H(X)$.
For the lower-dimensional cases ($d = 5, 10$), the crystallization learning generally outperforms the existing methods, demonstrating that our approach is more efficient. For the higher-dimensional cases ($d = 20, 50$), although our method does not completely dominate the existing ones, the performances of the different approaches are comparable. The results under Scenario 2 suggest the robustness of our method to variation or sudden changes in the feature data density. Overall, the crystallization learning performs well and is stable in estimating $\mu(\cdot)$ at general internal points of $H(X)$ and at jump points of the feature data density.

### 5.2. Real Data Application

We apply the crystallization learning to several real datasets from the UCI repository. The critical assessment of protein structure prediction (CASP) dataset (Betancourt & Skolnick, 2001; https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure) contains experimental records on protein structure prediction. The CASP dataset includes 45730 records of 9 features, where the response is the root mean squared deviation (RMSD) of the residues. The Concrete dataset (Yeh, 1998; https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) consists of 1030 experimental records of concrete compressive strength measurement. We use the content of 7 concrete ingredients and the age of a concrete sample to predict its compressive strength. The Parkinson's telemonitoring dataset (Tsanas et al., 2010; https://archive.ics.uci.edu/ml/datasets/Parkinsons) is composed of 5875 voice recordings of 16 biomedical voice measures from 42 patients with early-stage Parkinson's disease in a six-month trial. We use these 16 biomedical voice measures to predict the motor and total UPDRS (unified Parkinson's disease rating scale) scores.

For each dataset, we take 100 bootstrap samples without replacement of size $n$ ($n = 200, 500, 1000$ or $2000$) for training and 100 bootstrap samples of size 100 for testing. To eliminate the impact of feature correlations and scales, we standardize the principal components of the features in the training set and take them as the feature points $x_1, \ldots, x_n$. The same transformation is applied to the features in the testing set to obtain the target points $z_1, \ldots, z_{100}$. We take $L = 3$ for the crystallization learning, and implement the $k$-NN regression with $k = 5, 10, k^*$, where $k^*$ equals the size of $\mathcal{V}_{z,L}$, and the local linear regression and kernel regression with bandwidth $h = 1$. Based on the testing set, we quantify the performance of method $M$ by the mean predictive squared error (MPSE),

$$\mathrm{MPSE}_M = \frac{1}{100} \sum_{k=1}^{100} \{\hat{\mu}_M(z_k) - y_k\}^2,$$

where the $y_k$'s are the responses corresponding to the $z_k$'s.

Figure 7. Boxplots of $\log(\mathrm{MPSE}_M/\mathrm{MPSE}_{\mathrm{CL}})$ corresponding to $k$-NN ($k = 5, 10, k^*$, where $k^*$ equals the size of $\mathcal{V}_{z,L}$), local linear (LL) regression, kernel regression (KR) and Gaussian process (GP) regression in estimating $\mu(\cdot)$ under different datasets and sizes of the training set ($n$): (a) CASP dataset; (b) Concrete dataset; (c) Parkinson's motor UPDRS; (d) Parkinson's total UPDRS.

Figure 7 shows the comparison results averaged over 100 bootstrap samples between our method and the existing ones under different datasets and sizes of the training set ($n$). It is clear that as $n$ increases, the advantage of the crystallization learning over the existing methods becomes more pronounced. Overall, the crystallization learning dominates all the existing methods in most of the cases.
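A sketch of the preprocessing and evaluation loop described above, under stated assumptions: a generic `fit_predict(X_train, y_train, Z_test)` stands in for the crystallization learning or any competing method, scikit-learn's `PCA` and `StandardScaler` are used for the principal-component standardization, and a single train/test subsample is evaluated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def mpse(fit_predict, X_raw, y_raw, n_train=500, n_test=100, seed=0):
    """Mean predictive squared error for one train/test subsample (Section 5.2).

    fit_predict(X_train, y_train, Z_test) -> predictions; hypothetical stand-in
    for crystallization learning or any competing regression method.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_raw))
    train, test = idx[:n_train], idx[n_train:n_train + n_test]

    # Standardized principal components of the training features; the same
    # transformation is then applied to the test features.
    pca = PCA().fit(X_raw[train])
    scaler = StandardScaler().fit(pca.transform(X_raw[train]))
    X_train = scaler.transform(pca.transform(X_raw[train]))
    Z_test = scaler.transform(pca.transform(X_raw[test]))

    preds = fit_predict(X_train, y_raw[train], Z_test)
    return float(np.mean((preds - y_raw[test]) ** 2))
```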
### 5.3. Selection of L

To examine the data-driven selection procedure for $L$ proposed in Section 2.4, we conduct experiments under Scenario 1 of Section 5.1. With candidate values $L = 1, \ldots, 8$ and $d = 5$, we simulate 100 training datasets with sample sizes $n = 200, 500, 1000, 2000$, respectively, and generate the corresponding sets of target points. Figure 8 shows the estimation results of our method averaged over 100 simulations under different sample sizes, when using different candidate values of $L$ and the selected $\tilde{L}$. It is clear that as the sample size $n$ increases, the optimal value of $L$, which results in the smallest averaged value of $\log(\mathrm{MSE}_L)$, increases. This is reasonable because a larger $n$ leads to smaller volumes of the simplices in $N_L(z)$, and thus a larger $L$ is needed for more accurate estimation. In addition, the averaged value of $\log(\mathrm{MSE}_{\tilde{L}})$ is closer to the smallest averaged value of $\log(\mathrm{MSE}_L)$ when $n$ is larger, suggesting the effectiveness of our LOO-CV procedure in improving the estimation accuracy.

Figure 8. Averaged values of $\log(\mathrm{MSE}_L) - \overline{\log(\mathrm{MSE})}$ ($L = 1, \ldots, 8$) and $\log(\mathrm{MSE}_{\tilde{L}}) - \overline{\log(\mathrm{MSE})}$ under different sample sizes: (a) $n = 200$; (b) $n = 500$; (c) $n = 1000$; (d) $n = 2000$. Here $\mathrm{MSE}_L$ is the MSE using the hyperparameter $L$ and $\overline{\log(\mathrm{MSE})} = \sum_{L=1}^{8} \log(\mathrm{MSE}_L)/8$.

## 6. Conclusions

The Delaunay triangulation is a powerful tool to partition the feature space in a data-driven way, which has the least roughness for smooth surface reconstruction. We incorporate the Delaunay triangulation into the framework of nonparametric regression and develop the crystallization learning procedure. Without the need to triangulate the entire feature space, which becomes infeasible for high-dimensional cases, our method conducts the Delaunay triangulation locally at each specific target point, like crystal growth. The conditional expectation $\mu(z)$ at the target point $z \in H(X)$ is estimated by fitting a local linear model to the data points of the Delaunay simplices identified by the crystallization search. Compared with existing nonparametric regression methods, our method is more adaptive to the local geometric structure of the data: it selects the neighbor data points uniformly in all directions, and their weighted mean is closer to the target point $z$. Both theoretical studies and numerical experiments show that the crystallization learning is consistent in estimating $\mu(\cdot)$ and generally outperforms the existing methods.

Given that the crystallization learning is a local approach, it is possible to combine it with the uniform design and develop a global version of crystallization learning in a hierarchical way. As our method searches $N_L(z)$ in a deterministic way, we can also develop a stochastic crystallization search to reduce the boundary effect on the estimated conditional expectation function $\hat{\mu}(\cdot)$. Other possible extensions include extrapolation of $\mu(z)$ at $z \notin H(X)$ with the Möbius transformation (Zhou et al., 2019), regression problems in other metric spaces (e.g., manifold regression and structured output) as discussed in Hein (2009), and online regression problems (Kuzborskij & Cesa-Bianchi, 2017).

## Acknowledgement

We thank the four anonymous reviewers for insightful suggestions that have significantly improved the paper. This research was supported by funding from the Research Grants Council of Hong Kong (17308420) and TCL Corporate Research (Hong Kong).

## References
Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175-185, 1992.

Benedetti, J. K. On the nonparametric estimation of regression functions. Journal of the Royal Statistical Society: Series B (Methodological), 39(2):248-253, 1977.

Betancourt, M. R. and Skolnick, J. Universal similarity measure for comparing protein structures. Biopolymers, 59(5):305-309, 2001.

Chang, T. H., Watson, L. T., Lux, T. C. H., Bernard, J., Li, B., Xu, L., Back, G., Butt, A. R., Cameron, K. W., and Hong, Y. Predicting system performance by interpolation using a high-dimensional Delaunay triangulation. In Proceedings of the High Performance Computing Symposium, HPC '18, San Diego, CA, USA, 2018a. Society for Computer Simulation International.

Chang, T. H., Watson, L. T., Lux, T. C. H., Li, B., Xu, L., Butt, A. R., Cameron, K. W., and Hong, Y. A polynomial time algorithm for multivariate interpolation in arbitrary dimension via the Delaunay triangulation. In Proceedings of the ACMSE 2018 Conference, ACMSE '18. ACM Press, 2018b.

Chang, T. H., Watson, L. T., Lux, T. C. H., Butt, A. R., Cameron, K. W., and Hong, Y. Algorithm 1012: DELAUNAYSPARSE: Interpolation via a sparse subset of the Delaunay triangulation in medium to high dimensions. ACM Transactions on Mathematical Software, 46(4):1-20, 2020.

Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829-836, 1979.

Cleveland, W. S. and Devlin, S. J. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596-610, 1988.

Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, 1967.

de Berg, M., Cheong, O., van Kreveld, M., and Overmars, M. Delaunay triangulations. In Computational Geometry, pp. 191-218. Springer Berlin Heidelberg, 2008.

Delaunay, B. Sur la sphère vide. Bulletin de l'Académie des Sciences de l'URSS, Classe des sciences mathématiques et naturelles, 6:793-800, 1934.

Fan, J. and Gijbels, I. Local Polynomial Modelling and Its Applications. Routledge, 2018.

Härdle, W. and Gasser, T. Robust non-parametric function fitting. Journal of the Royal Statistical Society: Series B (Methodological), 46(1):42-51, 1984.

Hein, M. Robust nonparametric regression with metric-space valued output. In Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.

Kuzborskij, I. and Cesa-Bianchi, N. Nonparametric online regression while learning the metric. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Liu, Y. and Yin, G. The Delaunay triangulation learner and its ensembles. Computational Statistics & Data Analysis, 152:107030, 2020.

Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications, 9(1):141-142, 1964.

Priestley, M. B. and Chao, M. T. Non-parametric function fitting. Journal of the Royal Statistical Society: Series B (Methodological), 34(3):385-392, 1972.

Sibson, R. Locally equiangular triangulations. The Computer Journal, 21(3):243-245, 1978.

Stone, C. J. Consistent nonparametric regression. The Annals of Statistics, 5(4):595-620, 1977.

Tsanas, A., Little, M., McSharry, P., and Ramig, L. Accurate telemonitoring of Parkinson's disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4):884-893, 2010.

Watson, G. S. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4):359-372, 1964.
Yeh, I.-C. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797-1808, 1998.

Zhou, Z., Tan, S., Xu, Z., and Li, P. Möbius transformation for fast inner product search on graph. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.