Journal of Machine Learning Research 25 (2024) 1-36 Submitted 11/21; Revised 12/23; Published 4/24

Spectral learning of multivariate extremes

Marco Avella Medina marco.avella@columbia.edu
Department of Statistics, Columbia University, New York, NY, USA

Richard A. Davis rdavis@stat.columbia.edu
Department of Statistics, Columbia University, New York, NY, USA

Gennady Samorodnitsky gs18@cornell.edu
School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA

Editor: Aryeh Kontorovich

Abstract

We propose a spectral clustering algorithm for analyzing the dependence structure of multivariate extremes. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory. Our work studies the theoretical performance of spectral clustering based on a random k-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.

Keywords: Angular measure, heavy tails, Laplacian, nearest neighbor graphs, regular variation, spectral clustering

(c) 2024 Marco Avella Medina, Richard A. Davis and Gennady Samorodnitsky. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1367.html.

1. Introduction

Multivariate extremes arise when one or more rare extreme events occur simultaneously. They are of paramount importance for understanding environmental risks such as fires or droughts, since these are driven by joint extremes of a number of meteorological variables. Similarly, catastrophic financial events are also of a multivariate nature in financial systems driven by core institutions that are connected. In the above examples one is precisely interested in modeling the dependence between rare individual extremes. Multivariate extreme value theory is an active research area that provides tools for modeling such events. The dependence structure between extreme observations can be complex and is typically characterized by different notions of dependence from the ones arising in the non-extreme world. For this reason, recent work has sought to rethink various notions of sparsity for extremes (Goix et al., 2017; Meyer and Wintenberger, 2021; Simpson et al., 2020), concentration inequalities (Goix et al., 2015; Clémençon et al., 2023), conditional independence (Belkin and Niyogi, 2003) and unsupervised learning (Chautru, 2015; Cooley and Thibaud, 2019; Janßen and Wan, 2020; Drees and Sabourin, 2021). See also Engelke and Ivanovs (2021) for a review of recent developments in the literature of multivariate extremes. Much of this line of research tries to connect important ideas from modern statistics and machine learning to the context of multivariate extremes.
Our work falls in this category as we propose spectral clustering as a tool for learning the dependence structure of multivariate extremes. Spectral clustering (Von Luxburg, 2007) and related techniques are very popular and have found success in various applications such as parallel computing (Hendrickson and Leland, 1995; Van Driessche and Roose, 1995), image segmentation (Shi and Malik, 2000) and community detection (Rohe et al., 2011; Lei and Rinaldo, 2015; Zhou and Amini, 2019). The central idea of spectral clustering is to use the eigenvectors of the graph Laplacian matrix constructed from an affinity graph between sample points in order to find clusters in the data. Typically these are obtained by a K-means algorithm that take these graph Laplacian eigenvectors as input. We follow this same principle but use as input to our algorithm the angular parts of the observations whose norms exceed a certain large threshold i.e., a standard spectral clustering algorithm is applied to the graph built over the angular parts of these extreme observations. Because of the nature of the extreme events that we study, we leverage tools from multivariate extreme value theory for analyzing the theoretical properties of our spectral clustering algorithm. In particular, we use multivariate regular variation as a modeling tool since it is closely connected to asymptotic characterizations of multivariate extreme value distributions (Resnick, 2007, 2018). While a precise definition of regular variation is provided in Section 2, the basic idea is that a d-dimensional random vector X is regularly varying if the distribution of the angular part X/ X stabilizes (i.e., converges in distribution) as the radial part X becomes large and that the radial part has Pareto-like tails. The dependence structure is then governed by the asymptotic distribution of the limiting angular part. In this paper, we consider clustering of the angular parts, which live on a d-dimensional unit sphere, of observations with large radii. Learning this measure is challenging because of its multivariate nature and because only a small fraction of the data is considered to be extremes, i.e., those observations whose radii are sufficiently large, are retained for estimation. In contrast, standard modeling approaches built on parametric models are hard to extend to larger dimensions because of their lack of flexibility and computational complexity Davison and Huser (2015). We will explore the use of spectral clustering for learning the angular measure. The performance of the algorithm critically depends on the properties of the random graph that it takes as input. We will focus on k-nearest neighbor graphs and hence a decision has to be made about the size of k for constructing the random graph. In this work we study this question by focusing on a linear factor model. We characterize the asymptotic distribution of the multivariate extremes generated from this model and show that their dependence structure is captured by a discrete angular measure in the limit. We establish a rate of convergence for the angular components of the extremes to their discrete limits. This is a key step in deriving a theoretically valid range of numbers of k-nearest neighbors for Spectral learning of multivariate extremes constructing a nearest neighbor graph that one should consider in order to guarantee that spectral clustering can be successfully used to learn the asymptotic angular measure. 
From a methodological perspective, the work of Janßen and Wan (2020) is perhaps the closest to our approach since they also provide a clustering algorithm for extremes. Their method is however very different as it is based on spherical k-means (Dhillon and Modha, 2001), a variant of k-means that replaces the usual square loss minimization by an angular dissimilarity measure minimization. The data-generating model we consider is a natural factor model that can be viewed as a generalization of the max-linear model considered in Janßen and Wan (2020). We characterize the limiting distribution of the extremes in this model. We rigorously study the extremal nearest neighbor graphs and show that their connected components can identify the clusters of extremes of our factor model. By construction our algorithm is computationally tractable and model agnostic, so it has the potential of working well beyond the setting covered by our theory.

The rest of the paper is organized as follows. Section 2 provides some background notions from multivariate regular variation necessary for our analysis. Section 3 introduces the proposed spectral clustering algorithm for extremes. In Section 4 we introduce our linear factor model (LFM) and derive the asymptotic distribution of the angular components X/‖X‖ of observations with high threshold exceedances, i.e., observations X with very large ‖X‖. In Section 5 we study the behavior of k-nearest neighbor graphs constructed using a sample of extremes. Section 6 contains a number of numerical examples that illustrate our proposed method. We show in Section 6.1 that for a large range of values of k the connected components of the nearest neighbor graph consistently identify the clusters of extremes arising from the linear factor model. This includes an examination of the LFM with added noise, where the spectral clustering method is still able to estimate the signal reasonably well. The good numerical performance of the method in the LFM plus noise context suggests that it might work well in more general settings. The spectral clustering method is also applied to an environmental data set consisting of daily measurements of five air pollutants over both winter and summer seasons. The analysis suggests that in modeling the extremes, an LFM with 5 clusters seems appropriate. Moreover, viewed as a time series, the extremal dependence for O3 and NO2 does not extend beyond a second-day time lag. Proofs of the technical results in the body of the paper and their complements are contained in the appendix.

2. Background on multivariate regular variation

Regular variation is often the starting point in modeling heavy-tailed data. We will make regular use of this assumption throughout this work. A random vector X = (X1, . . . , Xd) is said to be regularly varying with exponent α > 0 if for some norm ‖·‖ on R^d and some probability measure Γ on the unit sphere S^{d-1} in R^d, the following limits hold:

P( X/‖X‖ ∈ · | ‖X‖ > r ) ⇒ Γ(·), as r → ∞,   (1)

lim_{r→∞} P(‖X‖ > rx) / P(‖X‖ > r) = x^{-α}   (2)

for all x > 0, where ⇒ denotes weak convergence on S^{d-1}. In other words, the law of the angular component X/‖X‖ stabilizes as the radial component becomes large, and the radial component is regularly varying (equation (2)) with index α. The limit probability measure Γ is called the angular measure (or spectral measure) and describes how likely the extremal observations are to point in different directions.
In other words, the angular measure describes the limiting extremal angle for high threshold exceedances that correspond to large ‖X‖. The support of this measure is particularly important since it shows which directions of the extremes are feasible and which are not feasible. Throughout the rest of the paper we will take ‖·‖ to be the Euclidean norm.

For example, if X has a spherically symmetric distribution and the radius ‖X‖ has a Pareto distribution with index α, then X is regularly varying with angular measure that is uniform on S^{d-1}. In this case, the random vector is equally likely to have extremes in any direction so we do not expect extremes to be clustered. On the other hand, consider observations generated from a univariate MA(3) process given by Yt = Zt + 0.5Zt-1 - 0.6Zt-2 + 1.5Zt-3, where {Zt} is an iid sequence of symmetric stable random variables with index α = 1.8. The bivariate vector Xt = (Yt, Yt-1) is regularly varying and the scatter plot of Yt vs Yt-1 is displayed in the left panel of Figure 1. Notice that for large values of ‖Xt‖, the points align themselves on rays. In the right panel is a plot of Xt/‖Xt‖ for those values of ‖Xt‖ that exceed the 99.8% empirical quantile of the radii and are grouped in 10 clusters. In this particular case, the spectral distribution consists of 10 point masses (5 pairs of symmetric point masses, indicated by arrows emanating from the origin).

Figure 1: Scatter plot of (Yt, Yt-1) for an MA(3) process (left); spectral measure on S^1 (right).

This simple example illustrates the challenge in finding meaningful low dimensional regions supporting the extremes. In a series of papers (see Meyer and Wintenberger (2021), Drees and Sabourin (2021), Cooley and Thibaud (2019)), PCA-like analyses have been applied to finding low-dimensional subspaces that contain the bulk of the support of the spectral measure. As seen in this MA(3) example, such strategies might not be well suited for extracting the key features in the extremes, which do not necessarily live neatly on a small number of subspaces. The approach taken here will be to use spectral clustering to learn the angular measure. While machine learning ideas provide guidance for thinking about and addressing multivariate extremes, the very nature of rare events will require us to borrow ideas from the theory of multivariate regular variation to analyze the extremal nearest neighbor graphs used by our algorithm.
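To make the MA(3) illustration of Figure 1 easy to reproduce in spirit, the following minimal sketch simulates the process above and collects the angular parts of the bivariate observations whose radius exceeds a high empirical quantile. It assumes SciPy's levy_stable generator for the symmetric α-stable innovations; the seed and the 99.8% level mirror the description of Figure 1 but are otherwise illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n, alpha = 100_000, 1.8

# iid symmetric alpha-stable innovations (beta = 0 gives a symmetric law)
Z = levy_stable.rvs(alpha, 0.0, size=n + 3, random_state=rng)

# MA(3): Y_t = Z_t + 0.5 Z_{t-1} - 0.6 Z_{t-2} + 1.5 Z_{t-3}
coefs = [1.0, 0.5, -0.6, 1.5]
Y = sum(c * Z[3 - j: n + 3 - j] for j, c in enumerate(coefs))

# bivariate observations X_t = (Y_t, Y_{t-1}) and their radii
X = np.column_stack([Y[1:], Y[:-1]])
R = np.linalg.norm(X, axis=1)

# keep the angular parts of the observations above the 99.8% radial quantile
thresh = np.quantile(R, 0.998)
angles = X[R > thresh] / R[R > thresh, None]   # points on the unit circle S^1
print(angles.shape)
```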
3. Spectral clustering

In this section we describe how to construct random graphs based on a sample of extremes and how to use such graphs to find clusters of extremes via a simple spectral clustering algorithm.

3.1 Constructing random graphs

Starting with a sample of d-dimensional observations Xi, i = 1, . . . , n, one first needs to identify the extremal part of the sample, on which the extremal estimation will be performed. This is often done by selecting a high threshold un and assigning to the extremal part of the sample the observations Xi satisfying ‖Xi‖ > un. Assume that Nn observations Xi (with i in some set Vn of cardinality Nn) are in the extremal part of the sample. Associated with each i ∈ Vn is the angular component of the observation, Xi/‖Xi‖, which lives on the unit sphere. This allows us to think of the points in Vn as points on the unit sphere, forming nodes in a simple graph. We connect nodes i1 and i2 by an edge according to a certain rule. One possible rule chooses ϵ > 0 and connects i1, i2 ∈ Vn by an edge if

ρ( Xi1/‖Xi1‖, Xi2/‖Xi2‖ ) ≤ ϵ.   (3)

One often uses the usual Euclidean distance ρ on the unit sphere S^{d-1} in (3), but another distance function on the unit sphere could also be used. The random set of edges En created in this fashion defines an ϵ-neighborhood graph. In what follows, we will focus on a different rule, leading to the k-Nearest Neighbor graphs (k-NN graphs). This rule asserts that a node i1 ∈ Vn is connected to a node i2 ∈ Vn if the point on the unit sphere corresponding to i2 is among the k-nearest neighbors of the point corresponding to i1, according to some distance function. This definition leads to a directed graph because the neighborhood relationship is not symmetric. There are two natural ways of making this graph undirected. The first one is to connect i1 and i2 with an undirected edge if either i1 is among the k-nearest neighbors of i2 or i2 is among the k-nearest neighbors of i1. The second one connects i1 and i2 only if both conditions are met, i.e., when i1 and i2 are mutual nearest neighbors (hence the resulting graph is usually called the mutual k-nearest neighbor graph). Our main results apply to both constructions.

We work with weighted graphs, where we assign to the edges a weight determined by the distance between the points on the unit sphere defining the nodes. More specifically, we will take as input to our algorithm the weighted adjacency matrix W = [w_{i1 i2}], i1, i2 ∈ Vn, with

w_{i1 i2} = d( Xi1/‖Xi1‖, Xi2/‖Xi2‖ ) if i1 and i2 are connected, and w_{i1 i2} = 0 if i1 and i2 are not connected.   (4)

When defining the weights in (4), d(·,·) is a certain kernel; a typical example of such a kernel, d(x, y) = exp(-‖x - y‖), is used in the examples of Section 6. In the following subsections, we describe our algorithm and highlight the theoretical challenges.

3.2 The algorithm

The degree of a node i ∈ Vn is defined as d_i = Σ_{i' ∈ Vn} w_{i i'}. The degree matrix D is defined as the diagonal matrix with diagonal elements [d_i], i ∈ Vn, and the normalized symmetric graph Laplacian matrix is defined as

L = I - D^{-1/2} W D^{-1/2},   (5)

where I is the identity matrix. The spectral clustering algorithm of Ng et al. (2002) proceeds as follows:

1. Compute the first m eigenvectors u1, . . . , um of L (i.e., the eigenvectors corresponding to the m smallest eigenvalues of L) and define an Nn × m matrix U using these eigenvectors.
2. Form an Nn × m matrix V by normalizing the rows of U to have unit norm.
3. Treating each of the Nn rows of V as a vector in R^m, cluster them into m clusters C1, . . . , Cm using the K-means algorithm.
4. Assign the original points Xi to cluster Cj if and only if row i of the matrix V was assigned to cluster Cj.

The motivation for this algorithm is described below.

3.3 Connected components, Laplacian and k-nearest neighbor graph

We say that a subset A ⊂ Vn of the vertices of a graph is connected if any two vertices in A can be joined by a path of edges such that all intermediate vertices also lie in A. If A is connected and there are no connections between A and Vn \ A, then A is called a connected component. It is well known that the number of connected components of a graph G is related to the spectrum of its symmetric graph Laplacian. This is formalized in the following proposition (Von Luxburg, 2007, Proposition 2).

Proposition 1 Let G = (V, E) be an undirected graph with non-negative weights. Then the multiplicity m of the eigenvalue 0 of L equals the number of connected components A1, . . . , Am in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors δ_{A1}, . . . , δ_{Am} of those components.
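To make Sections 3.1 and 3.2 concrete, here is a minimal Python sketch that extracts the extremal angular sample, builds a symmetrized (or mutual) k-NN graph on the sphere, weights its edges with the exponential kernel of (4), forms the normalized Laplacian (5), and runs the steps of Ng et al. (2002). The function name, defaults and the dense-matrix implementation are our own illustrative choices aimed at small extremal samples, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans


def extremal_spectral_clustering(X, beta=0.9, C=5.0, m=2, s=1.0, mutual=False, seed=0):
    """Spectral clustering of the angular parts of threshold exceedances.

    X: (n, d) sample; beta: radial quantile defining the extremal sample;
    C: constant in k_n = ceil(N_n / (C log N_n)) + 1; m: number of clusters.
    """
    R = np.linalg.norm(X, axis=1)
    u = np.quantile(R, beta)
    ang = X[R > u] / R[R > u, None]                 # angular parts on S^{d-1}
    N = ang.shape[0]
    k = int(np.ceil(N / (C * np.log(N)))) + 1

    # directed k-NN relation on the sphere (Euclidean distance), self excluded
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(ang)
    _, idx = nbrs.kneighbors(ang)
    A = np.zeros((N, N), dtype=bool)
    A[np.repeat(np.arange(N), k), idx[:, 1:].ravel()] = True
    A = (A & A.T) if mutual else (A | A.T)          # mutual or symmetrized k-NN graph

    # weighted adjacency (4) with the exponential kernel d(x, y) = exp(-s ||x - y||)
    dist = np.linalg.norm(ang[:, None, :] - ang[None, :, :], axis=-1)
    W = np.where(A, np.exp(-s * dist), 0.0)

    # normalized symmetric graph Laplacian (5)
    deg = np.maximum(W.sum(axis=1), 1e-12)
    Dih = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(N) - Dih @ W @ Dih

    # Ng et al. (2002): m smallest eigenvectors, row-normalize, K-means
    _, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    U = vecs[:, :m]
    V = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    labels = KMeans(n_clusters=m, n_init=10, random_state=seed).fit_predict(V)
    return ang, labels
```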
It follows from this result that if the spectral clustering algorithm is applied to a graph with the number of connected components equal to the parameter m in the algorithm, then the algorithm will identify the connected components. In Sections 4 and 5 we derive the asymptotic angular distribution of multivariate extremes arising from a linear factor model and provide the relevant asymptotic theory for the connected components of the kn-nearest neighbor graph constructed using the angular components of the extremes. Specifically, it will be rigorously established that under certain conditions the spectral clustering of the resulting graph consistently estimates the support of the spectral measure of multivariate extremes arising from this model.

4. Linear factor model and convergence of the angular components

We now introduce the generative model that we will be studying in this paper. Let X be a d-dimensional random vector defined by the following linear factor model (LFM)

X = AZ,   (6)

where A = [a_{ij}], i = 1, . . . , d; j = 1, . . . , p, is a d × p matrix of nonnegative elements and Z is a p-dimensional random vector of factors consisting of independent and identically distributed random variables, that are either nonnegative or symmetric, and have asymptotically Pareto tails, i.e.,

P(Z1 > z) ∼ c z^{-α}, as z → ∞,   (7)

for some α > 0 and c > 0. Note that we write f(x) ∼ g(x) as x → ∞ to mean that lim_{x→∞} f(x)/g(x) = 1. In the examples section, we will add noise to the model in (6), in which case the model corresponds to a standard heavy-tailed linear factor model. One can think of (6) as a linear version of the max-linear model studied in Janßen and Wan (2020), which has the same spectral distribution. We will also relax the assumption that the matrix A is non-negative and will allow the noise to be symmetric. In this case, the spectral distribution of the model is no longer constrained to the positive quadrant of S^{d-1}. Related max-linear models have also been considered in the context of time series models for extremes (Davis and Resnick, 1989; Hall et al., 2002) and more recently in the context of structural equation models (Gissibl and Klüppelberg, 2018; Klüppelberg and Lauritzen, 2019).

The asymptotic Pareto assumption in (7) can be weakened to regular variation, at least for the main results in this section. However, this comes at the expense of assuming a more intrinsically complex set of conditions on the choice of thresholding sequences. Additional conditions, such as existence and properties of the density function of the noise, are required for the proofs of the results in Section 5.

It follows immediately from (6) and (7) (see, for example, Basrak et al. (2002), Proposition A.1) that X is a multivariate regularly varying random vector satisfying (1); namely,

P( X/‖X‖ ∈ · | ‖X‖ > x ) ⇒ Γ(·), as x → ∞,   (8)

where ⇒ denotes weak convergence on the unit sphere S^{d-1}, and Γ is a discrete probability measure on S^{d-1} that, in the nonnegative case, puts mass ‖a^{(k)}‖^α / w at a^{(k)}/‖a^{(k)}‖ for k = 1, . . . , p, where a^{(k)} = (a_{1k}, a_{2k}, . . . , a_{dk})' is the kth column of the matrix A and

w = Σ_{k=1}^p ‖a^{(k)}‖^α.   (9)

In other words, Γ has the representation

Γ(·) = w^{-1} Σ_{k=1}^p ‖a^{(k)}‖^α δ_{a^{(k)}/‖a^{(k)}‖}(·),   (10)

where δ_x(·) is the Dirac measure that puts unit mass at x. On the other hand, in the symmetric case, Γ puts mass ‖a^{(k)}‖^α / (2w) at ±a^{(k)}/‖a^{(k)}‖ for k = 1, . . . , p.
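For a given loading matrix A and tail index α, the atoms and masses of the discrete angular measure in (9)-(10), and of the symmetric counterpart just described, are straightforward to tabulate. The helper below is a small sketch under the LFM assumptions of this section; the function name and the example matrix (the one later used in Section 6.1) are ours.

```python
import numpy as np

def lfm_angular_measure(A, alpha, symmetric=False):
    """Atoms a^(k)/||a^(k)|| and masses ||a^(k)||^alpha / w of the angular measure (10)."""
    norms = np.linalg.norm(A, axis=0)         # ||a^(k)||, k = 1, ..., p
    w = np.sum(norms ** alpha)                # normalizing constant (9)
    atoms = A / norms                         # columns normalized to the unit sphere
    masses = norms ** alpha / w
    if symmetric:
        # symmetric factors: each atom splits into +/- a^(k)/||a^(k)|| with half the mass
        atoms = np.concatenate([atoms, -atoms], axis=1)
        masses = np.concatenate([masses, masses]) / 2.0
    return atoms, masses

# example with the 4 x 2 loading matrix of Section 6.1 and Frechet(1) factors (alpha = 1)
A = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
atoms, masses = lfm_angular_measure(A, alpha=1.0)
```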
That is, the number of point masses in the symmetric case is double of that number in the nonnegative case. 1 Based on a random sample of iid copies of X1, . . . , Xn of X as above, we construct an estimate of the location of the point masses that comprise Γ, i.e., a(k) , k = 1, . . . , p (11) in the nonnegative case, and a(k) , k = 1, . . . , p (12) in the symmetric case. Note that these ck are not necessarily distinct. Intuitively, for large n, the angular parts Xi/ Xi of the sample for which Xi is large, will cluster around these ck. In fact, we formalize this intuition and provide a rate of convergence for the limiting extremal angles with high threshold exceedances in the next theorem. This will be a key ingredient in our convergence analysis of extremal k-NN graphs. For the ease of notation we will prove the following result in the nonnegative case; the symmetric case follows by simply doubling the number of points on the sphere. Theorem 2 If (un) is a sequence converging to infinity as n , then, in the nonnegative case, for any j = 1, . . . , p, the conditional law of given X > un, Zj > un/w1/α, (w defined in (9)) converges weakly to the law of S 1, j, . . . , S d, j , 1. Without much additional effort, one could consider the case that the tails of Z1 are balanced in the sense that limx P(Z1 > x)/P(|Z1| > x) p+ [0, 1]. The location of the point masses for Γ would be exactly the same as in the symmetric case, but with mass p a(k) α/w at ck, defined in (12), where p = 1 p+. Spectral learning of multivariate extremes where Wα has a Pareto distribution (i.e. P(Wα > x) = x α, x 1) that is independent of Z1, . . . , Zp, a2 ij Xl, j aljaij Xi, j , (13) Xi, j = Xi aij Zj = aim Zm . (14) Proof We start by observing that the conditional law of Z1, . . . , Zj 1, Zj/un, Zj+1, . . . , Zp (15) given X 2 > u2 n, Zj > un/w1/α, converges in distribution, as n , to the law of Z1, . . . , Zj 1, Wα/wj, Zj+1, . . . , Zp , (16) where wj = a(j) The main ingredients in establishing this result is to note that Z2 j is regularly varying with index α/2 while for i = j, Zi Zj is regularly varying with index α (see Embrechts and Goldie (1980); Theorem 3). Moreover, from the convolution closure property for sums of independent regularly varying random variables, it follows easily that P(Zj > unx , i=1 aki Zi)2 > u2 n, ) P(Zj > unx, i=1 w2 i Z2 i > u2 n) i=1 P(Zj > unx, w2 i Z2 i > u2 n) P(Zj > unx) . Now to finish the proof of (16), we use these relations and note that for x 1/wj (and hence x 1/w1/α), P(Zj > unx | X 2 > u2 n, Z2 j > u2 n/w2/α) P(Zj > unx, Pp i=1 w2 i Z2 i > u2 n, Z2 j > u2 n/w2/α) P(Pp i=1 w2 i Z2 i > u2n, Z2 j > u2n/w2/α) P(Zj > unx) P(Zj > un/wj) = P(Wα > wjx) . un X/ X cj = un (Pp m=1 a1m Zm, . . . , Pp m=1 adm Zm) = (V1, . . . , Vd) , Avella Medina, Davis and Samorodnitsky say. For l = 1, . . . , d, Vl = un wj Pp m=1 alm Zm alj X = un w2 j (Pp m=1 alm Zm)2 a2 lj X 2 wj X (wj Pp m=1 alm Zm + alj X ) . (17) We note that the numerator of (17) reduces to the following expression where the Z2 j terms cancel out Numl : = 2alj Zj a(j) 2Xl, j alj i=1 ai,j Xi, j + a(j) 2X2 l, j a2 lj i=1 X2 i, j a2 ij Xl, j aljaij Xi, j ! + a(j) 2X2 l, j a2 lj i=1 X2 i, j , (18) where Xi, j is as defined in (14). The denominator of (17) is handled in a similar way, but this time the Z2 j terms do not cancel. 
Indeed, since k=1 (a2 kj Z2 j + 2akj Zj Xk, j + X2 k, j) k=1 a2 kj Z2 j Pd k=1(X2 k, j + 2akj Zj Xk, j) Pd k=1 a2 kj Z2 j Pd k=1(X2 k, j + 2akj Zj Xk, j) w2 j Z2 j , we can write Denl = w2 j|Zj|(1 + op(1)) (wjalj Zj + R + aljwj|Zj|(1 + op(1))) , where R is a linear function in the variables Z1, . . . , Zp in which Zj does not appear, and op(1) goes to zero in probability given Zj > un/w1/α. We view Vl = un Numl Denl = Numl/un as a ratio of two continuous real-valued functions of the random vector in (15) (plus a vanishing term in Denl), so that the random vector (V1, . . . , Vd) becomes a d-dimensional vector of such ratios. By the continuous mapping theorem the random vector (V1, . . . , Vd) converges weakly to the d-dimensional vector of the ratios of the corresponding functions applied to the random vector in (16). These result in 2alj(Wα/wj)S l, j Spectral learning of multivariate extremes in the case of Numl and in W 2 α 2aljwj in the case of Denl. Putting everything together produces the claim. As explained above, the following corollary is an immediate consequence of Theorem 2. Corollary 3 Let a sequence of levels (un) converging to infinity. Then, in the symmetric case, in the notation of Theorem 2, for any j = 1, . . . , p, the conditional law of given X > un, Zj > un/w1/α, converges weakly to the law of S 1, j, . . . , S d, j . Remark 4 In the nonnegative case, Theorem 2 addresses the conditional convergence of X/ X , given X > un, Zj > un/w1/α, to the location cj of the corresponding atom of the spectral measure. It is also possible to address a conditional convergence to the mass w 1 a(j) α of this atom. Indeed, P Zj > un/w1/α X > un w 1 a(j) α (19) as n . To see this, write P Zj > un/w1/α X > un = uα n P Zj > un/w1/α, X > un uαn P X > un , and the numerator converges to c a(j) α, while the denominator converges to cw. If one strengthens the asymptotic Pareto tails assumption (7) to include the rate of convergence to the limit, then one would able to establish the rate of convergence in (19) as well. The situation is similar in the symmetric case. We do not pursue this in the present paper. We now explore the connection between large values of the underlying factors Zi1, . . . , Zip and large values of Xi . We will see that under certain conditions, high threshold exceedances of Xi are generated by only one underlying factor Zij, j = 1, . . . , p. This will be important for our analysis of extremal k-NN graphs which will require additional assumptions on the rate of growth of un. Since α+2 α(α+3) < α 1, we can further impose that the sequence (un) satisfies the growth conditions n 1/αun 0 and n (α+2)/(α(α+3))un , (20) as n . Also note that we may choose a further sequence (hn) such that hn , hn = o(un), hn = o u(α+1)/2 n n 1/2 , n 1/αunhn (21) as n . Indeed, the choice hn = u(α 1)/4 n n(2 α)/(4α) Avella Medina, Davis and Samorodnitsky works for this purpose. For n = 1, 2, . . . , we define the set of indexes corresponding to extreme observations In = i = 1, . . . , n : Xi > un , (22) and denote its cardinality by Nn =card(In). From (7) and (20), we see that the mean and variance of Nn/(nu α n ) converge to cw and 0, respectively and hence that Nn/(nu α n ) P cw, as n . (23) The following lemma connects exceedances of un by Xi with exceedances of hn by Zij, j = 1, . . . , p. Lemma 5 Let (hn) be a sequence satisfying (21) and consider the event Bn = for any i In at most one of Zim, m = 1, . . . , p exceeds hn . Then P(Bn) 1 as n . 
Proof Note that (6) implies that the mth component of the ith observation is of the form j=1 amj Zij, m = 1, . . . , d; i = 1, . . . , n , (24) where Zi1, . . . , Zip are iid random variables with asymptotic Pareto tails (7). Denote a = d1/2 max{amj, m = 1, . . . , d; j = 1, . . . , p} (25) Since un > hn for n large, we have Xik 2 > u2 n, Zim > hn for two or more of m = 1, . . . , p n P a max k=1,...,p Z1k > un, Z1m > hn for two or more of m = 1, . . . , p n P(Z1 > hn)P (Z1 > un/a ) 0 by the last property in (21) and (7). This proves the lemma. Equipped with Lemma 5 we can now proceed to bound the distance between the observed angular parts of the multivariate extremes and their corresponding theoretical asymptotic atoms. Assume, for a moment, that we are in the nonnegative case. We already know that for large n, we have that for every i In one of the values of Zim, m = 1, . . . , p must exceed un/a and all other values of these variables cannot exceed hn. We now define the sets of Spectral learning of multivariate extremes indexes corresponding to extremes generated by each of the individual factors i.e. we define for j = 1, . . . , p I(j) n = {i = 1, . . . , n : Xi > un, Zij > un/a } . (26) Consequently, j=1 I(j) n (27) and by Lemma 5 for large n this is a disjoint union with probability tending to one. Let N(j) n be the cardinality of I(j) n , j = 1, . . . , p. Using the fact that a a(j) for j = 1, . . . , p, the same argument as in (23) shows that j = 1, . . . p, N(j) n /(nu α n ) P c a(j) α, as n . (28) We enumerate Xi/ Xi , i In as Yi, i = 1, . . . , Nn, a sample on Sd 1 of random size Nn. For each j = 1, . . . , p, we enumerate Xi/ Xi , i I(j) n as Y(j) i , i = 1, . . . , N(j) n , a sample on Sd 1 of random size N(j) n . It is straightforward (if a bit tedious) to check the following result. Lemma 6 For large n, on the event Bn, for i = 1, . . . , N(j) n , Y(j) i cj 8(a )2 a(j) α hn un , j = 1, . . . , p, (29) where the cj are as defined in (12). The situation in the symmetric case is, of course, completely analogous. It follows from the definition of hn in (21) and (29) that the angular components of the extremes are clustered around the centers cj. The results in the next section build on Lemma 6 and provide sufficient conditions for our extremal spectral clustering algorithm to be consistent. For this, we provide a careful asymptotic analysis of the extremal k-NN graph used by the algorithm. 5. Asymptotic analysis of the connected components of the extremal k-NN graph Our analysis consists of two main components. The first one is to show that the extremes generated by different factors will belong to different components of the extremal k-NN graph as long as the cluster centers corresponding to the underlying factors are different. The second part will be to argue that all the extremes generated by an underlying factor will also belong to the same component of the extremal k-NN graph. This second step turns out to be the more technical one in our analysis and we will only establish this result for d = 2. Along the way we derive a few intermediate results that we also highlight in order to better explain the key ingredients of our argument. Going forward, in our proofs, c > 0 represents a finite and non-zero constant whose value may change from line-to-line. In the sequel we will assume, without further comments, that the sequence (un) satisfies (20). The first step of our program is covered by the following proposition. 
Avella Medina, Davis and Samorodnitsky Proposition 7 Suppose that kn = o nu α n as n . Then there is a sequence (Bn,1) of events with P(Bn,1) 1 as n such that, for all n large enough, on the event Bn,1, any two points Y(j1) i1 and Y(j2) i2 , i1 = 1, . . . , N(j1) n , i2 = 1, . . . , N(j2) n will belong to two different connected components of the kn-NN graph if cj1 = cj2. Proof Define Bn,1 = Bn N(j) n > kn for j = 1, . . . , p . (30) By Lemma 5, (28) and the assumption on kn , P(Bn,1) 1 as n . By (29) and the triangle inequality, on the events Bn,1, any point Y(j1) i1 has at least kn neighbours of the type Y(j1) i , i = 1, . . . , N(j1) n , i = i1, that are within distance of c hn/un from it. On the other hand, by (29) and the triangle inequality, its distance from any point Y(j2) i2 with cj1 = cj2 cannot be smaller than cj1 cj2 c hn/un. Therefore, for large n, the latter point cannot be among the kn-nearest neighbours of Y(j1) i1 . We now embark on the second step of our program and establish that, at least in the case d = 2, under appropriate conditions, the points Y(j) i , i = 1, . . . , N(j) n , belong, with high probability, to the same connected component in the kn-NN graph. We start by investigating the deviations of these points from the center of the cluster, cj, defined in (12). Since the points Y(j) i , i = 1, . . . , N(j) n are treated as independent, the following result is essentially immediate from Theorem 2. Lemma 8 In the nonnegative case, for any j = 1, . . . , p, the conditional law of un Y(j) 1 cj given N(j) n 1, converges weakly to the law of S 1, j, . . . , S d, j T that is specified in the statement of Theorem 2. An analogous statement holds in the symmetric case. Remark 9 It is a straightforward calculation to check that, if j = 1, . . . , d, l=1 alj S l, j = alja2 ij Xl, j a2 ljaij Xi, j = 0 . (31) Therefore, the normalized deviations of the points Y(j) i , i = 1, . . . , N(j) n from the center of the jth cluster are, in the limit, supported by a (d 1)-dimensional subspace. Spectral learning of multivariate extremes Using the information in Lemma 8 we now proceed to prove that under appropriate conditions, the points Y(j) i , i = 1, . . . , N(j) n , belong, with high probability, to the same connected component of the kn-NN graph. We will need some additional notation in order to state the result. For a fixed j = 1, . . . , d we write for m = 1, . . . , d, l=1 aml Z( ,j) il , i = 1, . . . , N(j) n ; (32) the notation should be compared with (24). That is, {(Z( ,j) i1 , . . . , Z( ,j) ip ) , i = 1, . . . , N(j) n } are iid random vectors distributed according to the conditional distribution of (Z1, . . . , Zp) given X > un, Zj > un/w1/α. Since the connectivity of any nearest neighbor graph is not affected by shifting and scaling, it is sufficient to consider the kn-NN graph constructed on the deviations of the points Y(j) i , i = 1, . . . , N(j) n from the cluster center. Continuing with the notation used in the proof of Theorem 2 we isolate the main term in the deviations from the cluster center by writing un Y(j) i cj = S( ,j) 1, j, . . . , S( ,j) d, j / w2 j(Z( ,j) ij /un) (33) + un Y(j) i cj S( ,j) 1, j, . . . , S( ,j) d, j / w2 j(Z( ,j) ij /un) = M(i) + D(i), i = 1, . . . , N(j) n , where S( ,j) l, j = Y (j) il alj Z( ,j) ij = Pp m=1, m =j alm Z( ,j) im is analogous to (14). In the case d = 2, it follows from (31) that for some nonzero deterministic vector b in R2, S( ,j) 2, j Z( ,j) ij /un b, i = 1, . . . , N(j) n . 
For notational simplicity we continue the discussion with j = 1, and in this case these are essentially univariate iid random variables with the distribution of Tn = a22Z2 + + a2p Zp w2 1Z1/un (34) (a11Z1 + + a1,p Zp)2 + (a21Z1 + + a2p Zp)2 > u2 n, Z1 > un/w1/α . (35) Finally, we let FTn denote the conditional law of Tn in (34) given the conditions in (35). For technical reasons we require further conditions on the latent factors in our subsequent results. We assume that the generic noise variable Z in (6) and (7) is positive or symmetric, and has a probability density function f Z such that f Z(z) is bounded away from 0 on compact intervals and bounded from above, (36) and B 1z (α+1) f Z(z) Bz (α+1), (37) Avella Medina, Davis and Samorodnitsky α > 1, for all z z0, some B 1. The following lemma shows that the conditional density function f Tn(t) = FTn(t)/ t enjoys some useful regularity properties. Lemma 10 Assume (36) and (37). For α > 1 the conditional density function f Tn is such that: (i) There exists an G (0, ) such that for all n large enough, f Tn(t) G for all t. (ii) f Tn is uniformly bounded from below on compact intervals, uniformly in n. (iii) There is a constant D 1 and a number t0 0 such that D 1t (α+1) f Tn(t) Dt (α+1) uniformly for all n large enough and all t t0. It is clear that an analogous result holds for the appropriate conditional densities in the symmetric case. The following intermediate result is the key ingredient for completing our analysis of the connected component of the extremal kn-NN graph, at least in the case d = 2, assuming certain regularity conditions on the noise variables. Lemma 11 Assume (36), (37) and let d = 2, τ > 1 and consider the random variable mn defined by mn = N(1) n / τ log N(1) n , so that by (23) mn cw1nu α n τ log(nu α n ), n . (38) Define the intervals Ii,n = F 1 Tn (i 1)/mn , F 1 Tn i/mn , i = 1, . . . , mn (39) as well as intervals along vector b by Ji,n = Ii,nb, i = 1, . . . , mn . Then, on an event with probability tending to one, there is a finite positive integer K0 such that for all n large enough and all i = 2, . . . , mn K0, every point in Ji,n is closer to every point in the intervals Ji 1,n and Ji+1,n than to any point in an interval Ji1,n with |i i1| > K0. The proofs of these two lemmas are contained in the Appendix. We are now ready to state the main result showing that the extremes generated from the same underlying factor will also belong to the same connected component of the extremal kn-NN graph with probability tending to one, under appropriate regularity conditions. This time, for simplicity, we only consider the symmetric case. Theorem 12 Assume (36), (37) and let d = 2. Then, if kn > G log n with large enough G > 0, there is a sequence (Bn,2) of events with P(Bn,2) 1 as n such that, for all n large enough, on the event Bn,2, any two points Y(j) i1 and Y(j) i2 , i1 = 1, . . . , N(j) n , i2 = 1, . . . , N(j) n will belong to the same connected component of the kn-NN graph for any j = 1, . . . , p. Spectral learning of multivariate extremes The proof of the theorem has been relegated to the appendix. It follows from Proposition 7 and Theorem 12 that, with probability tending to one as n , the extremal kn-NN graph obtained from a sample drawn from (6) will have exactly m p connected components corresponding to the m distinct asymptotic point masses (12) of the model. In other words, the extremal kn-NN graph consistently identifies the underlying clusters through its connected components. 
This in turn implies that spectral clustering will be consistent by Proposition 1. We have therefore shown the following main practical result.

Corollary 13 Assume (36), (37), d = 2, kn = o(n u_n^{-α}) and kn > G log n. Then, spectral clustering will consistently identify the clusters of extremes arising from the linear factor model.

Remark 14 In practice consistent clustering can be achieved by taking kn > G0 log Nn for some G0 > 0 and kn = o(Nn), since Nn/(n u_n^{-α}) →P cw as n → ∞. In our experiments we chose kn = ⌈Nn/(τ log Nn)⌉ + 1 for some τ > 1.

Corollary 13 suggests a simple strategy for estimating the asymptotic angular measure of the extremes generated from the linear factor model (6). Assume we run spectral clustering on the extremal kn-NN graph. Then we can denote by Î_n^{(j)} the set of indices corresponding to the jth cluster found by the algorithm for j = 1, . . . , m. With these sets we can define N̂_n^{(j)} = card(Î_n^{(j)}) and estimate the centers of the spectral measure and their respective masses as

ĉ_j = (1/N̂_n^{(j)}) Σ_{i ∈ Î_n^{(j)}} Xi/‖Xi‖ and π̂_j = N̂_n^{(j)}/Nn, j = 1, . . . , p.   (40)

The following result is an immediate consequence of the main results of this section.

Corollary 15 Suppose m = p and that the conditions of Proposition 7 and Theorem 12 hold. Then, ĉ_j →P c_j and π̂_j →P w^{-1}‖a^{(j)}‖^α for all j = 1, . . . , p.

Note that in practice one can also normalize the estimates ĉ_j to ensure that they lie on the unit sphere for all n. Clearly the resulting estimators remain consistent under the conditions of Corollary 15. Even though the theoretical results of this section use the assumption α > 1 in (37), we believe they should also hold when α ∈ (0, 1]. The numerical experiments shown in the next section support this assertion.

6. Numerical illustrations

In all the examples considered below we compute weighted adjacency matrices using the exponential kernel d(x, y) = exp(-s‖x - y‖) with s = 1 and select the number of clusters as suggested by the screeplots of the fully connected weighted adjacency matrices W. It matched nicely the correct number of clusters, when known. We consider sample sizes n = {1000, 5000, 25000, 125000} and take a sample of extremes corresponding to observations whose Euclidean norm is larger than or equal to the empirical quantile at the corresponding level β = {0.9, 0.96, 0.984, 0.9968}, respectively. These quantiles were chosen to lead to samples of extremes of sizes Nn = {100, 200, 400, 800}, correspondingly. For these extremes we define kn-nearest neighbor graphs with kn = ⌈Nn/(C log Nn)⌉ + 1, where the constant C takes values in {3, 5, 7, 9}.

6.1 Linear factor model with and without noise

As a first example, consider d-dimensional vectors that follow the p-dimensional linear factor model

X = AZ + σε,   (41)

where A ∈ R^{d×p} is a matrix of factor loadings, Z = (Z1, . . . , Zp) is a p-dimensional vector consisting of iid standard Fréchet distributed components (α = 1), σ ≥ 0 regulates the signal to noise ratio and ε is a noise vector obtained by multiplying a univariate independent standard Fréchet variable with an independent d-dimensional random vector of iid standard normals, i.e.,

ε = Nη,   (42)

where η is standard Fréchet, N = (N1, . . . , Nd) is a d-dimensional random vector consisting of iid standard normals, and Z, η, and N are independent.
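The model (41)-(42) is simple to simulate, since a standard Fréchet variable can be obtained as 1/(-log U) for U uniform on (0, 1). In the sketch below the normal vector N is taken d-dimensional so that the noise σε has the same dimension as AZ; the function name, the seed and the example loading matrix (the matrix (45) used below) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def simulate_noisy_lfm(A, n, sigma, rng):
    """Draw n observations from X = A Z + sigma * eps, with Frechet(1) factors Z
    and eps = N * eta as in (42): eta Frechet(1), N iid standard normal."""
    d, p = A.shape
    Z = 1.0 / -np.log(rng.uniform(size=(n, p)))    # standard Frechet(1) factors
    eta = 1.0 / -np.log(rng.uniform(size=n))       # independent Frechet(1) radial term
    N = rng.standard_normal(size=(n, d))           # iid standard normals
    eps = N * eta[:, None]
    return Z @ A.T + sigma * eps

rng = np.random.default_rng(1)
A = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])   # loading matrix (45)
X = simulate_noisy_lfm(A, n=125_000, sigma=1.0, rng=rng)
```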
Now using computations similar to those given in Section 4, it can be shown that

P(‖X‖ > x) / P(Z1 > x) ≈ P( Σ_{i=1}^p ‖a^{(i)}‖^2 Z_i^2 + σ^2 ‖N‖^2 η^2 > x^2 ) / P(Z1 > x)
≈ [ Σ_{i=1}^p P(‖a^{(i)}‖^2 Z_i^2 > x^2) + P(σ^2 ‖N‖^2 η^2 > x^2) ] / P(Z1 > x)
→ Σ_{i=1}^p ‖a^{(i)}‖ + σ E‖N‖, as x → ∞,   (43)

where the last line follows from an application of Breiman's lemma, see Breiman (1965). Taking this calculation one step further, we find that the angular measure associated with the model (41) is (see (10))

Γ(·) = w^{-1} ( Σ_{i=1}^p ‖a^{(i)}‖ δ_{a^{(i)}/‖a^{(i)}‖}(·) + σ E[ ‖N‖ δ_{N/‖N‖}(·) ] ),

where w = Σ_{i=1}^p ‖a^{(i)}‖ + σ E‖N‖. In other words, Γ has discrete mass points at a^{(i)}/‖a^{(i)}‖ with probability ‖a^{(i)}‖/w, i = 1, . . . , p, and a uniform distribution (that of N/‖N‖) on S^{d-1} with probability σE‖N‖/w. This latter piece corresponds to the noise component σε. So the goal here is to identify the discrete components of Γ using our method when the model does not strictly follow the LFM.

Figure 2 shows pairwise scatter plots of the angular components of extremes generated from a pure signal and a noisy LFM with σ > 0. We note that if σ = 0, then model (41) is approximately equal to the max-linear model X = ( ∨_{j=1}^p a_{1j}Zj, . . . , ∨_{j=1}^p a_{dj}Zj )' and will in fact have the same asymptotic spectral measure. Intuitively, this model generates p clusters of extremes since the noise term is only adding uniform noise to the angular measure.

Figure 2: Pairwise scatterplots of the angular part of the extremes generated from (41) with factor loading matrix (45), n = 125000, Nn = 400 and σ = {0, 1}; panel (a) shows the pure signal LFM and panel (b) the noisy LFM. In both cases there are two clear clusters corresponding to the signal. The red points in subfigure (b) denote extremes attributed to the signal AZi.

As part of a simulation study, we consider σ = {0, 1, 3, 5} and choose

A = [ 0.1  0.9
      0.2  0.8
      0.3  0.7
      0.4  0.6 ].   (45)

This model is similar to one of the max-linear models considered in the simulations of Janßen and Wan (2020), where our factor loading matrix A can be viewed as a deterministic version of their random factor loadings. In the simulations we took two clusters for the pure signal model where σ = 0 and three clusters for the noisy model when σ > 0, as these values are suggested by the typical screeplots we observed; see Figure 3.

We compute the normalized columns of the A matrix which correspond to the location of the point masses of the spectral distribution (these are the ck, k = 1, 2 in (12)). After applying our method to a single realization of size n = 125000 with Nn = 400, kn = ⌈400/(5 log 400)⌉ + 1 = 15, visualized in the pairwise scatter plots of Figure 2, we obtained the estimates of the ck represented in Figure 4. These masses on the sphere are estimated by taking the mean of all members in each of the identified clusters, seen in Figure 5, and then normalizing them to lie on the unit sphere.
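The cluster-mean estimation just described is exactly the plug-in estimator (40). A minimal sketch of this step (our own helper, with the optional renormalization of the centers to the unit sphere mentioned after Corollary 15):

```python
import numpy as np

def estimate_angular_measure(angles, labels, renormalize=True):
    """Plug-in estimates (40): cluster-wise means and empirical masses."""
    centers, masses = [], []
    N = len(labels)
    for j in np.unique(labels):
        members = angles[labels == j]
        c = members.mean(axis=0)                  # cluster center estimate c_j
        if renormalize:
            c = c / np.linalg.norm(c)             # project back onto the unit sphere
        centers.append(c)
        masses.append(len(members) / N)           # pi_j = N_n^(j) / N_n
    return np.array(centers), np.array(masses)
```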
The two panels in Figure 4 correspond to the case of 2 clusters and no noise and the case of two clusters with uniform noise. In the first plot, the heat map does a good job in recreating the relative sizes of the mass locations. In the second panel, the first two columns of the matrix also reproduce the relative sizes of the columns (increasing in the first and decreasing in the second) of the A matrix. The third column corresponds to the cluster of points that have not been assigned to either of the first two clusters. As such they are essentially scattered uniformly around the unit sphere but away from the locations of the point masses corresponding to the first two columns. This is reflected in the third cluster having more negative values, as indicated by the softer (red) colors in the heat map.

Figure 3: Screeplots of the fully connected kernel matrix of the pure signal and noisy linear factor models.

Figure 4: The heat maps show the estimated cluster centers based on the cluster assignments displayed in Figure 5; panel (a) corresponds to the pure signal and panel (b) to uniform noise on the sphere. The extremal sample corresponds to four dimensional extremes generated from the LFM given by (41) with loading matrix (45) and σ = 0 and σ = 1, respectively. In both cases we took n = 125000, Nn = 400 and kn = 15.

Figure 5: Cluster assignments output by spectral clustering applied to data generated from (41) with n = 125000, Nn = 400 and σ = {0, 1}; panel (a) shows the pure signal LFM and panel (b) the noisy LFM. In both cases spectral clustering used an extremal 15-NN graph.

A small simulation study was conducted for this LFM model with and without noise. Based on the screeplots, we used 2 clusters in the noiseless case and 3 clusters in the case with noise. The two normalized columns of A were estimated and the boxplots of the estimation error measured in Frobenius norm are displayed in Figure 6 for σ = 0 and in Figure 7 for the case σ > 0. The succession of boxplots in each row corresponds to an increasing Nn, with the centers and widths becoming smaller. Note that the scales on the plots change across the row. The boxplots in blue correspond to our method with different choices of nearest neighbors as a function of C, and the yellow boxplot is based on the spherical k-means approach considered in Janßen and Wan (2020). In the σ = 0 case, our method performs about the same or slightly better than the spherical k-means method. However, as one adds noise to the model, our method generally outperforms spherical k-means.

In models with larger noise, it can be more difficult to estimate the LFM signal. So to compare performance across different sample sizes and levels of noise, we can calibrate by calculating a notion of signal to noise ratio. In this context we consider the part of the mass in the angular measure associated to the signal in (41), which as a function of σ is given by

SNR(σ) := Σ_{i=1}^p ‖a^{(i)}‖ / ( Σ_{i=1}^p ‖a^{(i)}‖ + σ E‖N‖ ).

In the absence of any noise, i.e., σ = 0, SNR is 1, while as σ → ∞, SNR converges to 0. For the simulation example above, for which d = 4 and p = 2, we have E‖N‖ = √2 Γ(5/2)/Γ(2) ≈ 1.880. Hence SNR(σ) = 2.065/(2.065 + 1.880σ).
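The constants above, and the effective sample sizes ESS = SNR × Nn reported in the caption of Figure 7 below, follow directly from this formula. A quick numerical check, using the exact expression E‖N‖ = √2 Γ((d+1)/2)/Γ(d/2) for a d-dimensional standard normal vector (the loop and rounding are ours):

```python
import numpy as np
from scipy.special import gamma

A = np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
d = A.shape[0]
signal = np.linalg.norm(A, axis=0).sum()                    # sum_i ||a^(i)|| = 2.065
E_norm_N = np.sqrt(2) * gamma((d + 1) / 2) / gamma(d / 2)   # = 1.880 for d = 4

def snr(sigma):
    return signal / (signal + sigma * E_norm_N)

for sigma in (1, 3, 5):
    ess = [round(snr(sigma) * Nn) for Nn in (100, 200, 400, 800)]
    print(sigma, ess)   # matches the ESS values in the caption of Figure 7 up to rounding
```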
In looking at the various plots in Figure 7, it is instructive to compute the effective sample size given by ESS = SNR × Nn. This number essentially gives the expected number of observations, out of the total Nn, that come from the signal. With this index in mind, plots that have the same ESS values (reported in the caption of Figure 7) generally show similar results, since the procedures are applied to roughly the same number of extreme observations attributed to the signal component in the model. We finally note that in this simulation α = 1, which is not currently covered by our LFM theory but is the setting proposed in the simulations of Janßen and Wan (2020). We carried out simulations with α = 0.5 and α = 2 and obtained qualitatively the same type of results as the ones reported here. The only noticeable difference was that spherical k-means seems to work better with larger α in the noisy model, but is much worse for small α. In both cases spectral clustering outperformed spherical k-means.

Figure 6: Estimation error measured in Frobenius norm when σ = 0. The panels correspond to the factor model with Nn = 100, 200, 400 and 800; blue boxplots show spectral clustering for C = 3, 5, 7, 9 and the yellow boxplot shows spherical k-means.

Figure 7: Estimation error of cluster centers measured in Frobenius norm. The data was generated from the noisy LFM (41). The sample sizes increase from left to right as n = {1000, 5000, 25000, 125000} and Nn = {100, 200, 400, 800}, and the noise level increases from top to bottom as σ = {1, 3, 5}. Across rows the ESS values are: {52, 105, 209, 418}, {27, 54, 107, 214}, {18, 36, 72, 144}.

6.2 Bivariate extremes from MA(3)

We consider the model discussed in the introduction and represented in Figure 1. More specifically, the model is Yt = Zt + 0.5Zt-1 - 0.6Zt-2 + 1.5Zt-3, where {Zt} is an iid sequence of symmetric stable random variables with index α = 1.8. We analyze the extremal dependence structure of the bivariate vector Xt = (Yt, Yt-1) by looking for clusters in the extremes of Xt.

Figure 8: Screeplot of the fully connected kernel matrix (left) and clustering performance on the 35-NN graph (right) for 2-dimensional MA(3) extremes when n = 25000 and Nn = 400.
This model can be written in the form (6) since we can define Zt = (Zt, Zt-1, Zt-2, Zt-3, Zt-4)' and hence

Xt = [ 1   0.5  -0.6   1.5    0
       0    1    0.5  -0.6   1.5 ] Zt.

Note that even though in this case the sample {Xt} is not independent, the asymptotic distribution obtained in Theorem 2 still holds. In particular, the angular distribution is supported on the points (12), i.e., c1,± = ±(1, 0)', c2,± = ±(0.5, 1)'/√1.25, c3,± = ±(-0.6, 0.5)'/√0.61, c4,± = ±(1.5, -0.6)'/√2.61 and c5,± = ±(0, 1)'.

Figure 8 illustrates the behavior of spectral clustering for this model when Nn = 400 and kn = ⌈400/(2 log 400)⌉ + 1 = 35. The screeplot suggests 5 clusters corresponding to the 5 columns in the factor loading matrix. However, due to the symmetry of the actual factors in Zt, each column and its negative value constitute a cluster. So the 10 clusters actually reflect the 5 factors since each is paired with its negative counterpart. It is worth noting that in Figure 1 we had a larger sample size of 100,000 and a stricter quantile threshold of 0.998, resulting in a smaller number of observations considered as extremes, but with an empirical distribution visibly closer to the prescribed asymptotic discrete distribution. Therefore the simulation scenario considered here is more difficult. Figure 9 illustrates the convergence of the method. While the spherical k-means method of Janßen and Wan (2020) performs slightly better than our spectral clustering for Nn ≤ 200, our proposed method appears better, with much smaller variability, for a larger number of extremes. The choice kn of nearest neighbors did not appreciably impact the performance of spectral clustering across the different sample sizes.

Figure 9: Estimation error of the matrix of atoms of the spectral measure of the symmetric MA(3) model. The sample sizes were n = {1000, 5000, 25000, 125000}, giving Nn = {100, 200, 400, 800}. We used kn = ⌈Nn/(C log Nn)⌉ + 1 nearest neighbor graphs.

6.3 Air pollution data

We revisit the data analyzed by Heffernan and Tawn (2004) and Janßen and Wan (2020). It is available in the R package texmex and consists of daily measurements of five air pollutants in the city of Leeds, UK. It was collected between 1994 and 1998, and split into summer and winter months yielding a total of 578 and 532 observations respectively. Following standard practice in multivariate extremes data analysis we standardize the marginal distributions of the data to focus on the extremal dependence. More specifically, we transform the marginals of the original observations Xi as in Janßen and Wan (2020), i.e., we let Yij := 1/{1 - Fnj(Xij)}, where Fnj(x) = (1/n) Σ_{i=1}^n 1(Xij ≤ x) denotes the jth marginal empirical cumulative distribution function, x ∈ R and j = 1, . . . , d. We then proceed to define the extremal observations as the 10% of the transformed observations {Yi} with largest Euclidean norm and analyze their angular components with our algorithm.
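This marginal standardization and the selection of the extremal sample amount to only a few lines of code. In the sketch below the empirical cdf is scaled by n + 1 instead of n to avoid division by zero at the componentwise maxima; this adjustment, and the helper names, are our own assumptions rather than the exact choice made in the paper.

```python
import numpy as np

def rank_transform(X):
    """Marginal transform Y_ij = 1 / (1 - F_nj(X_ij)) to approximately unit-Pareto margins."""
    n, d = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1   # marginal ranks in 1..n
    return 1.0 / (1.0 - ranks / (n + 1.0))

def extremal_angles(Y, frac=0.10):
    """Angular parts of the fraction frac of observations with the largest Euclidean norm."""
    R = np.linalg.norm(Y, axis=1)
    keep = R >= np.quantile(R, 1.0 - frac)
    return Y[keep] / R[keep, None]
```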
We analyze this data using spectral clustering with the exponential kernel and s = 1, as in the simulated data. The screeplots in Figure 10 suggest that one should consider 5 clusters for this data. Figure 11 shows the estimated cluster centers cj for j = 1, . . . , 5. We note that the elbow plot considered by Janßen and Wan (2020) led the authors to use 4 or 5 clusters in their article. Our results for 5 clusters are consistent with their analysis. Specifically, the normalized cluster centers in the heat map of Figure 11 show that the extremes of the five air pollutants act mostly independently. Looking a bit more closely, both NO and NO2 share common strength in clusters 2 and 3, which is much stronger in winter than in summer. PM10 also shares a common source (cluster 2) with NO and NO2, which is more pronounced in winter than summer. For the O3 and NO2 pollutants, we examined time lagged dependence by applying the spectral clustering algorithm to the vector Xt = (Xt, Xt-1, Xt-2, Xt-3)T, where Xt represents either the measured value of O3 or NO2 on day t. The resulting heat plots for the cluster centers (40) are displayed in Figures 12 (O3) and 13 (NO2). The super and sub diagonals reflect some extremal dependence at time lag 1 for O3 in both summer and winter. This dependence mostly dissipates after one day. The situation for NO2 is a bit more complex. One still discerns some extremal dependence at a one day lag, as indicated by the high-temperature regions in the heat maps along the diagonal and subdiagonal. However, some clusters have similar shading for their centers of mass, e.g., clusters 1 and 4 for winter, which suggests poor delineation between the clusters. In addition, there is a stronger day effect in the summer than winter for NO2, and the dependence does not necessarily die out after a one day lag as in the O3 case.

Figure 10: Screeplots of the kernel matrix of the air pollution data extremes (combined, summer and winter data) obtained with the exponential kernel and bandwidth parameter s = 1.

Figure 11: Five dimensional extremes from the air pollution summer (a) and winter (b) data. The heat maps show the estimated cluster centers using spectral clustering with 5 clusters and 9-nearest neighbors.

Figure 12: Four dimensional time series data constructed with lags 0-3 of O3 for the summer and winter data respectively. The heat maps show the estimated cluster centers using spectral clustering with 5 clusters and 9-nearest neighbors.

Figure 13: Four dimensional time series data constructed with lags 0-3 of NO2 for the summer and winter data respectively. The heat maps show the estimated cluster centers using spectral clustering with 5 clusters and 9-nearest neighbors.

7. Discussion

In this work we introduced a spectral clustering approach for learning the angular measure of multivariate extremes. We proved that this approach leads to consistent clustering for a natural linear factor model and showed the good finite sample performance of our methods in numerical experiments. The encouraging results suggest the method might be applied in more general contexts. We are particularly interested in exploring two types of extensions. First, high dimensional scenarios where the dimension of the extremes d might be larger than the number of observed extremes Nn. This would require introducing appropriate notions of sparsity and regularization. Second, it seems natural to investigate generative models that lead to continuous angular measures in the limit.
7. Discussion

In this work we introduced a spectral clustering approach for learning the angular measure of multivariate extremes. We proved that this approach leads to consistent clustering for a natural linear factor model and showed the good finite sample performance of our methods in numerical experiments. The encouraging results suggest the method might be applied in more general contexts. We are particularly interested in exploring two types of extensions. First, high-dimensional scenarios in which the dimension $d$ of the extremes might be larger than the number of observed extremes $N_n$; this would require introducing appropriate notions of sparsity and regularization. Second, it seems natural to investigate generative models that lead to continuous angular measures in the limit. This scenario would require more general definitions of extremal clusters and a different analysis of the convergence of k-nearest neighbor graphs.

Acknowledgments

This research was partially supported by NSF grants DMS-2015379 (Avella Medina and Davis) at Columbia and DMS-2015242 (Samorodnitsky) at Cornell.

Appendix

Before proving Lemmas 10 and 11 we will give a result regarding random partitions of uniform random variables that we will leverage, as the continuity of $F_{T_n}$ implies that $F_{T_n}(T_n) \sim \mathrm{Unif}(0, 1)$. We remind the reader that in our proofs $c > 0$ represents a finite and non-zero constant whose value may change from line to line.

Lemma 16 Let $U_1, \dots, U_N \overset{iid}{\sim} \mathrm{Unif}(0, 1)$ and consider the random partition of the unit interval into the intervals $I_{j,N} = \big[\frac{j-1}{m_N}, \frac{j}{m_N}\big)$, where $m_N = N/(\tau \log N)$, $\tau > 1$ and $j = 1, \dots, m_N$. Then, with probability at least $1 - \frac{N^{1-\tau}}{\tau \log N}\big(1 + N^{-0.2\tau}\big)$,
(i) Every $I_{j,N}$ contains at least one of the variables $U_1, \dots, U_N$.
(ii) No $I_{j,N}$ contains more than $3\tau \log N$ of the variables $U_1, \dots, U_N$.

Proof Consider the event $E_{N,1} = \{$every $I_{j,N}$ contains at least one of the variables $U_1, \dots, U_N\}$ and note that a union bound gives
$$P(E_{N,1}) \ge 1 - \sum_{j=1}^{m_N} P(U_k \notin I_{j,N},\ k = 1, \dots, N) \ge 1 - m_N e^{-N/m_N} = 1 - \frac{N^{1-\tau}}{\tau \log N}. \qquad (46)$$
Now consider the event $E_{N,2} = \{$no $I_{j,N}$ contains more than $3\tau \log N$ of the variables $U_1, \dots, U_N\}$. It follows again from a union bound that
$$P(E_{N,2}) \ge 1 - \sum_{j=1}^{m_N} P\big(I_{j,N} \text{ contains more than } 3\tau \log N \text{ of } U_1, \dots, U_N\big) = 1 - m_N P(S_N > 3\tau \log N),$$
where $S_N \sim \mathrm{Bin}(N, 1/m_N)$. Invoking Bernstein's inequality, and noting that the resulting exponent equals $1.2\,\tau \log N$, we see that
$$P(E_{N,2}) \ge 1 - m_N \exp\left(-\frac{\tfrac{1}{2}\,(3\tau \log N - \tau \log N)^2}{\tau \log N + 2\tau \log N/3}\right) = 1 - \frac{N^{-(1.2\tau - 1)}}{\tau \log N}. \qquad (47)$$
Combining (46) and (47) shows that (i) and (ii) hold with the desired probability.
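As a quick numerical illustration of Lemma 16 (a sketch only; the value $\tau = 2$, the sample sizes and the number of repetitions below are arbitrary choices, not taken from the paper), one can estimate how often both properties of the random partition hold simultaneously:

```python
import numpy as np


def partition_properties_hold(N, tau=2.0, rng=None):
    """Check properties (i) and (ii) of Lemma 16 for one sample of N uniforms."""
    rng = np.random.default_rng() if rng is None else rng
    m = int(N / (tau * np.log(N)))                    # number of intervals m_N (rounded down)
    idx = (rng.uniform(size=N) * m).astype(int)       # interval index of each U_k
    counts = np.bincount(idx, minlength=m)
    return counts.min() >= 1 and counts.max() <= 3 * tau * np.log(N)


rng = np.random.default_rng(0)
for N in (1_000, 10_000, 100_000):
    freq = np.mean([partition_properties_hold(N, rng=rng) for _ in range(200)])
    print(N, freq)   # the empirical frequency should approach 1 as N grows
```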
Proof of Lemma 10 We prove the lemma for positive $Z$. The same type of arguments work in the symmetric case and are therefore omitted. Note that we can write, for $t > 0$, $f_{T_n}(t) = M_n(t)/D_n$, where
$$M_n(t) = \frac{w_1^2}{c_p u_n} \int z_1 f_Z(z_1)\, f_Z(z_2) \cdots f_Z(z_{p-1})\, f_Z\!\Big(\big(t z_1 w_1^2/u_n - (c_2 z_2 + \cdots + c_{p-1} z_{p-1})\big)/c_p\Big)$$
$$\times\ \mathbf{1}\Big(z_1 > u_n/w_1^{1/\alpha},\ (a_{11} z_1 + a_{12} z_2 + \cdots + a_{1p} z_p)^2 + (a_{21} z_1 + a_{22} z_2 + \cdots + a_{2p} z_p)^2 > u_n^2\Big)\, dz_1 \cdots dz_{p-1}, \qquad (48)$$
$$D_n = P\Big((a_{11} Z_1 + \cdots + a_{1p} Z_p)^2 + (a_{21} Z_1 + \cdots + a_{2p} Z_p)^2 > u_n^2,\ Z_1 > u_n/w_1^{1/\alpha}\Big),$$
with $z_p = \big(t z_1 w_1^2/u_n - (c_2 z_2 + \cdots + c_{p-1} z_{p-1})\big)/c_p$ and $c_i = a_{2i}$, $i = 1, \dots, p$. We already know that
$$D_n \sim c\, u_n^{-\alpha}, \qquad n \to \infty. \qquad (49)$$
Next, from (36), $\sup_z f_Z(z) = M < \infty$, and we conclude by (37) that
$$M_n(t) \le \frac{M w_1^2}{c_p u_n} \int_{u_n/w_1^{1/\alpha}}^{\infty} z_1 f_Z(z_1)\, dz_1 \le c\, u_n^{-\alpha}$$
for all $n$ large enough. Hence there exists $G \in (0, \infty)$ such that for all $n$ large enough,
$$f_{T_n}(t) \le G \quad \text{for all } t. \qquad (50)$$
This shows (i). Let us now turn to claim (ii); for concreteness consider $0 < t \le 1$. Note that, for large $n$, the indicator in (48) is bounded from below by the indicator of the set $E = \{C^{-1} u_n < z_1 < C u_n,\ |z_i| \le 1,\ i = 2, \dots, p-1\}$ for some large $C$. Then, on $E$, the argument of the last function $f_Z$ in (48) is within a compact interval, so we obtain
$$M_n(t) \ge \frac{c\, w_1^2}{c_p u_n} \int_{C^{-1} u_n}^{C u_n} z_1 f_Z(z_1)\, dz_1 \ge c\, u_n^{-\alpha},$$
where the last relation follows from a direct application of (37). Along with (49) this establishes (ii). Finally, note that
$$M_n(t) \le \frac{w_1^2}{c_p u_n} \int_{u_n/w_1^{1/\alpha}}^{\infty} z_1 f_Z(z_1) \int_{\mathbb{R}^{p-2}} f_Z(z_2) \cdots f_Z(z_{p-1})\, f_Z\!\Big(\big(t z_1 w_1^2/u_n - (c_2 z_2 + \cdots + c_{p-1} z_{p-1})\big)/c_p\Big)\, dz_2 \cdots dz_{p-1}\, dz_1$$
$$= \frac{w_1^2 u_n}{c_p} \int_{1/w_1^{1/\alpha}}^{\infty} z_1 f_Z(u_n z_1) \int_{\mathbb{R}^{p-2}} f_Z(z_2) \cdots f_Z(z_{p-1})\, f_Z\!\Big(\big(t z_1 w_1^2 - (c_2 z_2 + \cdots + c_{p-1} z_{p-1})\big)/c_p\Big)\, dz_2 \cdots dz_{p-1}\, dz_1,$$
after the change of variables $z_1 \mapsto u_n z_1$. Using the upper bound in (37) it is easy to see that for some $c > 0$ and sufficiently large $t$,
$$\int_{\mathbb{R}^{p-2}} f_Z(z_2) \cdots f_Z(z_{p-1})\, f_Z\!\Big(\big(t - (c_2 z_2 + \cdots + c_{p-1} z_{p-1})\big)/c_p\Big)\, dz_2 \cdots dz_{p-1} \le c\, t^{-(\alpha+1)}. \qquad (51)$$
Indeed, the integral is, up to a constant, equal to the density, evaluated at $t$, of a linear combination of $Z_1, \dots, Z_{p-1}$. Therefore, for all $t$ large enough, uniformly in $n$,
$$M_n(t) \le c\, u_n \int_{1/w_1^{1/\alpha}}^{\infty} z_1 f_Z(u_n z_1)\, (t z_1)^{-(\alpha+1)}\, dz_1 \le c\, u_n^{-\alpha}\, t^{-(\alpha+1)},$$
where once again we have used the upper bound in (37). Together with (49) this shows the upper bound in (iii). The lower bound in (iii) can be established in an identical way using the lower bound in (37).

Proof of Lemma 11 It follows from Lemma 16 and Lemma 10 (ii) that, outside of an event $\Omega_n^{(1)}$ with $P\big(\Omega_n^{(1)}\big) \to 0$, each one of the intervals $I_{i,n}$ contains at least one of the points
$$T_{ni} = \frac{a_{21} Z^{(\cdot,1)}_{2,i} + \cdots + a_{p1} Z^{(\cdot,1)}_{p,i}}{w_1^2 Z^{(\cdot,1)}_{1,i}/u_n}, \qquad i = 1, \dots, N^{(1)}_n,$$
and none of the intervals contains more than $3\tau \log N^{(1)}_n$ of these points. Note that (50) implies that
$$\frac{\partial}{\partial t} F^{-1}_{T_n}(t) = \frac{1}{f_{T_n}\big(F^{-1}_{T_n}(t)\big)} \ge \frac{1}{G}, \qquad t \in [0, 1],$$
and hence, by the fundamental theorem of calculus, the lengths of the intervals $I_{i,n}$ satisfy
$$|I_{i,n}| \ge \ell_n/G, \qquad i = 1, \dots, m_n, \qquad (52)$$
where $\ell_n = 1/m_n$. Since the conditional law of $(Z_1/u_n, Z_2, \dots, Z_p)$ given (35) converges weakly, as $n \to \infty$, to the law of $(W_\alpha/w_1, Z_2, \dots, Z_p)$ as defined in Theorem 2, we see that $F_{T_n}$ converges weakly to the law of $\big(c_2 Z_2 + \cdots + c_p Z_p\big)\big/\big(w_1 W_\alpha\big)$. It follows that the values $F_{T_n}(t_0)$ converge, as $n \to \infty$, to a finite limit. Therefore, there is $0 < \delta < 1$ such that $F^{-1}_{T_n}\big((i-1)/m_n\big) \ge t_0$ for all $n$ large enough and all $i \ge (1-\delta) m_n$. We conclude by Lemma 10 (iii) that for such $n$ and $i$ and some constant $D \ge 1$,
$$\ell_n = F_{T_n}\big(F^{-1}_{T_n}(i/m_n)\big) - F_{T_n}\big(F^{-1}_{T_n}((i-1)/m_n)\big) \in \big[D^{-1}, D\big] \int_{F^{-1}_{T_n}((i-1)/m_n)}^{F^{-1}_{T_n}(i/m_n)} t^{-(\alpha+1)}\, dt. \qquad (53)$$
Furthermore,
$$\int_{F^{-1}_{T_n}((i-1)/m_n)}^{F^{-1}_{T_n}(i/m_n)} t^{-(\alpha+1)}\, dt \ge \Big(F^{-1}_{T_n}(i/m_n)\Big)^{-(\alpha+1)} \Big(F^{-1}_{T_n}(i/m_n) - F^{-1}_{T_n}((i-1)/m_n)\Big), \qquad (54)$$
while leveraging again Lemma 10 (iii) we see that
$$1 - \frac{i}{m_n} = \int_{F^{-1}_{T_n}(i/m_n)}^{\infty} f_{T_n}(t)\, dt \le D \int_{F^{-1}_{T_n}(i/m_n)}^{\infty} t^{-(\alpha+1)}\, dt = \frac{D}{\alpha} \Big(F^{-1}_{T_n}(i/m_n)\Big)^{-\alpha}. \qquad (55)$$
Combining (54) and (55), we conclude that
$$\int_{F^{-1}_{T_n}((i-1)/m_n)}^{F^{-1}_{T_n}(i/m_n)} t^{-(\alpha+1)}\, dt \ge c \left(\frac{m_n - i}{m_n}\right)^{(\alpha+1)/\alpha} \Big(F^{-1}_{T_n}(i/m_n) - F^{-1}_{T_n}((i-1)/m_n)\Big),$$
and so by (53),
$$\ell_n \ge c \left(\frac{m_n - i}{m_n}\right)^{(\alpha+1)/\alpha} |I_{i,n}|.$$
Since an upper bound can be obtained in the same way, we conclude that for some $D_1 \ge 1$, for all $n$ large enough and all $i \ge (1-\delta) m_n$,
$$D_1^{-1}\, \ell_n \left(\frac{m_n}{m_n - i}\right)^{(\alpha+1)/\alpha} \le |I_{i,n}| \le D_1\, \ell_n \left(\frac{m_n}{m_n - i}\right)^{(\alpha+1)/\alpha}. \qquad (56)$$
It follows from (56) that for $K_0$ fixed, $|i - j| \le K$ and all $(1-\delta) m_n \le i, j \le m_n - K_0$,
$$\frac{|I_{i,n}|}{|I_{j,n}|} \ge D_1^{-2} \left(\frac{m_n - j}{m_n - i}\right)^{(\alpha+1)/\alpha} \ge C_1, \qquad (57)$$
where the last inequality holds for sufficiently large $m_n$, for a constant $C_1 > 0$ depending only on $D_1$, $K$ and $K_0$. A similar argument can be used to get the upper bound
$$\frac{|I_{i,n}|}{|I_{j,n}|} \le D_1^{2} \left(\frac{m_n - j}{m_n - i}\right)^{(\alpha+1)/\alpha} \le C_2. \qquad (58)$$
It follows that for $K_0$ fixed, $|i - j| \le K$ and large enough $n$,
$$|I_{i,n}| \le C_2\, |I_{j,n}|. \qquad (59)$$
Therefore, using (59), $|i - j| \le K$ and choosing $K \ge 2 C_2/C_1$,
$$|I_{i,n}| + |I_{i+1,n}| \le 2 C_2 |I_{i-1,n}| \le K |I_{i-k,n}|, \qquad (60)$$
where the last inequality holds for all $k = 1, \dots, K$. Therefore, dividing (60) by $K$ and summing over $k$ yields
$$|I_{i,n}| + |I_{i+1,n}| < \sum_{k=1}^{K} |I_{i-k,n}|. \qquad (61)$$
Note that (61) is sufficient to guarantee that any point in $I_{i,n}$ is closer to any point in $I_{i+1,n}$ than to any point in an interval $I_{j,n}$ with $j < i - K$. A similar argument shows that any point in $I_{i,n}$ is closer to any point in $I_{i-1,n}$ than to any point in an interval $I_{j,n}$ with $j > i + K$. We conclude that, outside of the event $\Omega^{(1)}_n$, in a $k_n$-NN graph with
$$k_n > 3(K+1)\, \tau \log N^{(1)}_n, \qquad (62)$$
all points $T_{nj}$, $j = 1, \dots, N^{(1)}_n$, within $I_{i,n}$ for some $i$ in the range $(1-\delta) m_n \le i \le m_n - K_0$ will be connected both to each other and to such a point in each $I_{i-1,n}$ and $I_{i+1,n}$. The next observation to make is that, as long as $\delta$ is small enough, the sequence $F^{-1}_{T_n}(1-\delta)$ is bounded from above. Therefore, by Lemma 10 (ii), uniformly in large enough $n$, the density $f_{T_n}$ is bounded from below by, say, $a > 0$ on the interval $\big(0, F^{-1}_{T_n}(1-\delta)\big)$. Therefore, for all large enough $n$,
$$|I_{i,n}| \le \ell_n/a, \qquad 1 \le i \le (1-\delta) m_n. \qquad (63)$$
To see this, note that
$$\frac{\partial}{\partial t} F^{-1}_{T_n}(t) = \frac{1}{f_{T_n}\big(F^{-1}_{T_n}(t)\big)} \le \frac{1}{a}, \qquad t \in [0, 1-\delta],$$
and hence, by the fundamental theorem of calculus, $|I_{i,n}| = F^{-1}_{T_n}(i/m_n) - F^{-1}_{T_n}\big((i-1)/m_n\big) \le \ell_n/a$ for $1 \le i \le (1-\delta) m_n$. It follows from (52) and (63) that if $K > 2G/a$, then any point in $I_{i,n}$ is closer to any point in $I_{i-1,n}$ and in $I_{i+1,n}$ than to any point in an interval $I_{j,n}$ with $j < i - K$ or with $j > i + K$. Therefore, outside of the event $\Omega^{(1)}_n$, in a $k_n$-NN graph satisfying (62), all points $T_{nj}$, $j = 1, \dots, N^{(1)}_n$, within $I_{i,n}$ in the range $1 \le i \le (1-\delta) m_n$ will be connected both to each other and to such a point in each $I_{i-1,n}$ and $I_{i+1,n}$. Indeed, to show this it suffices to show again that (61) holds true in the range $1 \le i \le (1-\delta) m_n$. It is easy to see that (52), (63) and $K > 2G/a$ imply
$$|I_{i,n}| + |I_{i+1,n}| \le \frac{2\ell_n}{a} \le \sum_{k=1}^{K} |I_{i-k,n}|.$$
Finally, it is obvious that if $K > K_0$, then outside of the same event $\Omega^{(1)}_n$, in a $k_n$-NN graph satisfying (62), all points $T_{nj}$, $j = 1, \dots, N^{(1)}_n$, within $I_{i,n}$ in the range $m_n - K_0 < i \le m_n$ will be connected both to each other and to such a point in each $I_{i-1,n}$ and $I_{i+1,n}$. Summarizing the above discussion, we conclude that outside of the event $\Omega^{(1)}_n$, in a $k_n$-NN graph satisfying (62) with $K$ large enough, all points $T_{nj}$, $j = 1, \dots, N^{(1)}_n$, within $I_{i,n}$ in the entire range $1 \le i \le m_n$ will be connected both to each other and to such a point in each $I_{i-1,n}$ and $I_{i+1,n}$. In particular, the $k_n$-NN graph will be connected.

We now translate this discussion to the random vectors $M^{(i)}$, $i = 1, \dots, N^{(1)}_n$. We define intervals along the vector $b$ by $J_{i,n} = I_{i,n}\, b$, $i = 1, \dots, m_n$. Then, outside of the event $\Omega^{(1)}_n$, each one of these intervals contains at least one of the points $M^{(i)}$, $i = 1, \dots, N^{(1)}_n$, and none of the intervals contains more than $3\tau \log N^{(1)}_n$ of these points. By (52) the lengths of these intervals satisfy, for some $G_1 > 0$,
$$|J_{i,n}| \ge \ell_n/G_1, \qquad i = 1, \dots, m_n.$$
We finally note that by (23), with probability tending to one, $N_n \ge C n u_n^{-\alpha}$, and therefore choosing $G > 0$ and $n$ large enough ensures that (62) holds provided $k_n > G \log n$. This concludes the proof.

Proof of Theorem 12 Lemma 11 gives us the connectivity of the extremal $k_n$-NN graph for $k_n$ satisfying (62) with $K$ large enough. The next step is to understand by how much the points $M^{(i)}$, $i = 1, \dots, N^{(1)}_n$, are shifted by adding to them $D^{(i)}$, $i = 1, \dots, N^{(1)}_n$, in (33). Denote $\Omega^{(2)}_n = B^c_n$, as defined in Lemma 5. Then $P\big(\Omega^{(2)}_n\big) \to 0$ as $n \to \infty$, and it is elementary to check that outside of $\Omega^{(2)}_n$ we have $\|D^{(i)}\| \le c\, h_n^2/u_n$ for all $i = 1, \dots, N^{(1)}_n$. Recall that by the choice of $h_n$ we have $h_n^2/u_n = o(\ell_n)$ as $n \to \infty$. If we define new sets by
$$\tilde{J}_{i,n} = \big\{M^{(j)} + D^{(j)} : M^{(j)} \in J_{i,n}\big\}, \qquad i = 1, \dots, m_n,$$
then it follows immediately that for large $n$, outside of the event $\Omega^{(2)}_n$, the new sets have the property described by Lemma 11, perhaps with a larger $K_0$. We already know that this means that for large $n$, outside of $\Omega^{(1)}_n \cup \Omega^{(2)}_n$, the extremal $k_n$-NN graph with $k_n$ satisfying (62) with $K$ large enough is connected.

References

Bojan Basrak, Richard A. Davis, and Thomas Mikosch. A characterization of multivariate regular variation. Annals of Applied Probability, 12(3):908–920, 2002.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Leonard Breiman. On some limit theorems similar to the arc-sin law. Theory of Probability & Its Applications, 10(2):323–331, 1965.

Emilie Chautru. Dimension reduction in multivariate extreme value analysis. Electronic Journal of Statistics, 9(1):383–418, 2015.
Stéphan Clémençon, Hamid Jalalzai, Stéphane Lhaut, Anne Sabourin, and Johan Segers. Concentration bounds for the empirical angular measure with statistical learning applications. Bernoulli, 29(4):2797–2827, 2023.

Daniel Cooley and Emeric Thibaud. Decompositions of dependence for high-dimensional extremes. Biometrika, 106(3):587–604, 2019.

Richard A. Davis and Sidney I. Resnick. Basic properties and prediction of max-ARMA processes. Advances in Applied Probability, 21(4):781–803, 1989.

Anthony C. Davison and Raphaël Huser. Statistics of extremes. Annual Review of Statistics and Its Application, 2:203–235, 2015.

Inderjit S. Dhillon and Dharmendra S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001.

Holger Drees and Anne Sabourin. Principal component analysis for multivariate extremes. Electronic Journal of Statistics, 15(1):908–943, 2021.

Paul Embrechts and Charles M. Goldie. On closure and factorization properties of subexponential and related distributions. Journal of the Australian Mathematical Society, 29(2):243–256, 1980.

Sebastian Engelke and Jevgenijs Ivanovs. Sparse structures for multivariate extremes. Annual Review of Statistics and Its Application, 8:241–270, 2021.

Nadine Gissibl and Claudia Klüppelberg. Max-linear models on directed acyclic graphs. Bernoulli, 24(4A):2693–2720, 2018.

Nicolas Goix, Anne Sabourin, and Stéphan Clémençon. Learning the dependence structure of rare events: a non-asymptotic study. In Conference on Learning Theory, pages 843–860. PMLR, 2015.

Nicolas Goix, Anne Sabourin, and Stéphan Clémençon. Sparse representation of multivariate extremes with applications to anomaly detection. Journal of Multivariate Analysis, 161:12–31, 2017.

Peter Hall, Liang Peng, and Qiwei Yao. Moving-maximum models for extrema of time series. Journal of Statistical Planning and Inference, 103(1-2):51–63, 2002.

Janet E. Heffernan and Jonathan A. Tawn. A conditional approach for multivariate extreme values (with discussion). Journal of the Royal Statistical Society: Series B, 66(3):497–546, 2004.

Bruce Hendrickson and Robert Leland. An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM Journal on Scientific Computing, 16(2):452–469, 1995.

Anja Janßen and Phyllis Wan. k-means clustering of extremes. Electronic Journal of Statistics, 14(1):1211–1233, 2020.

Claudia Klüppelberg and Steffen Lauritzen. Bayesian networks for max-linear models. In Network Science, pages 79–97. Springer, 2019.

Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1):215–237, 2015.

Nicolas Meyer and Olivier Wintenberger. Sparse regular variation. Advances in Applied Probability, 53(4):1115–1148, 2021.

Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.

Sidney I. Resnick. Heavy-tail phenomena: probabilistic and statistical modeling. Springer Science & Business Media, 2007.

Sidney I. Resnick. Extreme values, regular variation and point processes. Springer, New York, 2018.

Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Annals of Statistics, 39(4):1878–1915, 2011.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
Emma S. Simpson, Jennifer L. Wadsworth, and Jonathan A. Tawn. Determining the dependence structure of multivariate extremes. Biometrika, 107(3):513–532, 2020.

Rafael Van Driessche and Dirk Roose. An improved spectral bisection algorithm and its application to dynamic load balancing. Parallel Computing, 21(1):29–48, 1995.

Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Zhixin Zhou and Arash A. Amini. Analysis of spectral clustering algorithms for community detection: the general bipartite setting. Journal of Machine Learning Research, 20(1):1774–1820, 2019.